[jira] [Updated] (HDFS-1314) dfs.blocksize accepts only absolute value

2011-12-30 Thread Eli Collins (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins updated HDFS-1314:
--

Summary: dfs.blocksize accepts only absolute value  (was: dfs.block.size 
accepts only absolute value)

> dfs.blocksize accepts only absolute value
> -
>
> Key: HDFS-1314
> URL: https://issues.apache.org/jira/browse/HDFS-1314
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Karim Saadah
>Assignee: Sho Shimauchi
>Priority: Minor
>  Labels: newbie
> Attachments: hdfs-1314.txt, hdfs-1314.txt, hdfs-1314.txt
>
>
> Using "dfs.block.size=8388608" works 
> but "dfs.block.size=8mb" does not.
> Using "dfs.block.size=8mb" should throw some WARNING on NumberFormatException.
> (http://pastebin.corp.yahoo.com/56129)
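
One way to accept values such as "8mb" alongside plain byte counts is to parse an optional binary suffix before converting to a number. The sketch below is a standalone illustration and not the attached patch; the class name `BlockSizeParser` and the accepted suffixes (k, m, g, optionally followed by b, case-insensitive) are assumptions.

```java
import java.util.Locale;

/** Hypothetical helper: parses "8388608", "8m" or "8mb" into a byte count. */
final class BlockSizeParser {
    private BlockSizeParser() {}

    static long parse(String value) {
        String s = value.trim().toLowerCase(Locale.ROOT);
        // Drop an optional trailing "b" so that "8mb" and "8m" are equivalent.
        if (s.endsWith("b")) {
            s = s.substring(0, s.length() - 1);
        }
        if (s.isEmpty()) {
            throw new NumberFormatException("For input string: \"" + value + "\"");
        }
        long multiplier = 1L;
        switch (s.charAt(s.length() - 1)) {
            case 'k': multiplier = 1L << 10; break;
            case 'm': multiplier = 1L << 20; break;
            case 'g': multiplier = 1L << 30; break;
            default: break; // no suffix: the value is an absolute byte count
        }
        if (multiplier != 1L) {
            s = s.substring(0, s.length() - 1);
        }
        // Long.parseLong throws NumberFormatException for anything non-numeric.
        long number = Long.parseLong(s.trim());
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        return Math.multiplyExact(number, multiplier);
    }

    public static void main(String[] args) {
        System.out.println(parse("8388608"));
        System.out.println(parse("8mb"));
    }
}
```

A caller that wants the WARNING requested above could catch `NumberFormatException` around `parse` and log the offending value before falling back to a default.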

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-2739) SecondaryNameNode doesn't start up

2011-12-30 Thread Harsh J (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-2739:
--

Affects Version/s: 0.24.0

> SecondaryNameNode doesn't start up
> --
>
> Key: HDFS-2739
> URL: https://issues.apache.org/jira/browse/HDFS-2739
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 0.24.0
>Reporter: Sho Shimauchi
>Priority: Critical
>
> Built a 0.24-SNAPSHOT tar from today, used a general config, started NN/DN, 
> but SNN won't come up with following error:
> {code}
> 11/12/31 12:13:14 ERROR namenode.SecondaryNameNode: Throwable Exception in 
> doCheckpoint
> java.lang.RuntimeException: java.lang.NoSuchFieldException: versionID
>   at org.apache.hadoop.ipc.RPC.getProtocolVersion(RPC.java:154)
>   at 
> org.apache.hadoop.ipc.WritableRpcEngine$Invocation.<init>(WritableRpcEngine.java:112)
>   at 
> org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:226)
>   at $Proxy9.getTransationId(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.getTransactionID(NamenodeProtocolTranslatorPB.java:185)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.countUncheckpointedTxns(SecondaryNameNode.java:625)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.shouldCheckpointBasedOnCount(SecondaryNameNode.java:633)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:386)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:356)
>   at java.lang.Thread.run(Thread.java:680)
> Caused by: java.lang.NoSuchFieldException: versionID
>   at java.lang.Class.getField(Class.java:1520)
>   at org.apache.hadoop.ipc.RPC.getProtocolVersion(RPC.java:150)
>   ... 9 more
> java.lang.RuntimeException: java.lang.NoSuchFieldException: versionID
>   at org.apache.hadoop.ipc.RPC.getProtocolVersion(RPC.java:154)
>   at 
> org.apache.hadoop.ipc.WritableRpcEngine$Invocation.<init>(WritableRpcEngine.java:112)
>   at 
> org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:226)
>   at $Proxy9.getTransationId(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.getTransactionID(NamenodeProtocolTranslatorPB.java:185)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.countUncheckpointedTxns(SecondaryNameNode.java:625)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.shouldCheckpointBasedOnCount(SecondaryNameNode.java:633)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:386)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:356)
>   at java.lang.Thread.run(Thread.java:680)
> Caused by: java.lang.NoSuchFieldException: versionID
>   at java.lang.Class.getField(Class.java:1520)
>   at org.apache.hadoop.ipc.RPC.getProtocolVersion(RPC.java:150)
>   ... 9 more
> 11/12/31 12:13:14 INFO namenode.SecondaryNameNode: SHUTDOWN_MSG: 
> /
> SHUTDOWN_MSG: Shutting down SecondaryNameNode at sho-mba.local/192.168.11.2
> /
> {code}
> full error log: http://pastebin.com/mSaVbS34





[jira] [Commented] (HDFS-2739) SecondaryNameNode doesn't start up

2011-12-30 Thread Harsh J (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177883#comment-13177883
 ] 

Harsh J commented on HDFS-2739:
---

It also looks like a protocol or call somewhere has a typo in it, i.e. 
{{getTransationId}}. We could perhaps fix that as part of this issue.

> SecondaryNameNode doesn't start up
> --
>
> Key: HDFS-2739
> URL: https://issues.apache.org/jira/browse/HDFS-2739
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Sho Shimauchi
>Priority: Critical
>
> Built a 0.24-SNAPSHOT tar from today, used a general config, started NN/DN, 
> but SNN won't come up with following error:
> {code}
> 11/12/31 12:13:14 ERROR namenode.SecondaryNameNode: Throwable Exception in 
> doCheckpoint
> java.lang.RuntimeException: java.lang.NoSuchFieldException: versionID
>   at org.apache.hadoop.ipc.RPC.getProtocolVersion(RPC.java:154)
>   at 
> org.apache.hadoop.ipc.WritableRpcEngine$Invocation.<init>(WritableRpcEngine.java:112)
>   at 
> org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:226)
>   at $Proxy9.getTransationId(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.getTransactionID(NamenodeProtocolTranslatorPB.java:185)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.countUncheckpointedTxns(SecondaryNameNode.java:625)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.shouldCheckpointBasedOnCount(SecondaryNameNode.java:633)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:386)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:356)
>   at java.lang.Thread.run(Thread.java:680)
> Caused by: java.lang.NoSuchFieldException: versionID
>   at java.lang.Class.getField(Class.java:1520)
>   at org.apache.hadoop.ipc.RPC.getProtocolVersion(RPC.java:150)
>   ... 9 more
> java.lang.RuntimeException: java.lang.NoSuchFieldException: versionID
>   at org.apache.hadoop.ipc.RPC.getProtocolVersion(RPC.java:154)
>   at 
> org.apache.hadoop.ipc.WritableRpcEngine$Invocation.<init>(WritableRpcEngine.java:112)
>   at 
> org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:226)
>   at $Proxy9.getTransationId(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.getTransactionID(NamenodeProtocolTranslatorPB.java:185)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.countUncheckpointedTxns(SecondaryNameNode.java:625)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.shouldCheckpointBasedOnCount(SecondaryNameNode.java:633)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:386)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:356)
>   at java.lang.Thread.run(Thread.java:680)
> Caused by: java.lang.NoSuchFieldException: versionID
>   at java.lang.Class.getField(Class.java:1520)
>   at org.apache.hadoop.ipc.RPC.getProtocolVersion(RPC.java:150)
>   ... 9 more
> 11/12/31 12:13:14 INFO namenode.SecondaryNameNode: SHUTDOWN_MSG: 
> /
> SHUTDOWN_MSG: Shutting down SecondaryNameNode at sho-mba.local/192.168.11.2
> /
> {code}
> full error log: http://pastebin.com/mSaVbS34





[jira] [Created] (HDFS-2739) SecondaryNameNode doesn't start up

2011-12-30 Thread Sho Shimauchi (Created) (JIRA)
SecondaryNameNode doesn't start up
--

 Key: HDFS-2739
 URL: https://issues.apache.org/jira/browse/HDFS-2739
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Sho Shimauchi
Priority: Critical


Built a 0.24-SNAPSHOT tar from today, used a general config, started NN/DN, but 
SNN won't come up, failing with the following error:

{code}
11/12/31 12:13:14 ERROR namenode.SecondaryNameNode: Throwable Exception in 
doCheckpoint
java.lang.RuntimeException: java.lang.NoSuchFieldException: versionID
at org.apache.hadoop.ipc.RPC.getProtocolVersion(RPC.java:154)
at 
org.apache.hadoop.ipc.WritableRpcEngine$Invocation.<init>(WritableRpcEngine.java:112)
at 
org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:226)
at $Proxy9.getTransationId(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.getTransactionID(NamenodeProtocolTranslatorPB.java:185)
at 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.countUncheckpointedTxns(SecondaryNameNode.java:625)
at 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.shouldCheckpointBasedOnCount(SecondaryNameNode.java:633)
at 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:386)
at 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:356)
at java.lang.Thread.run(Thread.java:680)
Caused by: java.lang.NoSuchFieldException: versionID
at java.lang.Class.getField(Class.java:1520)
at org.apache.hadoop.ipc.RPC.getProtocolVersion(RPC.java:150)
... 9 more
java.lang.RuntimeException: java.lang.NoSuchFieldException: versionID
at org.apache.hadoop.ipc.RPC.getProtocolVersion(RPC.java:154)
at 
org.apache.hadoop.ipc.WritableRpcEngine$Invocation.<init>(WritableRpcEngine.java:112)
at 
org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:226)
at $Proxy9.getTransationId(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.getTransactionID(NamenodeProtocolTranslatorPB.java:185)
at 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.countUncheckpointedTxns(SecondaryNameNode.java:625)
at 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.shouldCheckpointBasedOnCount(SecondaryNameNode.java:633)
at 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:386)
at 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:356)
at java.lang.Thread.run(Thread.java:680)
Caused by: java.lang.NoSuchFieldException: versionID
at java.lang.Class.getField(Class.java:1520)
at org.apache.hadoop.ipc.RPC.getProtocolVersion(RPC.java:150)
... 9 more
11/12/31 12:13:14 INFO namenode.SecondaryNameNode: SHUTDOWN_MSG: 
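/************************************************************
SHUTDOWN_MSG: Shutting down SecondaryNameNode at sho-mba.local/192.168.11.2
************************************************************/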
{code}

full error log: http://pastebin.com/mSaVbS34
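
The trace suggests that the protocol version is read reflectively from a field named versionID, and that the lookup fails when the class handed to the RPC layer does not declare that field. The following is a minimal standalone sketch of that failure mode; it is not Hadoop's RPC code, and the class and interface names are hypothetical.

```java
import java.lang.reflect.Field;

/** Hypothetical stand-in showing how a reflective versionID lookup can fail. */
final class VersionIdLookup {
    private VersionIdLookup() {}

    /** A protocol that declares the expected constant. */
    interface WithVersion {
        long versionID = 6L; // interface fields are implicitly public static final
    }

    /** A protocol that does not declare it. */
    interface WithoutVersion {
    }

    static long getProtocolVersion(Class<?> protocol) {
        try {
            // getField only finds public fields; it throws NoSuchFieldException otherwise.
            Field field = protocol.getField("versionID");
            return field.getLong(null); // static field, so no instance is needed
        } catch (NoSuchFieldException | IllegalAccessException e) {
            // Wrapping the checked exception mirrors the shape of the logged error.
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(getProtocolVersion(WithVersion.class));
        try {
            getProtocolVersion(WithoutVersion.class);
        } catch (RuntimeException e) {
            System.out.println(e);
        }
    }
}
```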








[jira] [Updated] (HDFS-2291) HA: Checkpointing in an HA setup

2011-12-30 Thread Todd Lipcon (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-2291:
--

Attachment: hdfs-2291.txt

New iteration of the patch handles canceling checkpoints if a failover happens 
in the middle of the process.

There is still one very narrow race where the cancellation doesn't work -- but 
it won't cause a crash or anything, just a delayed failover. I'd like to 
address that in a followup since it wasn't simple to fix.

I took the liberty of refactoring some common configuration code out into a new 
CheckpointConf class. Hope that's OK.

I'll probably upload one more rev of this patch with better comments/javadoc, 
but this should be ready for general review.

> HA: Checkpointing in an HA setup
> 
>
> Key: HDFS-2291
> URL: https://issues.apache.org/jira/browse/HDFS-2291
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Aaron T. Myers
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2291.txt, hdfs-2291.txt
>
>
> We obviously need to create checkpoints when HA is enabled. One thought is to 
> use a third, dedicated checkpointing node in addition to the active and 
> standby nodes. Another option would be to make the standby capable of also 
> performing the function of checkpointing.





[jira] [Resolved] (HDFS-2716) HA: Configuration needs to allow different dfs.http.addresses for each HA NN

2011-12-30 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2716.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

> HA: Configuration needs to allow different dfs.http.addresses for each HA NN
> 
>
> Key: HDFS-2716
> URL: https://issues.apache.org/jira/browse/HDFS-2716
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2716.txt, hdfs-2716.txt
>
>
> Earlier on the HA branch we expanded the configuration so that different IPC 
> addresses can be specified for each of the HA NNs in a cluster. But we didn't 
> do this for the HTTP address. This has proved problematic while working on 
> HDFS-2291 (checkpointing in HA).
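
Per-NameNode addresses can be pictured as a suffixed key lookup that falls back to the generic key. This is only a sketch of the idea: the key format `dfs.http.address.<nnId>`, the NameNode IDs, and the class name are hypothetical and are not taken from the committed patch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical lookup of an HTTP address that may be overridden per NameNode. */
final class PerNameNodeConf {
    static final String HTTP_ADDRESS_KEY = "dfs.http.address";

    private PerNameNodeConf() {}

    static String httpAddressFor(Map<String, String> conf, String nnId) {
        // Prefer the NameNode-specific key, e.g. "dfs.http.address.nn1".
        String specific = conf.get(HTTP_ADDRESS_KEY + "." + nnId);
        // Fall back to the generic key so non-HA configurations keep working.
        return specific != null ? specific : conf.get(HTTP_ADDRESS_KEY);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("dfs.http.address", "0.0.0.0:50070");
        conf.put("dfs.http.address.nn1", "nn1.example.com:50070");
        System.out.println(httpAddressFor(conf, "nn1"));
        System.out.println(httpAddressFor(conf, "nn2"));
    }
}
```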





[jira] [Updated] (HDFS-2716) HA: Configuration needs to allow different dfs.http.addresses for each HA NN

2011-12-30 Thread Todd Lipcon (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-2716:
--

Attachment: hdfs-2716.txt

Fixed the javadoc. Also added a missing import in SecondaryNameNode.java 
(forgot to import HAUtil). I checked that it compiles and the new test still 
passes. Will commit momentarily. Thanks, atm

> HA: Configuration needs to allow different dfs.http.addresses for each HA NN
> 
>
> Key: HDFS-2716
> URL: https://issues.apache.org/jira/browse/HDFS-2716
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2716.txt, hdfs-2716.txt
>
>
> Earlier on the HA branch we expanded the configuration so that different IPC 
> addresses can be specified for each of the HA NNs in a cluster. But we didn't 
> do this for the HTTP address. This has proved problematic while working on 
> HDFS-2291 (checkpointing in HA).





[jira] [Commented] (HDFS-2737) HA: Automatically trigger log rolls periodically on the active NN

2011-12-30 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177869#comment-13177869
 ] 

Todd Lipcon commented on HDFS-2737:
---

bq. Is it worth considering supporting tailing non-finalized logs?

Worth considering, but as I understand from Ivan and Jitendra, BK doesn't 
support this functionality yet. Since it's simpler to just cause frequent 
rolls, we may as well do it this way and also solve the problem for BK at the 
same time, IMO.

> HA: Automatically trigger log rolls periodically on the active NN
> -
>
> Key: HDFS-2737
> URL: https://issues.apache.org/jira/browse/HDFS-2737
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>
> Currently, the edit log tailing process can only read finalized log segments. 
> So, if the active NN is not rolling its logs periodically, the SBN will lag a 
> lot. This also causes many datanode messages to be queued up in the 
> PendingDatanodeMessage structure.
> To combat this, the active NN needs to roll its logs periodically -- perhaps 
> based on a time threshold, or perhaps based on a number of transactions. I'm 
> not sure yet whether it's better to have the NN roll on its own or to have 
> the SBN ask the active NN to roll its logs.





[jira] [Commented] (HDFS-2737) HA: Automatically trigger log rolls periodically on the active NN

2011-12-30 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177866#comment-13177866
 ] 

Todd Lipcon commented on HDFS-2737:
---

A couple of options here:

1) Add a thread to the NN which rolls periodically (based on time or # txns)

This would be advantageous if we had some use cases for keeping edit log 
segments short even absent HA. The only case Aaron and I could brainstorm would 
be for backups, where it's a little easier to back up a finalized file compared 
to a rolling one. But we can satisfy this easily by adding a command line tool 
to trigger a roll, which a backup script can use. So it's not super compelling.

2) Add a new thread to the SBN which makes an IPC to the active and asks it to 
roll periodically

Advantage here is simplicity.

3) Add some code to the EditLogTailer thread in the SBN which makes a call to 
the active NN to trigger a roll when necessary (eg when the 
PendingDatanodeMessage queue is too large, or it's been too long since it has 
read any edits).

Advantage here is that the real motivation for the rolls is the EditLogTailer 
itself. We want to keep lag low (for fast recovery) and also keep the pending 
datanode queue small (to fit within memory bounds). By putting the trigger 
here, we can directly inspect those two variables, and trigger rolls when 
necessary.

So I'm thinking option 3 is the best.
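
As a rough sketch of option 3, the tailer could check its two variables on each iteration and ask the active NN to roll when either crosses a limit. The class name, parameter names, and thresholds below are hypothetical; the real EditLogTailer and the IPC call to the active NN are not shown.

```java
/** Hypothetical decision logic for when the standby should request a log roll. */
final class RollTrigger {
    private final int maxPendingMessages;
    private final long maxMillisWithoutEdits;

    RollTrigger(int maxPendingMessages, long maxMillisWithoutEdits) {
        this.maxPendingMessages = maxPendingMessages;
        this.maxMillisWithoutEdits = maxMillisWithoutEdits;
    }

    /**
     * Returns true when the pending datanode message queue is too large
     * or when too much time has passed since any edits were read.
     */
    boolean shouldTriggerRoll(int pendingMessages, long millisSinceLastEdits) {
        return pendingMessages > maxPendingMessages
                || millisSinceLastEdits > maxMillisWithoutEdits;
    }

    public static void main(String[] args) {
        RollTrigger trigger = new RollTrigger(10_000, 120_000L);
        System.out.println(trigger.shouldTriggerRoll(500, 5_000L));
        System.out.println(trigger.shouldTriggerRoll(20_000, 5_000L));
        System.out.println(trigger.shouldTriggerRoll(500, 300_000L));
    }
}
```

Keeping the check in one place means both motivations (low lag and a bounded queue) are expressed as plain conditions on values the tailer already has.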

> HA: Automatically trigger log rolls periodically on the active NN
> -
>
> Key: HDFS-2737
> URL: https://issues.apache.org/jira/browse/HDFS-2737
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>
> Currently, the edit log tailing process can only read finalized log segments. 
> So, if the active NN is not rolling its logs periodically, the SBN will lag a 
> lot. This also causes many datanode messages to be queued up in the 
> PendingDatanodeMessage structure.
> To combat this, the active NN needs to roll its logs periodically -- perhaps 
> based on a time threshold, or perhaps based on a number of transactions. I'm 
> not sure yet whether it's better to have the NN roll on its own or to have 
> the SBN ask the active NN to roll its logs.





[jira] [Created] (HDFS-2738) FSEditLog.selectinputStreams is reading through in-progress streams even when non-in-progress are requested

2011-12-30 Thread Todd Lipcon (Created) (JIRA)
FSEditLog.selectinputStreams is reading through in-progress streams even when 
non-in-progress are requested
---

 Key: HDFS-2738
 URL: https://issues.apache.org/jira/browse/HDFS-2738
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Affects Versions: HA branch (HDFS-1623)
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Critical


The new code in HDFS-1580 is causing an issue with selectInputStreams in the HA 
context. When the active is writing to the shared edits, selectInputStreams is 
called on the standby. This ends up calling {{journalSet.getInputStream}} but 
doesn't pass the {{inProgressOk=false}} flag. So, {{getInputStream}} ends up 
reading and validating the in-progress stream unnecessarily. Since the 
validation results are no longer properly cached, {{findMaxTransaction}} also 
re-validates the in-progress stream, and then breaks the corruption check in 
this code. The end result is a lot of errors like:

2011-12-30 16:45:02,521 ERROR namenode.FileJournalManager 
(FileJournalManager.java:getNumberOfTransactions(266)) - Gap in transactions, 
max txnid is 579, 0 txns from 578
2011-12-30 16:45:02,521 INFO  ha.EditLogTailer (EditLogTailer.java:run(163)) - 
Got error, will try again.
java.io.IOException: No non-corrupt logs for txid 578
at 
org.apache.hadoop.hdfs.server.namenode.JournalSet.getInputStream(JournalSet.java:229)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1081)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:115)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.access$0(EditLogTailer.java:100)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:154)
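
The intended behaviour can be sketched as a filter that drops in-progress segments before anything opens or validates them. This is an illustration only: `SegmentSelector`, `Segment`, and the `select` signature are hypothetical and do not reproduce the FSEditLog or JournalSet code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of choosing edit log segments to read from a given txid. */
final class SegmentSelector {
    private SegmentSelector() {}

    static final class Segment {
        final long firstTxId;
        final long lastTxId;      // not meaningful while inProgress is true
        final boolean inProgress;

        Segment(long firstTxId, long lastTxId, boolean inProgress) {
            this.firstTxId = firstTxId;
            this.lastTxId = lastTxId;
            this.inProgress = inProgress;
        }
    }

    static List<Segment> select(List<Segment> all, long fromTxId, boolean inProgressOk) {
        List<Segment> selected = new ArrayList<>();
        for (Segment s : all) {
            if (s.inProgress && !inProgressOk) {
                continue; // skipped before any read or validation happens
            }
            if (s.inProgress || s.lastTxId >= fromTxId) {
                selected.add(s);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Segment> all = Arrays.asList(
                new Segment(1, 100, false),
                new Segment(101, 200, false),
                new Segment(201, -1, true));
        System.out.println(select(all, 201, false).size());
        System.out.println(select(all, 201, true).size());
    }
}
```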






[jira] [Commented] (HDFS-2716) HA: Configuration needs to allow different dfs.http.addresses for each HA NN

2011-12-30 Thread Aaron T. Myers (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177857#comment-13177857
 ] 

Aaron T. Myers commented on HDFS-2716:
--

{code}
+   * @param object
{code}

That's not the most helpful javadoc comment. Other than that, +1.

> HA: Configuration needs to allow different dfs.http.addresses for each HA NN
> 
>
> Key: HDFS-2716
> URL: https://issues.apache.org/jira/browse/HDFS-2716
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Attachments: hdfs-2716.txt
>
>
> Earlier on the HA branch we expanded the configuration so that different IPC 
> addresses can be specified for each of the HA NNs in a cluster. But we didn't 
> do this for the HTTP address. This has proved problematic while working on 
> HDFS-2291 (checkpointing in HA).





[jira] [Commented] (HDFS-2736) HA: support 2NN with SBN

2011-12-30 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177858#comment-13177858
 ] 

Todd Lipcon commented on HDFS-2736:
---

Agreed. Currently we also basically support multiple-standby. Even though it 
wasn't a goal of the implementation, it basically fell out for free.

> HA: support 2NN with SBN
> 
>
> Key: HDFS-2736
> URL: https://issues.apache.org/jira/browse/HDFS-2736
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Eli Collins
>
> HDFS-2291 adds support for making the SBN capable of checkpointing, seems 
> like we may also need to support the 2NN checkpointing as well. Eg if we fail 
> over to the SBN does it continue to checkpoint? If not the log grows 
> unbounded until the old primary comes back, if so does that create 
> performance problems since the primary wasn't previously checkpointing?





[jira] [Updated] (HDFS-2736) HA: support 2NN with SBN

2011-12-30 Thread Eli Collins (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins updated HDFS-2736:
--

Summary: HA: support 2NN with SBN  (was: HA: support separate SBN and 2NN?)

> HA: support 2NN with SBN
> 
>
> Key: HDFS-2736
> URL: https://issues.apache.org/jira/browse/HDFS-2736
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Eli Collins
>
> HDFS-2291 adds support for making the SBN capable of checkpointing, seems 
> like we may also need to support the 2NN checkpointing as well. Eg if we fail 
> over to the SBN does it continue to checkpoint? If not the log grows 
> unbounded until the old primary comes back, if so does that create 
> performance problems since the primary wasn't previously checkpointing?





[jira] [Resolved] (HDFS-2736) HA: support separate SBN and 2NN?

2011-12-30 Thread Eli Collins (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins resolved HDFS-2736.
---

Resolution: Won't Fix

I was thinking about the case where the former active remains dead for some 
time. But this case is problematic for a number of other reasons (eg with just 
a single host we can't fail back if necessary), so I think it's reasonable to 
require users to start another SBN instead of also deploying a 2NN. Closing as 
won't fix; can re-open if others disagree.

> HA: support separate SBN and 2NN?
> -
>
> Key: HDFS-2736
> URL: https://issues.apache.org/jira/browse/HDFS-2736
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Eli Collins
>
> HDFS-2291 adds support for making the SBN capable of checkpointing, seems 
> like we may also need to support the 2NN checkpointing as well. Eg if we fail 
> over to the SBN does it continue to checkpoint? If not the log grows 
> unbounded until the old primary comes back, if so does that create 
> performance problems since the primary wasn't previously checkpointing?





[jira] [Commented] (HDFS-2737) HA: Automatically trigger log rolls periodically on the active NN

2011-12-30 Thread Eli Collins (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177842#comment-13177842
 ] 

Eli Collins commented on HDFS-2737:
---

Is it worth considering supporting tailing non-finalized logs?

> HA: Automatically trigger log rolls periodically on the active NN
> -
>
> Key: HDFS-2737
> URL: https://issues.apache.org/jira/browse/HDFS-2737
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>
> Currently, the edit log tailing process can only read finalized log segments. 
> So, if the active NN is not rolling its logs periodically, the SBN will lag a 
> lot. This also causes many datanode messages to be queued up in the 
> PendingDatanodeMessage structure.
> To combat this, the active NN needs to roll its logs periodically -- perhaps 
> based on a time threshold, or perhaps based on a number of transactions. I'm 
> not sure yet whether it's better to have the NN roll on its own or to have 
> the SBN ask the active NN to roll its logs.





[jira] [Updated] (HDFS-2709) HA: Appropriately handle error conditions in EditLogTailer

2011-12-30 Thread Aaron T. Myers (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron T. Myers updated HDFS-2709:
-

Attachment: HDFS-2709-HDFS-1623.patch

Here's a patch which addresses the first two points from above. I'm still 
working on adding some tests for {{TestFileJournalManager}}, but this is worth 
reviewing in the meantime.

> HA: Appropriately handle error conditions in EditLogTailer
> --
>
> Key: HDFS-2709
> URL: https://issues.apache.org/jira/browse/HDFS-2709
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Aaron T. Myers
>Priority: Critical
> Attachments: HDFS-2709-HDFS-1623.patch, HDFS-2709-HDFS-1623.patch, 
> HDFS-2709-HDFS-1623.patch, HDFS-2709-HDFS-1623.patch
>
>
> Currently if the edit log tailer experiences an error replaying edits in the 
> middle of a file, it will go back to retrying from the beginning of the file 
> on the next tailing iteration. This is incorrect since many of the edits will 
> have already been replayed, and not all edits are idempotent.
> Instead, we either need to (a) support reading from the middle of a finalized 
> file (ie skip those edits already applied), or (b) abort the standby if it 
> hits an error while tailing. If "a" isn't simple, let's do "b" for now and 
> come back to 'a' later since this is a rare circumstance and better to abort 
> than be incorrect.
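
Option (a) amounts to remembering the last applied transaction ID and skipping everything at or below it when a file is re-read. The sketch below illustrates that idea under simplifying assumptions: edits are represented only by their txids, and `ResumableReplayer` is a hypothetical name, not a class from the patch.

```java
import java.util.function.LongConsumer;

/** Hypothetical replayer that never applies the same transaction twice. */
final class ResumableReplayer {
    private long lastAppliedTxId;

    ResumableReplayer(long lastAppliedTxId) {
        this.lastAppliedTxId = lastAppliedTxId;
    }

    long getLastAppliedTxId() {
        return lastAppliedTxId;
    }

    /** Replays the file from its start, skipping edits that were already applied. */
    int replay(long[] txIdsInFile, LongConsumer applier) {
        int applied = 0;
        for (long txId : txIdsInFile) {
            if (txId <= lastAppliedTxId) {
                continue; // already applied on an earlier attempt
            }
            applier.accept(txId);   // may throw; progress so far is kept
            lastAppliedTxId = txId; // advance only after a successful apply
            applied++;
        }
        return applied;
    }

    public static void main(String[] args) {
        ResumableReplayer replayer = new ResumableReplayer(2L);
        int applied = replayer.replay(new long[] {1L, 2L, 3L, 4L},
                txId -> System.out.println("applying " + txId));
        System.out.println(applied + " edits applied");
    }
}
```

Because the marker advances only after an edit succeeds, a retry after a mid-file error resumes at the failed edit rather than at the beginning of the file.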





[jira] [Updated] (HDFS-1314) dfs.block.size accepts only absolute value

2011-12-30 Thread Sho Shimauchi (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sho Shimauchi updated HDFS-1314:


Status: Patch Available  (was: Open)

> dfs.block.size accepts only absolute value
> --
>
> Key: HDFS-1314
> URL: https://issues.apache.org/jira/browse/HDFS-1314
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Karim Saadah
>Assignee: Sho Shimauchi
>Priority: Minor
>  Labels: newbie
> Attachments: hdfs-1314.txt, hdfs-1314.txt, hdfs-1314.txt
>
>
> Using "dfs.block.size=8388608" works 
> but "dfs.block.size=8mb" does not.
> Using "dfs.block.size=8mb" should throw some WARNING on NumberFormatException.
> (http://pastebin.corp.yahoo.com/56129)
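
A minimal sketch of the kind of parsing the report asks for. It assumes a hypothetical helper named BlockSizeParser (not a Hadoop class), binary multiples for the k/m/g suffixes, and an optional trailing "b"; the actual patch may take a different approach.

```java
import java.util.Locale;

/** Hypothetical helper, not part of Hadoop: parses "8388608", "8m" or "8mb". */
final class BlockSizeParser {
  private BlockSizeParser() {}

  static long parse(String raw) {
    String s = raw.trim().toLowerCase(Locale.ROOT);
    if (s.endsWith("b")) {
      s = s.substring(0, s.length() - 1); // accept "8mb" as well as "8m"
    }
    if (s.isEmpty()) {
      throw new NumberFormatException("No digits in size: '" + raw + "'");
    }
    long multiplier = 1L;
    switch (s.charAt(s.length() - 1)) {
      case 'k': multiplier = 1L << 10; break;
      case 'm': multiplier = 1L << 20; break;
      case 'g': multiplier = 1L << 30; break;
      default: break; // no suffix: plain byte count
    }
    if (multiplier != 1L) {
      s = s.substring(0, s.length() - 1);
    }
    // Long.parseLong throws NumberFormatException for non-numeric input;
    // multiplyExact throws ArithmeticException instead of silently overflowing.
    return Math.multiplyExact(Long.parseLong(s.trim()), multiplier);
  }
}
```

With this sketch, "8mb" and "8388608" parse to the same value, and a value such as "eight" fails with a NumberFormatException that the caller can log as a warning.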





[jira] [Updated] (HDFS-1314) dfs.block.size accepts only absolute value

2011-12-30 Thread Sho Shimauchi (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sho Shimauchi updated HDFS-1314:


Status: Open  (was: Patch Available)

> dfs.block.size accepts only absolute value
> --
>
> Key: HDFS-1314
> URL: https://issues.apache.org/jira/browse/HDFS-1314
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Karim Saadah
>Assignee: Sho Shimauchi
>Priority: Minor
>  Labels: newbie
> Attachments: hdfs-1314.txt, hdfs-1314.txt, hdfs-1314.txt
>
>
> Using "dfs.block.size=8388608" works 
> but "dfs.block.size=8mb" does not.
> Using "dfs.block.size=8mb" should throw some WARNING on NumberFormatException.
> (http://pastebin.corp.yahoo.com/56129)





[jira] [Created] (HDFS-2737) HA: Automatically trigger log rolls periodically on the active NN

2011-12-30 Thread Todd Lipcon (Created) (JIRA)

HA: Automatically trigger log rolls periodically on the active NN
-

 Key: HDFS-2737
 URL: https://issues.apache.org/jira/browse/HDFS-2737
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Affects Versions: HA branch (HDFS-1623)
Reporter: Todd Lipcon
Assignee: Todd Lipcon


Currently, the edit log tailing process can only read finalized log segments. 
So, if the active NN is not rolling its logs periodically, the SBN will lag a 
lot. This also causes many datanode messages to be queued up in the 
PendingDatanodeMessage structure.

To combat this, the active NN needs to roll its logs periodically -- perhaps 
based on a time threshold, or perhaps based on a number of transactions. I'm 
not sure yet whether it's better to have the NN roll on its own or to have the 
SBN ask the active NN to roll its logs.
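
The two triggers mentioned above (a time threshold or a transaction count) can be combined in one check. This is only a sketch: LogRollPolicy and its parameter names are hypothetical, and it deliberately does not settle whether the active NN or the SBN should be the one to evaluate it.

```java
/** Hypothetical policy object; names and thresholds are illustrative only. */
final class LogRollPolicy {
  private final long maxTxnsPerSegment;
  private final long maxSegmentAgeMillis;

  LogRollPolicy(long maxTxnsPerSegment, long maxSegmentAgeMillis) {
    this.maxTxnsPerSegment = maxTxnsPerSegment;
    this.maxSegmentAgeMillis = maxSegmentAgeMillis;
  }

  /** Returns true once either threshold is reached for the open segment. */
  boolean shouldRoll(long txnsInSegment, long segmentAgeMillis) {
    return txnsInSegment >= maxTxnsPerSegment
        || segmentAgeMillis >= maxSegmentAgeMillis;
  }
}
```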





[jira] [Commented] (HDFS-2709) HA: Appropriately handle error conditions in EditLogTailer

2011-12-30 Thread Eli Collins (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177792#comment-13177792
 ] 

Eli Collins commented on HDFS-2709:
---

Sounds good to me.

> HA: Appropriately handle error conditions in EditLogTailer
> --
>
> Key: HDFS-2709
> URL: https://issues.apache.org/jira/browse/HDFS-2709
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Aaron T. Myers
>Priority: Critical
> Attachments: HDFS-2709-HDFS-1623.patch, HDFS-2709-HDFS-1623.patch, 
> HDFS-2709-HDFS-1623.patch
>
>
> Currently if the edit log tailer experiences an error replaying edits in the 
> middle of a file, it will go back to retrying from the beginning of the file 
> on the next tailing iteration. This is incorrect since many of the edits will 
> have already been replayed, and not all edits are idempotent.
> Instead, we either need to (a) support reading from the middle of a finalized 
> file (ie skip those edits already applied), or (b) abort the standby if it 
> hits an error while tailing. If "a" isn't simple, let's do "b" for now and 
> come back to 'a' later since this is a rare circumstance and better to abort 
> than be incorrect.





[jira] [Commented] (HDFS-2736) HA: support separate SBN and 2NN?

2011-12-30 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177791#comment-13177791
 ] 

Todd Lipcon commented on HDFS-2736:
---

bq. if we fail over to the SBN does it continue to checkpoint?

If the old active NN comes back, it will start in standby mode and then do
checkpoints against the other one. It's only if the former active remains dead
for some time that we might have a problem.

bq. If not the log grows unbounded until the old primary comes back, if so does 
that create performance problems since the primary wasn't previously 
checkpointing?

Certainly the first startup after a long outage will be slower, since it has to
replay a lot of transactions. But it's not any different than what happens 
today if the 2NN goes down for some length of time.

Given that people currently run successfully with only a single 2NN, I don't 
think this is particularly high priority. Do you disagree?

> HA: support separate SBN and 2NN?
> -
>
> Key: HDFS-2736
> URL: https://issues.apache.org/jira/browse/HDFS-2736
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Eli Collins
>
> HDFS-2291 adds support for making the SBN capable of checkpointing, seems 
> like we may also need to support the 2NN checkpointing as well. Eg if we fail 
> over to the SBN does it continue to checkpoint? If not the log grows 
> unbounded until the old primary comes back, if so does that create 
> performance problems since the primary wasn't previously checkpointing?





[jira] [Commented] (HDFS-2731) Autopopulate standby name dirs if they're empty

2011-12-30 Thread Eli Collins (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177788#comment-13177788
 ] 

Eli Collins commented on HDFS-2731:
---

Thanks. Filed HDFS-2736 to figure it out.

> Autopopulate standby name dirs if they're empty
> ---
>
> Key: HDFS-2731
> URL: https://issues.apache.org/jira/browse/HDFS-2731
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Eli Collins
>Assignee: Eli Collins
>
> To set up an SBN we currently format the primary, then manually copy the name
> dirs to the SBN. The SBN should do this automatically. Specifically, on NN
> startup, if HA with a shared edits dir is configured and populated, and the
> SBN has empty name dirs, it should download the image and log from the
> primary (as an optimization it could copy the logs from the shared dir). If
> the other NN is still in standby then it should fail to start as it does
> currently.
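
The startup rule in the description can be read as a small decision function. The sketch below is one interpretation, not the eventual implementation: StandbyBootstrapSketch, the Action names and the boolean inputs are all hypothetical, and treating "empty name dirs without populated shared edits" as a failure is an assumption.

```java
/** Hypothetical decision helper; every name here is illustrative. */
final class StandbyBootstrapSketch {
  enum Action { START_NORMALLY, DOWNLOAD_FROM_PRIMARY, FAIL_TO_START }

  private StandbyBootstrapSketch() {}

  static Action decide(boolean sharedEditsPopulated,
                       boolean nameDirsEmpty,
                       boolean otherNnIsStandby) {
    if (!nameDirsEmpty) {
      return Action.START_NORMALLY; // nothing to populate
    }
    if (!sharedEditsPopulated) {
      return Action.FAIL_TO_START; // assumption: no source to copy from
    }
    if (otherNnIsStandby) {
      return Action.FAIL_TO_START; // as described: fail as it does today
    }
    return Action.DOWNLOAD_FROM_PRIMARY; // fetch image and log from primary
  }
}
```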





[jira] [Created] (HDFS-2736) HA: support separate SBN and 2NN?

2011-12-30 Thread Eli Collins (Created) (JIRA)

HA: support separate SBN and 2NN?
-

 Key: HDFS-2736
 URL: https://issues.apache.org/jira/browse/HDFS-2736
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha
Affects Versions: HA branch (HDFS-1623)
Reporter: Eli Collins


HDFS-2291 adds support for making the SBN capable of checkpointing, seems like 
we may also need to support the 2NN checkpointing as well. Eg if we fail over 
to the SBN does it continue to checkpoint? If not the log grows unbounded until 
the old primary comes back, if so does that create performance problems since 
the primary wasn't previously checkpointing?





[jira] [Commented] (HDFS-2709) HA: Appropriately handle error conditions in EditLogTailer

2011-12-30 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177761#comment-13177761
 ] 

Todd Lipcon commented on HDFS-2709:
---

Aaron and I chatted offline about the above questions a little bit. We think 
the following is the best route forward:
- Instead of adding a new constructor to ELFIS, add a new "seekToTxnId" method 
which FileJournalManager can call after constructing it. (the reasoning being 
that this is more similar to the normal Java FileInputStream which has a 
separate seek() call)
- In FSEditLogLoader, we decided that the custom exception would make the most
sense -- i.e., wrap the {{readOp}} call in a {{try/catch}} which would rethrow
the exception as some kind of new {{EditLogInputException}}. The new
exception would also have a getter to determine how many txns were successfully 
applied prior to the error. This is similar to how InterruptedIOException works 
in the standard library.
- Regarding tests, the suggestion was to add some new test cases to 
{{TestFileJournalManager}} to exercise the new code in {{selectInputStreams}}.
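
A rough sketch of the exception-with-a-counter idea from the second point. Only the name EditLogInputException comes from the discussion; the getter name, the OpSource interface and the replay loop are hypothetical stand-ins, and the real FSEditLogLoader is considerably more involved.

```java
import java.io.IOException;

/** Carries how many txns were applied before the read failed. */
final class EditLogInputException extends IOException {
  private static final long serialVersionUID = 1L;
  private final long numEditsLoaded;

  EditLogInputException(String message, Throwable cause, long numEditsLoaded) {
    super(message, cause);
    this.numEditsLoaded = numEditsLoaded;
  }

  long getNumEditsLoaded() {
    return numEditsLoaded;
  }
}

/** Hypothetical stand-in for the loader loop. */
final class EditReplaySketch {
  /** Returns the next txid, or null at end of stream. */
  interface OpSource {
    Long readOp() throws IOException;
  }

  private EditReplaySketch() {}

  static long replay(OpSource in) throws EditLogInputException {
    long applied = 0;
    while (true) {
      Long txId;
      try {
        txId = in.readOp();
      } catch (IOException e) {
        // Rethrow with the count so the caller can resume after it.
        throw new EditLogInputException(
            "Error replaying edits after " + applied + " txns", e, applied);
      }
      if (txId == null) {
        return applied; // clean end of stream
      }
      applied++; // a real loader would apply the op here
    }
  }
}
```

A caller such as the tailer could catch the exception, add getNumEditsLoaded() to its last applied txid, and resume from there on the next iteration instead of from the start of the file.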

> HA: Appropriately handle error conditions in EditLogTailer
> --
>
> Key: HDFS-2709
> URL: https://issues.apache.org/jira/browse/HDFS-2709
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Aaron T. Myers
>Priority: Critical
> Attachments: HDFS-2709-HDFS-1623.patch, HDFS-2709-HDFS-1623.patch, 
> HDFS-2709-HDFS-1623.patch
>
>
> Currently if the edit log tailer experiences an error replaying edits in the 
> middle of a file, it will go back to retrying from the beginning of the file 
> on the next tailing iteration. This is incorrect since many of the edits will 
> have already been replayed, and not all edits are idempotent.
> Instead, we either need to (a) support reading from the middle of a finalized 
> file (ie skip those edits already applied), or (b) abort the standby if it 
> hits an error while tailing. If "a" isn't simple, let's do "b" for now and 
> come back to 'a' later since this is a rare circumstance and better to abort 
> than be incorrect.





[jira] [Created] (HDFS-2735) HA: add tests for multiple shared edits dirs

2011-12-30 Thread Eli Collins (Created) (JIRA)

HA: add tests for multiple shared edits dirs


 Key: HDFS-2735
 URL: https://issues.apache.org/jira/browse/HDFS-2735
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, test
Affects Versions: HA branch (HDFS-1623)
Reporter: Eli Collins


You can configure and run with multiple shared edits dirs but we don't have any 
test coverage for them. In particular, we should cover the behavior of the edit 
log tailer with multiple dirs, and failure scenarios (eg can we tolerate a 
single shared dir failure if we have two shared dirs).





[jira] [Commented] (HDFS-2709) HA: Appropriately handle error conditions in EditLogTailer

2011-12-30 Thread Aaron T. Myers (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177735#comment-13177735
 ] 

Aaron T. Myers commented on HDFS-2709:
--

bq. Rather than modify EditLogFileInputStream to take a startTxId, why not do 
the "skipping" (what you call setInitialPosition) from the caller? ie modify 
FSEditLogLoader to skip the transactions that have already been replayed? The 
skipping code doesn't seem specific to the input stream itself.

What I did seems cleaner to me. We're necessarily changing the code which 
selects streams to allow a request for a starting txid in the middle of an ELF, 
so why should that return an ELFIS which starts at a lower txid?

bq. I'm not convinced why we need to have the partialLoadOk flag in 
FSEditLogLoader. IMO if the log is truncated, it's still an error as far as the 
loader is concerned - we just want to let the caller continue from where the 
error occurred. The only trick is how to go about getting the last successfully
loaded txid out of the FSEditLogLoader in the error case – I guess a member 
variable and a getter would work there? Do you think this ends up messier than 
the way you've done it?

I considered that. I also considered throwing a custom {{Exception}} which 
includes the last successfully-loaded txid. Both of those seemed more messy 
than the way I did it, but I could probably be convinced otherwise.

Note that the way I did it in this patch does not preclude the 
{{EditLogTailer}} from detecting that not all expected transactions were 
loaded. The {{EditLogTailer}} already knows both how many are available from 
the files in the shared dir, and how many transactions were in fact loaded. 
This would allow one to implement Eli's suggestion of "retry a read failure X 
times and then exit," though this patch does not currently do that.

bq. Can we add some non-HA tests that exercise 
FileJournalManager/FSEditLogLoader's ability to start mid-stream? Not sure if 
that's feasible.

I'm not quite sure what you mean by this. The way the code is currently 
structured, the code for continuing from the middle of an ELF will only be 
reached in an HA context. That's the point of the {{partialLoadOk}} option, 
which is only passed as true when HA is enabled.
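
Eli's "retry a read failure X times and then exit" suggestion, mentioned above as not yet implemented, could be tracked with a small counter. This is a sketch under assumptions: ReadFailureBudget is a hypothetical name, and counting consecutive failures (resetting on any success) is one possible reading of the suggestion.

```java
/** Hypothetical counter for consecutive tailing failures. */
final class ReadFailureBudget {
  private final int maxRetries;
  private int consecutiveFailures;

  ReadFailureBudget(int maxRetries) {
    this.maxRetries = maxRetries;
  }

  /** Call after a tailing iteration that loaded everything expected. */
  void recordSuccess() {
    consecutiveFailures = 0;
  }

  /**
   * Call after a failed iteration.
   * Returns true when the retries are used up and the caller should exit.
   */
  boolean recordFailure() {
    consecutiveFailures++;
    return consecutiveFailures > maxRetries;
  }
}
```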

> HA: Appropriately handle error conditions in EditLogTailer
> --
>
> Key: HDFS-2709
> URL: https://issues.apache.org/jira/browse/HDFS-2709
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Aaron T. Myers
>Priority: Critical
> Attachments: HDFS-2709-HDFS-1623.patch, HDFS-2709-HDFS-1623.patch, 
> HDFS-2709-HDFS-1623.patch
>
>
> Currently if the edit log tailer experiences an error replaying edits in the 
> middle of a file, it will go back to retrying from the beginning of the file 
> on the next tailing iteration. This is incorrect since many of the edits will 
> have already been replayed, and not all edits are idempotent.
> Instead, we either need to (a) support reading from the middle of a finalized 
> file (ie skip those edits already applied), or (b) abort the standby if it 
> hits an error while tailing. If "a" isn't simple, let's do "b" for now and 
> come back to 'a' later since this is a rare circumstance and better to abort 
> than be incorrect.





[jira] [Commented] (HDFS-2731) Autopopulate standby name dirs if they're empty

2011-12-30 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177732#comment-13177732
 ] 

Todd Lipcon commented on HDFS-2731:
---

We don't currently support using the 2NN with HA -- we'd need to improve it to 
upload the checkpointed images to both NNs. If we think this is critical we 
should file another JIRA for it... in HDFS-2291 I actually have added a check 
to the 2NN startup that makes it fail to start if HA is enabled.

> Autopopulate standby name dirs if they're empty
> ---
>
> Key: HDFS-2731
> URL: https://issues.apache.org/jira/browse/HDFS-2731
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Eli Collins
>Assignee: Eli Collins
>
> To set up an SBN we currently format the primary, then manually copy the name
> dirs to the SBN. The SBN should do this automatically. Specifically, on NN
> startup, if HA with a shared edits dir is configured and populated, and the
> SBN has empty name dirs, it should download the image and log from the
> primary (as an optimization it could copy the logs from the shared dir). If
> the other NN is still in standby then it should fail to start as it does
> currently.





[jira] [Commented] (HDFS-2709) HA: Appropriately handle error conditions in EditLogTailer

2011-12-30 Thread Eli Collins (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177729#comment-13177729
 ] 

Eli Collins commented on HDFS-2709:
---

bq.  Rather than modify EditLogFileInputStream to take a startTxId, why not do 
the  "skipping" (what you call setInitialPosition) from the caller?

Had the same thought, was thinking it was internal so that in the future we 
could skip ahead w/o actually reading the logs, but no reason we can't have a 
skip API and use that outside ELFIS as well.
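
A skip API of the kind discussed here might look like the sketch below. TxnSkipSketch and skipTo are hypothetical names, and the txids are modelled as a plain ListIterator rather than an edit log stream. Note that this version skips by reading each entry; it does not attempt the "skip ahead without reading" optimization.

```java
import java.util.ListIterator;

/** Hypothetical helper; an edit log stream is modelled as a list of txids. */
final class TxnSkipSketch {
  private TxnSkipSketch() {}

  /**
   * Advances past every txid lower than startTxId and returns how many
   * were skipped. Afterwards next() yields the first txid >= startTxId,
   * if there is one.
   */
  static int skipTo(ListIterator<Long> txIds, long startTxId) {
    int skipped = 0;
    while (txIds.hasNext()) {
      if (txIds.next() >= startTxId) {
        txIds.previous(); // step back so the caller sees this txid
        break;
      }
      skipped++;
    }
    return skipped;
  }
}
```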


> HA: Appropriately handle error conditions in EditLogTailer
> --
>
> Key: HDFS-2709
> URL: https://issues.apache.org/jira/browse/HDFS-2709
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Aaron T. Myers
>Priority: Critical
> Attachments: HDFS-2709-HDFS-1623.patch, HDFS-2709-HDFS-1623.patch, 
> HDFS-2709-HDFS-1623.patch
>
>
> Currently if the edit log tailer experiences an error replaying edits in the 
> middle of a file, it will go back to retrying from the beginning of the file 
> on the next tailing iteration. This is incorrect since many of the edits will 
> have already been replayed, and not all edits are idempotent.
> Instead, we either need to (a) support reading from the middle of a finalized 
> file (ie skip those edits already applied), or (b) abort the standby if it 
> hits an error while tailing. If "a" isn't simple, let's do "b" for now and 
> come back to 'a' later since this is a rare circumstance and better to abort 
> than be incorrect.





[jira] [Commented] (HDFS-2731) Autopopulate standby name dirs if they're empty

2011-12-30 Thread Eli Collins (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177727#comment-13177727
 ] 

Eli Collins commented on HDFS-2731:
---

Don't we support using both the SBN and 2NN though?  There are issues requiring 
the SBN be the checkpointer, eg no one is doing checkpointing if the primary 
dies and you fail-over or the SBN still checkpoints but doesn't have enough 
memory.

> Autopopulate standby name dirs if they're empty
> ---
>
> Key: HDFS-2731
> URL: https://issues.apache.org/jira/browse/HDFS-2731
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Eli Collins
>Assignee: Eli Collins
>
> To set up an SBN we currently format the primary, then manually copy the name
> dirs to the SBN. The SBN should do this automatically. Specifically, on NN
> startup, if HA with a shared edits dir is configured and populated, and the
> SBN has empty name dirs, it should download the image and log from the
> primary (as an optimization it could copy the logs from the shared dir). If
> the other NN is still in standby then it should fail to start as it does
> currently.





[jira] [Assigned] (HDFS-229) Show last checkpoint time on the webUI

2011-12-30 Thread Sho Shimauchi (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sho Shimauchi reassigned HDFS-229:
--

Assignee: Sho Shimauchi

> Show last checkpoint time on the webUI
> --
>
> Key: HDFS-229
> URL: https://issues.apache.org/jira/browse/HDFS-229
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Koji Noguchi
>Assignee: Sho Shimauchi
>Priority: Minor
>
> We've had a couple of occasions where secondary namenode fail/die without 
> anyone noticing. (HADOOP-1695)
> It would be nice if the last checkpoint time can be shown on the namenode 
> webui.
>  





[jira] [Commented] (HDFS-2729) Update BlockManager's comments regarding the invalid block set

2011-12-30 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177655#comment-13177655
 ] 

Hudson commented on HDFS-2729:
--

Integrated in Hadoop-Mapreduce-trunk #943 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/943/])
HDFS-2729. Update BlockManager's comments regarding the invalid block set 
(harsh)

harsh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1225591
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java


> Update BlockManager's comments regarding the invalid block set
> --
>
> Key: HDFS-2729
> URL: https://issues.apache.org/jira/browse/HDFS-2729
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.23.0
>Reporter: Harsh J
>Assignee: Harsh J
>Priority: Minor
> Fix For: 0.24.0
>
> Attachments: HDFS-2729.patch
>
>
> Looks like after HDFS-82 was covered at some point, the comments and logs 
> still carry presence of two sets when there really is just one set.
> This patch changes the logs and comments to be more accurate about that.





[jira] [Commented] (HDFS-229) Show last checkpoint time on the webUI

2011-12-30 Thread Harsh J (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177646#comment-13177646
 ] 

Harsh J commented on HDFS-229:
--

+1 for last chkpt time in report (and web UI) 
+1 for size of image/edits in web UI

(Do check if this is already present today)

> Show last checkpoint time on the webUI
> --
>
> Key: HDFS-229
> URL: https://issues.apache.org/jira/browse/HDFS-229
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Koji Noguchi
>Priority: Minor
>
> We've had a couple of occasions where secondary namenode fail/die without 
> anyone noticing. (HADOOP-1695)
> It would be nice if the last checkpoint time can be shown on the namenode 
> webui.
>  





[jira] [Commented] (HDFS-2729) Update BlockManager's comments regarding the invalid block set

2011-12-30 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177633#comment-13177633
 ] 

Hudson commented on HDFS-2729:
--

Integrated in Hadoop-Hdfs-trunk #910 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/910/])
HDFS-2729. Update BlockManager's comments regarding the invalid block set 
(harsh)

harsh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1225591
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java


> Update BlockManager's comments regarding the invalid block set
> --
>
> Key: HDFS-2729
> URL: https://issues.apache.org/jira/browse/HDFS-2729
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.23.0
>Reporter: Harsh J
>Assignee: Harsh J
>Priority: Minor
> Fix For: 0.24.0
>
> Attachments: HDFS-2729.patch
>
>
> Looks like after HDFS-82 was covered at some point, the comments and logs 
> still carry presence of two sets when there really is just one set.
> This patch changes the logs and comments to be more accurate about that.





[jira] [Commented] (HDFS-2714) HA: Fix test cases which use standalone FSNamesystems

2011-12-30 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177631#comment-13177631
 ] 

Hudson commented on HDFS-2714:
--

Integrated in Hadoop-Hdfs-HAbranch-build #32 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/32/])
HDFS-2714. Fix test cases which use standalone FSNamesystems. Contributed 
by Todd Lipcon.

todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1225708
Files : 
* /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-1623.txt
* /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
* /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLogRace.java


> HA: Fix test cases which use standalone FSNamesystems
> -
>
> Key: HDFS-2714
> URL: https://issues.apache.org/jira/browse/HDFS-2714
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, test
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Trivial
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2714.txt
>
>
> Several tests (eg TestEditLog, TestSaveNamespace) failed in the most recent 
> build with an NPE inside of FSNamesystem.checkOperation. These tests set up a 
> standalone FSN that isn't fully initialized. We just need to add a null check 
> to deal with this case in checkOperation.
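
The null check described above amounts to something like the following. This is a sketch only: CheckOperationSketch and the HaContext interface are hypothetical stand-ins, and the real FSNamesystem.checkOperation has a different signature and surrounding state.

```java
/** Hypothetical stand-in for the namesystem's operation check. */
final class CheckOperationSketch {
  /** Stand-in for whatever HA state object validates operations. */
  interface HaContext {
    void checkOperation(String opCategory);
  }

  // Null when the namesystem is constructed standalone, as some tests do.
  private final HaContext haContext;

  CheckOperationSketch(HaContext haContext) {
    this.haContext = haContext;
  }

  void checkOperation(String opCategory) {
    if (haContext == null) {
      return; // not fully initialized: nothing to check against
    }
    haContext.checkOperation(opCategory);
  }
}
```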





[jira] [Commented] (HDFS-2692) HA: Bugs related to failover from/into safe-mode

2011-12-30 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177630#comment-13177630
 ] 

Hudson commented on HDFS-2692:
--

Integrated in Hadoop-Hdfs-HAbranch-build #32 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/32/])
HDFS-2692. Fix bugs related to failover from/into safe mode. Contributed by 
Todd Lipcon.

todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1225709
Files : 
* /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-1623.txt
* /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
* /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/Checkpointer.java
* /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java
* /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
* /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/EditLogTailer.java
* /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/MiniDFSCluster.java
* /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestDNFencing.java
* /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestHASafeMode.java


> HA: Bugs related to failover from/into safe-mode
> 
>
> Key: HDFS-2692
> URL: https://issues.apache.org/jira/browse/HDFS-2692
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2692.txt, hdfs-2692.txt, hdfs-2692.txt
>
>
> In testing I saw an AssertionError come up several times when I was trying to 
> do failover between two NNs where one or the other was in safe-mode. Need to 
> write some unit tests to try to trigger this -- hunch is it has something to 
> do with the treatment of "safe block count" while tailing edits in safemode.
