[jira] [Commented] (HDDS-4308) Fix issue with quota update
[ https://issues.apache.org/jira/browse/HDDS-4308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215148#comment-17215148 ] Yiqun Lin commented on HDDS-4308: - {quote}This might not be complete i believe, If 2 threads acquire copy object and if they update outside lock we have issue again. I think the whole operation should be performed under volume lock. (As we update in-memory it should be quick) But i agree that it might have performance impact across buckets when key writes happen. {quote} Using the volume lock during bucket operations makes the logic a little complex. As the current PR change does: 1) acquire bucket lock 2) release bucket lock 3) acquire volume lock, update volume usedBytes usage 4) release volume lock 5) acquire bucket lock again (to finish the remaining operation) 6) release bucket lock. Can we just make the method OMKeyRequest#getVolumeInfo thread safe so that it returns a copied object? That should be okay for the current issue, and it keeps the logic simpler. Like: {code:java} public static synchronized OmVolumeArgs getVolumeInfo(OMMetadataManager omMetadataManager, String volume) { return omMetadataManager.getVolumeTable().getCacheValue( new CacheKey<>(omMetadataManager.getVolumeKey(volume))) .getCacheValue().copyObject(); } {code} > Fix issue with quota update > --- > > Key: HDDS-4308 > URL: https://issues.apache.org/jira/browse/HDDS-4308 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Reporter: Bharat Viswanadham >Assignee: mingchao zhao >Priority: Blocker > Labels: pull-request-available > > Currently volumeArgs uses getCacheValue and puts the same object in > doubleBuffer; this might cause an issue. > Let's take the below scenario: > InitialVolumeArgs quotaBytes -> 1 > 1. T1 -> Update VolumeArgs, subtracting 1000, and put this updated > volumeArgs into the DoubleBuffer. > 2. T2 -> Update VolumeArgs, subtracting 2000, and has not yet updated the > double buffer. 
> *Now at the end of flushing these transactions, our DB should have 7000 as > bytes used.* > Now T1 is picked by the double buffer and, when it commits, since it uses the cached > object put into the doubleBuffer, it flushes to the DB the value already updated by > T2 (as it is a cache object) and updates the DB with bytesUsed as 7000. > And now OM has restarted, and the DB only has transactions till T1. (We get this > info from the TransactionInfo > Table (https://issues.apache.org/jira/browse/HDDS-3685)) > Now T2 is replayed again; as it was not committed to the DB, the DB value is again > reduced by 2000, and the DB will now have 5000. > But after T2, the value should be 7000, so we have the DB in an incorrect state. > Issue here: > 1. As we use a cached object and put the same cached object into the double > buffer, this can cause this kind of issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org
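The two-transaction interleaving above can be sketched in a few lines of plain Java. This is a minimal, hypothetical model, not the real Ozone types: `VolumeArgs`, `run()`, and the initial usedBytes of 10000 are illustrative stand-ins (10000 is assumed so the numbers line up with the 7000 figure in the scenario). It shows how the flushed value differs depending on whether the double buffer holds the cache object itself or a defensive copy.

```java
public class CacheAliasingDemo {
    // Hypothetical stand-in for OmVolumeArgs: a mutable cached value.
    static class VolumeArgs {
        long usedBytes;
        VolumeArgs(long usedBytes) { this.usedBytes = usedBytes; }
        // Analogous to copyObject(): a defensive snapshot of the current state.
        VolumeArgs copyObject() { return new VolumeArgs(usedBytes); }
    }

    // Returns {valueFlushedViaAlias, valueFlushedViaCopy} after the T1/T2 interleaving.
    static long[] run() {
        VolumeArgs cached = new VolumeArgs(10000);   // assumed initial usedBytes

        cached.usedBytes -= 1000;                    // T1's update
        VolumeArgs aliased = cached;                 // buggy: T1 enqueues the cache object itself
        VolumeArgs copied = cached.copyObject();     // proposed fix: enqueue a copy

        cached.usedBytes -= 2000;                    // T2's update lands before T1 flushes

        // The aliased reference now already contains T2's change; the copy does not.
        return new long[] { aliased.usedBytes, copied.usedBytes };
    }

    public static void main(String[] args) {
        long[] r = run();
        System.out.println(r[0]); // 7000: T2's update leaked into T1's flush
        System.out.println(r[1]); // 9000: only T1's own update, as intended
    }
}
```

If T1 flushes 7000 and T2 is later replayed, the DB drops to 5000 instead of the correct 7000, which is exactly the inconsistency described above; the copy keeps T1's flush at 9000 so the replayed T2 lands on the right value.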
[jira] [Updated] (HDDS-4280) Document notable configurations for Recon
[ https://issues.apache.org/jira/browse/HDDS-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDDS-4280: Fix Version/s: 1.1.0 > Document notable configurations for Recon > -- > > Key: HDDS-4280 > URL: https://issues.apache.org/jira/browse/HDDS-4280 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: Ozone Recon >Affects Versions: 1.0.0 >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Minor > Labels: pull-request-available > Fix For: 1.1.0 > > > In the [Recon doc > link|https://hadoop.apache.org/ozone/docs/1.0.0/feature/recon.html], there is > no helpful description about how to quickly set up the Recon server. As Recon > is one major feature in the Ozone 1.0 version, we need to complete this document.
[jira] [Comment Edited] (HDDS-4285) Read is slow due to the frequent usage of UGI.getCurrentUserCall()
[ https://issues.apache.org/jira/browse/HDDS-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203321#comment-17203321 ] Yiqun Lin edited comment on HDDS-4285 at 9/28/20, 3:58 PM: --- Looking into this, I am thinking of two approaches for this: 1. Initialize UGI instance in ChunkInputStream (or other invoke places), then set UGI in XceiverClientSpi, extract UGI and get token string in ContainerProtocolCalls method. 2. Make UGI as a thread local field in ContainerProtocolCalls, and then reset ContainerProtocolCalls#UGI in ChunkInputStream or other places. #1 is a more generic approach, UGI stored in XceiverClientSpi can be reused in other places. was (Author: linyiqun): Looking into this, I am thinking of two approaches for this: 1. Initialize UGI instance in ChunkInputStream (or other invoke places), then set UGI in XceiverClientSpi, extract UGI and get token string in ContainerProtocolCalls method. 2. Make UGI as a thread local field in ContainerProtocolCalls, and then set UGI in ChunkInputStream or other similar places. #1 is a more generic approach, UGI stored in XceiverClientSpi can be reused in other places. > Read is slow due to the frequent usage of UGI.getCurrentUserCall() > -- > > Key: HDDS-4285 > URL: https://issues.apache.org/jira/browse/HDDS-4285 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Marton Elek >Assignee: Marton Elek >Priority: Major > Attachments: image-2020-09-28-16-19-17-581.png, > profile-20200928-161631-180518.svg > > > Ozone read operation turned out to be slow mainly because we do a new > UGI.getCurrentUser for block token for each of the calls. > We need to cache the block token / UGI.getCurrentUserCall() to make it faster. > !image-2020-09-28-16-19-17-581.png! 
> To reproduce: > Checkout: https://github.com/elek/hadoop-ozone/tree/mocked-read > {code} > cd hadoop-ozone/client > export > MAVEN_OPTS=-agentpath:/home/elek/prog/async-profiler/build/libasyncProfiler.so=start,file=/tmp/profile-%t-%p.svg > mvn compile exec:java > -Dexec.mainClass=org.apache.hadoop.ozone.client.io.TestKeyOutputStreamUnit > -Dexec.classpathScope=test > {code}
[jira] [Comment Edited] (HDDS-4285) Read is slow due to the frequent usage of UGI.getCurrentUserCall()
[ https://issues.apache.org/jira/browse/HDDS-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203321#comment-17203321 ] Yiqun Lin edited comment on HDDS-4285 at 9/28/20, 3:56 PM: --- Looking into this, I am thinking of two approaches for this: 1. Initialize UGI instance in ChunkInputStream (or other invoke places), then set UGI in XceiverClientSpi, extract UGI and get token string in ContainerProtocolCalls method. 2. Make UGI as a thread local field in ContainerProtocolCalls, and then set UGI in ChunkInputStream or other similar places. #1 is a more generic approach, UGI stored in XceiverClientSpi can be reused in other places. was (Author: linyiqun): Looking into this, I am thinking of two approaches for this: 1. Get UGI instance in ChunkInputStream (or other invoke places), then set UGI in XceiverClientSpi, extract UGI and get token string in ContainerProtocolCalls method. 2. Make UGI as a thread local field in ContainerProtocolCalls, and then set UGI in ChunkInputStream or other similar places. #1 is a more generic approach, UGI stored in XceiverClientSpi can be reused in other places. > Read is slow due to the frequent usage of UGI.getCurrentUserCall() > -- > > Key: HDDS-4285 > URL: https://issues.apache.org/jira/browse/HDDS-4285 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Marton Elek >Assignee: Marton Elek >Priority: Major > Attachments: image-2020-09-28-16-19-17-581.png, > profile-20200928-161631-180518.svg > > > Ozone read operation turned out to be slow mainly because we do a new > UGI.getCurrentUser for block token for each of the calls. > We need to cache the block token / UGI.getCurrentUserCall() to make it faster. > !image-2020-09-28-16-19-17-581.png! 
> To reproduce: > Checkout: https://github.com/elek/hadoop-ozone/tree/mocked-read > {code} > cd hadoop-ozone/client > export > MAVEN_OPTS=-agentpath:/home/elek/prog/async-profiler/build/libasyncProfiler.so=start,file=/tmp/profile-%t-%p.svg > mvn compile exec:java > -Dexec.mainClass=org.apache.hadoop.ozone.client.io.TestKeyOutputStreamUnit > -Dexec.classpathScope=test > {code}
[jira] [Commented] (HDDS-4285) Read is slow due to the frequent usage of UGI.getCurrentUserCall()
[ https://issues.apache.org/jira/browse/HDDS-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203321#comment-17203321 ] Yiqun Lin commented on HDDS-4285: - Looking into this, I am thinking of two approaches: 1. Get a UGI instance in ChunkInputStream (or other invoking places), then set the UGI in XceiverClientSpi, and extract the UGI and get the token string in the ContainerProtocolCalls method. 2. Make UGI a thread-local field in ContainerProtocolCalls, and then set the UGI in ChunkInputStream or other similar places. #1 is a more generic approach; the UGI stored in XceiverClientSpi can be reused in other places. > Read is slow due to the frequent usage of UGI.getCurrentUserCall() > -- > > Key: HDDS-4285 > URL: https://issues.apache.org/jira/browse/HDDS-4285 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Marton Elek >Assignee: Marton Elek >Priority: Major > Attachments: image-2020-09-28-16-19-17-581.png, > profile-20200928-161631-180518.svg > > > The Ozone read operation turned out to be slow mainly because we do a new > UGI.getCurrentUser call for the block token for each of the calls. > We need to cache the block token / UGI.getCurrentUserCall() to make it faster. > !image-2020-09-28-16-19-17-581.png! > To reproduce: > Checkout: https://github.com/elek/hadoop-ozone/tree/mocked-read > {code} > cd hadoop-ozone/client > export > MAVEN_OPTS=-agentpath:/home/elek/prog/async-profiler/build/libasyncProfiler.so=start,file=/tmp/profile-%t-%p.svg > mvn compile exec:java > -Dexec.mainClass=org.apache.hadoop.ozone.client.io.TestKeyOutputStreamUnit > -Dexec.classpathScope=test > {code}
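Approach #2 (a thread-local field) can be sketched generically. The class below is a hypothetical stand-in, not the real Hadoop or Ozone API: `currentUser()` models the expensive UGI.getCurrentUser() lookup, and a `ThreadLocal` memoizes its result so each reader thread pays the cost only once.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadLocalUserCache {
    // Counts how often the expensive lookup actually runs (for demonstration).
    static final AtomicInteger lookups = new AtomicInteger();

    // Hypothetical stand-in for the costly UGI.getCurrentUser() call.
    static String currentUser() {
        lookups.incrementAndGet();
        return System.getProperty("user.name", "unknown");
    }

    // Each thread resolves the user once and reuses the cached value afterwards.
    private static final ThreadLocal<String> CACHED_USER =
        ThreadLocal.withInitial(ThreadLocalUserCache::currentUser);

    public static String user() {
        return CACHED_USER.get();
    }
}
```

Within one thread, repeated user() calls trigger a single lookup. The trade-off implied by the comment is that the thread-local must be reset (or re-set by the caller) whenever the effective user can change, which is why approach #1, carrying the UGI on the client object, is called the more generic option.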
[jira] [Commented] (HDDS-4283) Remove unsupported upgrade command in ozone cli
[ https://issues.apache.org/jira/browse/HDDS-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203145#comment-17203145 ] Yiqun Lin commented on HDDS-4283: - Thanks [~adoroszlai] for the reference, closed this JIRA. > Remove unsupported upgrade command in ozone cli > --- > > Key: HDDS-4283 > URL: https://issues.apache.org/jira/browse/HDDS-4283 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Minor > > In HDDS-1383, we introduced a new upgrade command to support in-place > upgrade from HDFS to Ozone. > {noformat} > upgrade HDFS to Ozone in-place upgrade tool > > Usage: ozone upgrade [-hV] [--verbose] [-conf=] > [-D=]... [COMMAND] > Convert raw HDFS data to Ozone data without data movement. > --verbose More verbose output. Show the stack trace of the errors. > -conf= > -D, --set= > -h, --help Show this help message and exit. > -V, --version Print version information and exit. > Commands: > plan Plan existing HDFS block distribution and give.estimation. > balance Move the HDFS blocks for a better distribution usage. > execute Start/restart upgrade from HDFS to Ozone cluster. > [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade plan > [In-Place upgrade : plan] is not yet supported. > [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade balance > [In-Place upgrade : balance] is not yet supported. > [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade execute > In-Place upgrade : execute] is not yet supported > {noformat} > But this feature has not been implemented yet and is a very big feature. > I don't think it's good to expose a cli command that is not supported and > that meanwhile cannot be quickly implemented in the short term.
[jira] [Resolved] (HDDS-4283) Remove unsupported upgrade command in ozone cli
[ https://issues.apache.org/jira/browse/HDDS-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin resolved HDDS-4283. - Resolution: Duplicate > Remove unsupported upgrade command in ozone cli > --- > > Key: HDDS-4283 > URL: https://issues.apache.org/jira/browse/HDDS-4283 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Minor > > In HDDS-1383, we introduce a new upgrade command for supporting to in-place > upgrade from HDFS to Ozone. > {noformat} > upgrade HDFS to Ozone in-place upgrade tool > > Usage: ozone upgrade [-hV] [--verbose] [-conf=] > [-D=]... [COMMAND] > Convert raw HDFS data to Ozone data without data movement. > --verbose More verbose output. Show the stack trace of the errors. > -conf= > -D, --set= > -h, --help Show this help message and exit. > -V, --version Print version information and exit. > Commands: > plan Plan existing HDFS block distribution and give.estimation. > balance Move the HDFS blocks for a better distribution usage. > execute Start/restart upgrade from HDFS to Ozone cluster. > [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade plan > [In-Place upgrade : plan] is not yet supported. > [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade balance > [In-Place upgrade : balance] is not yet supported. > [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade execute > In-Place upgrade : execute] is not yet supported > {noformat} > But this feature has not been implemented yet and is a very big feature. > I don't think it's good to expose a cli command that is not supported and > meanwhile that cannot be quickly implemented in the short term. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org
[jira] [Updated] (HDDS-4283) Remove unsupported upgrade command in ozone cli
[ https://issues.apache.org/jira/browse/HDDS-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDDS-4283: Description: In HDDS-1383, we introduce a new upgrade command for supporting to in-place upgrade from HDFS to Ozone. {noformat} upgrade HDFS to Ozone in-place upgrade tool Usage: ozone upgrade [-hV] [--verbose] [-conf=] [-D=]... [COMMAND] Convert raw HDFS data to Ozone data without data movement. --verbose More verbose output. Show the stack trace of the errors. -conf= -D, --set= -h, --help Show this help message and exit. -V, --version Print version information and exit. Commands: plan Plan existing HDFS block distribution and give.estimation. balance Move the HDFS blocks for a better distribution usage. execute Start/restart upgrade from HDFS to Ozone cluster. [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade plan [In-Place upgrade : plan] is not yet supported. [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade balance [In-Place upgrade : balance] is not yet supported. [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade execute In-Place upgrade : execute] is not yet supported {noformat} But this feature has not been implemented yet and is a very big feature. I don't think it's good to expose a cli command that is not supported and meanwhile that cannot be quickly implemented in the short term. was: In HDDS-1383, we introduce a new upgrade command for supporting to in-place upgrade from HDFS to Ozone. {noformat} upgrade HDFS to Ozone in-place upgrade tool Usage: ozone upgrade [-hV] [--verbose] [-conf=] [-D=]... [COMMAND] Convert raw HDFS data to Ozone data without data movement. --verbose More verbose output. Show the stack trace of the errors. -conf= -D, --set= -h, --help Show this help message and exit. -V, --version Print version information and exit. Commands: plan Plan existing HDFS block distribution and give.estimation. balance Move the HDFS blocks for a better distribution usage. execute Start/restart upgrade from HDFS to Ozone cluster. 
[hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade plan [In-Place upgrade : plan] is not yet supported. [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade balance [In-Place upgrade : balance] is not yet supported. [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade execute In-Place upgrade : execute] is not yet supported {noformat} But this feature has not been implemented yet and is a very big feature. I don't think it's good to expose a cli command that is not supported and meanwhile cannot be quickly implemented in the short term. > Remove unsupported upgrade command in ozone cli > --- > > Key: HDDS-4283 > URL: https://issues.apache.org/jira/browse/HDDS-4283 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Minor > > In HDDS-1383, we introduce a new upgrade command for supporting to in-place > upgrade from HDFS to Ozone. > {noformat} > upgrade HDFS to Ozone in-place upgrade tool > > Usage: ozone upgrade [-hV] [--verbose] [-conf=] > [-D=]... [COMMAND] > Convert raw HDFS data to Ozone data without data movement. > --verbose More verbose output. Show the stack trace of the errors. > -conf= > -D, --set= > -h, --help Show this help message and exit. > -V, --version Print version information and exit. > Commands: > plan Plan existing HDFS block distribution and give.estimation. > balance Move the HDFS blocks for a better distribution usage. > execute Start/restart upgrade from HDFS to Ozone cluster. > [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade plan > [In-Place upgrade : plan] is not yet supported. > [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade balance > [In-Place upgrade : balance] is not yet supported. > [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade execute > In-Place upgrade : execute] is not yet supported > {noformat} > But this feature has not been implemented yet and is a very big feature. > I don't think it's good to expose a cli command that is not supported and > meanwhile that cannot be quickly implemented in the short term. 
[jira] [Created] (HDDS-4283) Remove unsupported upgrade command in ozone cli
Yiqun Lin created HDDS-4283: --- Summary: Remove unsupported upgrade command in ozone cli Key: HDDS-4283 URL: https://issues.apache.org/jira/browse/HDDS-4283 Project: Hadoop Distributed Data Store Issue Type: Bug Reporter: Yiqun Lin Assignee: Yiqun Lin In HDDS-1383, we introduce a new upgrade command for supporting to in-place upgrade from HDFS to Ozone. {noformat} upgrade HDFS to Ozone in-place upgrade tool Usage: ozone upgrade [-hV] [--verbose] [-conf=] [-D=]... [COMMAND] Convert raw HDFS data to Ozone data without data movement. --verbose More verbose output. Show the stack trace of the errors. -conf= -D, --set= -h, --help Show this help message and exit. -V, --version Print version information and exit. Commands: plan Plan existing HDFS block distribution and give.estimation. balance Move the HDFS blocks for a better distribution usage. execute Start/restart upgrade from HDFS to Ozone cluster. [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade plan [In-Place upgrade : plan] is not yet supported. [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade balance [In-Place upgrade : balance] is not yet supported. [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade execute In-Place upgrade : execute] is not yet supported {noformat} But this feature has not been implemented yet and is a very big feature. I don't think it's good to expose a cli command that is not supported and meanwhile cannot be quickly implemented in the short term. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org
[jira] [Updated] (HDDS-4280) Document notable configurations for Recon
[ https://issues.apache.org/jira/browse/HDDS-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDDS-4280: Status: Patch Available (was: Open) > Document notable configurations for Recon > -- > > Key: HDDS-4280 > URL: https://issues.apache.org/jira/browse/HDDS-4280 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: Ozone Recon >Affects Versions: 1.0.0 >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Minor > Labels: pull-request-available > > In the [Recon doc > link|https://hadoop.apache.org/ozone/docs/1.0.0/feature/recon.html], there is > no helpful description about how to quickly set up the Recon server. As Recon > is one major feature in the Ozone 1.0 version, we need to complete this document.
[jira] [Updated] (HDDS-4280) Document notable configurations for Recon
[ https://issues.apache.org/jira/browse/HDDS-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDDS-4280: Summary: Document notable configurations for Recon (was: Document notable configuration for Recon ) > Document notable configurations for Recon > -- > > Key: HDDS-4280 > URL: https://issues.apache.org/jira/browse/HDDS-4280 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: Ozone Recon >Affects Versions: 1.0.0 >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Minor > Labels: pull-request-available > > In the [Recon doc > link|https://hadoop.apache.org/ozone/docs/1.0.0/feature/recon.html], there is > no helpful description about how to quickly set up the Recon server. As Recon > is one major feature in the Ozone 1.0 version, we need to complete this document.
[jira] [Created] (HDDS-4280) Document notable configuration for Recon
Yiqun Lin created HDDS-4280: --- Summary: Document notable configuration for Recon Key: HDDS-4280 URL: https://issues.apache.org/jira/browse/HDDS-4280 Project: Hadoop Distributed Data Store Issue Type: Improvement Components: Ozone Recon Affects Versions: 1.0.0 Reporter: Yiqun Lin Assignee: Yiqun Lin In the [Recon doc link|https://hadoop.apache.org/ozone/docs/1.0.0/feature/recon.html], there is no helpful description about how to quickly set up the Recon server. As Recon is one major feature in the Ozone 1.0 version, we need to complete this document.
[jira] [Created] (HDDS-4267) Ozone command always prints a warning message before execution
Yiqun Lin created HDDS-4267: --- Summary: Ozone command always prints a warning message before execution Key: HDDS-4267 URL: https://issues.apache.org/jira/browse/HDDS-4267 Project: Hadoop Distributed Data Store Issue Type: Improvement Components: Ozone CLI Reporter: Yiqun Lin Ozone commands always print a warning message before execution: {noformat} [hdfs@lyq yiqlin]$ ~/ozone/bin/ozone version /home/hdfs/releases/ozone-1.0.0/etc/hadoop/hadoop-env.sh: line 34: ulimit: core file size: cannot modify limit: Operation not permitted {noformat} {noformat} [hdfs@ yiqlin]$ ~/ozone/bin/ozone sh volume list /home/hdfs/releases/ozone-1.0.0/etc/hadoop/hadoop-env.sh: line 34: ulimit: core file size: cannot modify limit: Operation not permitted {noformat} This is because the hdfs user in my cluster cannot execute the below command in hadoop-env.sh: {noformat} # # Enable core dump when crash in C++ ulimit -c unlimited {noformat} ulimit -c was introduced in HDDS-3941. The root cause seems to be that ulimit -c requires a root user to execute, but the hdfs user in my environment is a non-root user.
[jira] [Commented] (HDDS-4222) [OzoneFS optimization] Provide a mechanism for efficient path lookup
[ https://issues.apache.org/jira/browse/HDDS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194239#comment-17194239 ] Yiqun Lin commented on HDDS-4222: - [~rakeshr], thanks for the explanation; it's very clearly explained. I'm also +1 on letting KeyDeletingService help remove deleted cache entries here. > [OzoneFS optimization] Provide a mechanism for efficient path lookup > > > Key: HDDS-4222 > URL: https://issues.apache.org/jira/browse/HDDS-4222 > Project: Hadoop Distributed Data Store > Issue Type: New Feature >Reporter: Rakesh Radhakrishnan >Assignee: Rakesh Radhakrishnan >Priority: Major > Attachments: Ozone FS Optimizations - Efficient Lookup using cache.pdf > > > With the new HDDS-2939 file-system-like semantics design, it requires multiple > DB lookups to traverse the path components in top-down fashion. This task is to > discuss use cases and proposals to reduce the performance penalties during > path lookups.
[jira] [Comment Edited] (HDDS-4222) [OzoneFS optimization] Provide a mechanism for efficient path lookup
[ https://issues.apache.org/jira/browse/HDDS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192947#comment-17192947 ] Yiqun Lin edited comment on HDDS-4222 at 9/9/20, 3:29 PM: -- Thanks for attaching the dir cache design, [~rakeshr]! I agree most of the current design details. >For the consistency part, this is a very good point and will take care during >the implementation phase. I was thinking to update the cache during write and >read paths to avoid additional cache refresh cycle. I'm +1 for this way as current initial implementation. >Rename and Delete ops will require only one entry update as it maintains >similar structure in the DB Directory Table. Delete ops is also not friendly for the Approach-3. Example: *DirTable:* |CacheKey(PathElement)|{color:#ff8b00}*ObjectID*{color}| |512/a|1025| |1025/b|1026| |1026/c|1027| |1027/d|1028| |1025/e|1029| If we delete dir 512/a, it should lookup the whole dir cache and find the key which parent objectID is 1025 and then be deleted. So delete ops here seems still the very expensive ops. >Delete ops will require only one entry update If we use the sync delete way, it will not update for only one entry (as explained in above example). So is this mean the async delete way here, like bucket key deletion mechanism? 1) Mark the delete key and let it not be accessed(e.g. add prefix in key) 2) Async to remove these keys that needed to be deleted. was (Author: linyiqun): Thanks for attaching the dir cache design, [~rakeshr]! I agree most of the current design details. >For the consistency part, this is a very good point and will take care during >the implementation phase. I was thinking to update the cache during write and >read paths to avoid additional cache refresh cycle. I'm +1 for this way as current initial implementation. >Rename and Delete ops will require only one entry update as it maintains >similar structure in the DB Directory Table. Delete ops is also not friendly for the Approach-3. 
Example: *DirTable:* |CacheKey(PathElement)|{color:#ff8b00}*ObjectID*{color}| |512/a|1025| |1025/b|1026| |1026/c|1027| |1027/d|1028| |1025/e|1029| If we delete dir 512/a, it should lookup the whole dir cache and find the key which parent objectID is 1025 and then be deleted. So delete ops here seems still the very expensive ops. >Delete ops will require only one entry update If we use the sync delete way, it will not update for only one entry (as explained in above example). So is this mean the async delete way here, like bucket key deletion mechanism? 1) Mark the delete key,(add prefix in key to let it not be accessed) 2) Async to remove these keys that needed to be deleted. > [OzoneFS optimization] Provide a mechanism for efficient path lookup > > > Key: HDDS-4222 > URL: https://issues.apache.org/jira/browse/HDDS-4222 > Project: Hadoop Distributed Data Store > Issue Type: New Feature >Reporter: Rakesh Radhakrishnan >Assignee: Rakesh Radhakrishnan >Priority: Major > Attachments: Ozone FS Optimizations - Efficient Lookup using cache.pdf > > > With the new file system HDDS-2939 like semantics design it requires multiple > DB lookups to traverse the path component in top-down fashion. This task to > discuss use cases and proposals to reduce the performance penalties during > path lookups. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org
[jira] [Commented] (HDDS-4222) [OzoneFS optimization] Provide a mechanism for efficient path lookup
[ https://issues.apache.org/jira/browse/HDDS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192947#comment-17192947 ] Yiqun Lin commented on HDDS-4222: - Thanks for attaching the dir cache design, [~rakeshr]! I agree with most of the current design details. >For the consistency part, this is a very good point and will take care during >the implementation phase. I was thinking to update the cache during write and >read paths to avoid additional cache refresh cycle. I'm +1 for this way as the current initial implementation. >Rename and Delete ops will require only one entry update as it maintains >similar structure in the DB Directory Table. Delete ops are also not friendly to Approach-3. Example: *DirTable:* |CacheKey(PathElement)|{color:#ff8b00}*ObjectID*{color}| |512/a|1025| |1025/b|1026| |1026/c|1027| |1027/d|1028| |1025/e|1029| If we delete dir 512/a, we must look up the whole dir cache to find the keys whose parent objectID is 1025 and then delete them. So the delete op here still seems very expensive. >Delete ops will require only one entry update If we use the sync delete way, it will not update only one entry (as explained in the above example). So does this mean the async delete way here, like the bucket key deletion mechanism? 1) Mark the deleted key (add a prefix to the key so it cannot be accessed). 2) Asynchronously remove the keys that need to be deleted. > [OzoneFS optimization] Provide a mechanism for efficient path lookup > > > Key: HDDS-4222 > URL: https://issues.apache.org/jira/browse/HDDS-4222 > Project: Hadoop Distributed Data Store > Issue Type: New Feature >Reporter: Rakesh Radhakrishnan >Assignee: Rakesh Radhakrishnan >Priority: Major > Attachments: Ozone FS Optimizations - Efficient Lookup using cache.pdf > > > With the new HDDS-2939 file-system-like semantics design, it requires multiple > DB lookups to traverse the path components in top-down fashion. 
This task is to > discuss use cases and proposals to reduce the performance penalties during > path lookups.
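The cost of deleting 512/a in that example can be made concrete. The sketch below is a hypothetical helper over a plain map, not Ozone code: with a flat "parentObjectID/name" -> objectID cache, removing a directory forces repeated scans of the entire cache to discover descendants, which is exactly the expensive step the comment points out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.Map;

public class DirCacheDeleteDemo {
    // Removes 'key' plus every cached descendant; returns how many entries were removed.
    static int deleteSubtree(Map<String, Long> cache, String key) {
        Long rootId = cache.remove(key);
        if (rootId == null) {
            return 0;
        }
        int removed = 1;
        Deque<Long> pending = new ArrayDeque<>();
        pending.push(rootId);
        while (!pending.isEmpty()) {
            String prefix = pending.pop() + "/";
            // Full scan of the cache to find the children of this objectID:
            // the expensive part of a sync delete under this cache layout.
            Iterator<Map.Entry<String, Long>> it = cache.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Long> e = it.next();
                if (e.getKey().startsWith(prefix)) {
                    pending.push(e.getValue()); // its children must be scanned for too
                    it.remove();
                    removed++;
                }
            }
        }
        return removed;
    }
}
```

With the five entries from the DirTable example (512/a -> 1025, 1025/b -> 1026, 1026/c -> 1027, 1027/d -> 1028, 1025/e -> 1029), deleting 512/a removes all five but touches the whole cache once per directory level, motivating the async marking approach discussed in the comment.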
[jira] [Comment Edited] (HDDS-2939) Ozone FS namespace
[ https://issues.apache.org/jira/browse/HDDS-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190253#comment-17190253 ] Yiqun Lin edited comment on HDDS-2939 at 9/3/20, 4:19 PM: -- Some discussion about the latest status of HDDS-2939 that I asked about in the mailing list. From [~rakeshr]: {quote}Presently, I am working on the directory cache design and upgrade design. These two tasks are very important as the first one would help to *reduce the performance penalties on the path traversal*. Later one is to provide an efficient way to make a smooth upgrade experience to the users. {quote} Here the directory cache is used to avoid the additional lookup overhead. The latest design of the directory cache hasn't been attached, but some thoughts from me. Two types of mapping cache would be useful, I think:
* <path, objectID>, so that we can skip the traversal search from the dir table to the key table.
* <parent objectID, list of children>, used for the listStatus scenario; a list-files call can be a very expensive call under the Ozone FS namespace.
The cache introduced here can speed up metadata access, but there are also two aspects we need to consider:
* The cache entry eviction policy: we cannot cache all the dir/file entries.
* Consistency between the dir cache and the underlying store. A cache entry becomes stale when the db store is updated but the corresponding cache entry is not synced.
A cache refresh interval can be introduced here: only when a cache entry has not been updated for more than the given refresh interval do we update it by querying the db table. Users can set different refresh intervals to ensure cache freshness based on their scenarios. They can also disable this cache by setting the interval to 0, which means each query goes directly to the db. The current OM table cache does not seem very helpful for the dir cache, so I came up with the above thoughts. 
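A minimal sketch of a dir cache covering both concerns raised in the comment above, LRU eviction plus a refresh interval. The class names, capacity, and millisecond clock handling are assumptions of mine, not from any attached design:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Sketch: LRU dir cache whose entries expire after refreshIntervalMs and
 *  are then re-read from the backing table. Interval 0 disables caching. */
public class DirCache {
  static final class Entry {
    final long objectId;
    final long loadedAtMs;
    Entry(long objectId, long loadedAtMs) {
      this.objectId = objectId;
      this.loadedAtMs = loadedAtMs;
    }
  }

  private final int maxEntries;
  private final long refreshIntervalMs;
  private final LinkedHashMap<String, Entry> cache;

  DirCache(int maxEntries, long refreshIntervalMs) {
    this.maxEntries = maxEntries;
    this.refreshIntervalMs = refreshIntervalMs;
    // An access-ordered LinkedHashMap gives us LRU eviction for free.
    this.cache = new LinkedHashMap<String, Entry>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, Entry> eldest) {
        return size() > DirCache.this.maxEntries;
      }
    };
  }

  /** Returns the objectID for path, consulting the db loader when the entry
   *  is missing, stale, or when caching is disabled (interval == 0). */
  long getObjectId(String path, Function<String, Long> dbLoader, long nowMs) {
    if (refreshIntervalMs == 0) {
      return dbLoader.apply(path); // cache disabled: always hit the db
    }
    Entry e = cache.get(path);
    if (e == null || nowMs - e.loadedAtMs > refreshIntervalMs) {
      e = new Entry(dbLoader.apply(path), nowMs); // stale or missing: refresh
      cache.put(path, e);
    }
    return e.objectId;
  }
}
```

Staleness is bounded by the refresh interval rather than eliminated, which matches the trade-off described in the comment: a longer interval means fewer db reads but potentially older answers.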
> Ozone FS namespace > -- > > Key: HDDS-2939 > URL: https://issues.apache.org/jira/browse/HDDS-2939 > Project: Hadoop Distributed Data Store > Issue Type: New Feature > Components: Ozone Manager >Reporter: Supratim Deka >Assignee: Rakesh Radhakrishnan >Priority: Major > Labels: Triaged > Attachments: Ozone FS Namespace Proposal v1.0.docx > > > Create the structures and metadata layout required to support efficient FS > namespace operations in Ozone - operations involving folders/directories > required to support the Hadoop compatible Filesystem interface. > The details are described in the attached document. The work is divided up > into sub-tasks as per the task list in the document.
[jira] [Created] (HDDS-4166) Documentation index page redirects to the wrong address
Yiqun Lin created HDDS-4166: --- Summary: Documentation index page redirects to the wrong address Key: HDDS-4166 URL: https://issues.apache.org/jira/browse/HDDS-4166 Project: Hadoop Distributed Data Store Issue Type: Bug Components: documentation Reporter: Yiqun Lin Attachments: image-2020-08-29-10-35-34-633.png While reading the Chinese doc of Ozone introduced in HDDS-2708, I found a page error: the index page redirects to a wrong page. The steps: 1. Open [https://ci-hadoop.apache.org/view/Hadoop%20Ozone/job/ozone-doc-master/lastSuccessfulBuild/artifact/hadoop-hdds/docs/public/index.html] 2. Click the page switch button. 3. The wrong page we are redirected to: [https://ci-hadoop.apache.org/view/Hadoop%20Ozone/job/ozone-doc-master/lastSuccessfulBuild/artifact/hadoop-hdds/docs/public/zh/] !image-2020-08-29-10-35-34-633.png! The 'index.html' is missing at the end of the address; the expected address is [https://ci-hadoop.apache.org/view/Hadoop%20Ozone/job/ozone-doc-master/lastSuccessfulBuild/artifact/hadoop-hdds/docs/public/zh/index.html] The same error happens when I switch the index page from the Chinese doc back to the English doc.
[jira] [Commented] (HDDS-2939) Ozone FS namespace
[ https://issues.apache.org/jira/browse/HDDS-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17167289#comment-17167289 ] Yiqun Lin commented on HDDS-2939: - {quote} Along with this feature, one task is planned to provide a migration/conversion tool to migrate existing 'KeyTable' data content into the new data format, ofcourse we need to add a key generation logic here. {quote} Yes, makes sense to me.
[jira] [Comment Edited] (HDDS-2939) Ozone FS namespace
[ https://issues.apache.org/jira/browse/HDDS-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165155#comment-17165155 ] Yiqun Lin edited comment on HDDS-2939 at 7/26/20, 4:06 AM: --- Hi [~rakeshr], I took a quick look at the first PR, PR-1230. It reuses the PrefixTable to store directory-like info. I have a question: will this not break the original key lookup behavior? We don't have an object id assigned for keys in the prefix table, but the current logic assumes that each entry key has its own objectId in the table. Or maybe there will be a step to generate this object id first when users enable this feature on their existing Ozone system.
[jira] [Commented] (HDDS-3816) Erasure Coding in Apache Hadoop Ozone
[ https://issues.apache.org/jira/browse/HDDS-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17156177#comment-17156177 ] Yiqun Lin commented on HDDS-3816: - Sorry for the delayed response. Thanks [~umamaheswararao] for the detailed explanation, :). > Erasure Coding in Apache Hadoop Ozone > - > > Key: HDDS-3816 > URL: https://issues.apache.org/jira/browse/HDDS-3816 > Project: Hadoop Distributed Data Store > Issue Type: New Feature > Components: SCM >Reporter: Uma Maheswara Rao G >Priority: Major > Attachments: Erasure Coding in Apache Hadoop Ozone.pdf > > > We propose to implement Erasure Coding in Apache Hadoop Ozone to provide > efficient storage. With EC in place, Ozone can provide same or better > tolerance by giving 50% or more storage space savings. > In HDFS project, we already have native codecs(ISAL) and Java codecs > implemented, we can leverage the same or similar codec design. > However, the critical part of EC data layout design is in-progress, we will > post the design doc soon.
[jira] [Commented] (HDDS-2939) Ozone FS namespace
[ https://issues.apache.org/jira/browse/HDDS-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149781#comment-17149781 ] Yiqun Lin commented on HDDS-2939: - Thanks for the detailed explanation, [~maobaolong]. I also agree that a caching tier can speed up metadata access operations in Ozone.
[jira] [Comment Edited] (HDDS-2939) Ozone FS namespace
[ https://issues.apache.org/jira/browse/HDDS-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147817#comment-17147817 ] Yiqun Lin edited comment on HDDS-2939 at 6/29/20, 2:43 PM: --- Hi [~maobaolong], having looked into the approach used in Alluxio, I am curious about one thing: in Alluxio 2.0, does it still keep all metadata in its master service? I see the article [https://dzone.com/articles/store-1-billion-files-in-alluxio-20] that describes how to store metadata for 1 billion files in Alluxio. {quote}The metadata service in Alluxio 2.0 is designed to support at least 1 billion files with a significantly reduced memory requirement. To achieve this, we added support for storing part of the namespace off-heap by RocksDB on disk. Recently-accessed file system metadata is stored in memory,... {quote} From my understanding of this, only part of the metadata is stored on disk, and meanwhile memory caches recently-accessed data, so does it really store all metadata in its master service? Does it match the case of Ozone FS? In Ozone FS, we will store all metadata. Or can I understand it as: Alluxio 2.0 maintains only active metadata instead of the whole metadata, and this active metadata can be updated (activated/deactivated) by users' file access behaviors, so it can support billion-level metadata? BTW, caching only hot metadata in memory is a good point that Ozone FS can also benefit from.
[jira] [Commented] (HDDS-3816) Erasure Coding in Apache Hadoop Ozone
[ https://issues.apache.org/jira/browse/HDDS-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17143863#comment-17143863 ] Yiqun Lin commented on HDDS-3816: - Hi [~umamaheswararao], the design doc looks great. I went through the whole design today; some comments from me. The design doc introduces Container level and Block level EC implementations and their corresponding advantages/disadvantages, but it doesn't mention which one is the final choice. Or does that mean we want to implement both of them and let users choose which way they prefer? The Container level option will be easier to implement than the Block level option. But as the design doc also mentions, this option has more impact: for example, the delete operation impact (we additionally need to implement small-container merge), the data recovery cost, and a higher risk of data loss when some node crashes. In my personal opinion, the Block level option is a more complete and robust implementation. What do we think about this? For the read/write performance comparison, Block level EC will have better performance. The block is split across multiple nodes as striped storage, so we can read/write the data in parallel. At the Container level, the block data structure inside one Container is actually unchanged; it still stays contiguous and only takes a striped form at the Container level. So the read/write rate is essentially unchanged under Container level EC: we still need to find one specific Container node to read/write a specific block's data. What's the implementation complexity of these two options? For example, can we integrate the current HDFS EC algorithm implementation into Ozone as-is? In order to support EC, will there be a large refactor of the current read/write implementation? I see the current EC design depends on the abstraction of the storage-class implementation. I'm not sure this is an easy thing to do at the beginning of the Ozone EC implementation. 
The storage-class implementation is also a large feature, I think: we define data storage types, policies, and multiple rules to let the system do the data transformation automatically and transparently. This is similar to the HDFS SSM (smart storage management) feature design in HDFS-7343. I don't mean to disagree with storage-class, but I have a concern about making it the one thing we must implement first. Please correct me if I am wrong, thanks.
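To illustrate why a striped (Block level) layout allows parallel reads and writes across nodes, here is a minimal sketch of the offset math used by striped layouts in general. The cell size and data-unit count are made-up example values, not figures from the design doc:

```java
/** Minimal sketch of striped-layout math, assuming a layout with 3 data
 *  units and a 1 MiB cell size (illustrative values only). Cells are laid
 *  out round-robin, so consecutive cells live on different nodes and can
 *  be read or written in parallel. */
public class StripeMath {
  static final long CELL_SIZE = 1024 * 1024; // 1 MiB per cell (assumed)
  static final int DATA_UNITS = 3;           // data blocks per stripe (assumed)

  /** Index of the data node that stores the cell containing logicalOffset. */
  static int nodeIndex(long logicalOffset) {
    long cell = logicalOffset / CELL_SIZE;   // which cell of the file this is
    return (int) (cell % DATA_UNITS);        // cells round-robin across nodes
  }

  /** Byte offset of that position within the node's internal block. */
  static long offsetInBlock(long logicalOffset) {
    long cell = logicalOffset / CELL_SIZE;
    return (cell / DATA_UNITS) * CELL_SIZE + logicalOffset % CELL_SIZE;
  }
}
```

Under a contiguous (Container level) layout, by contrast, every offset of a block maps to the same node, which is why the per-block read/write rate stays unchanged there.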
[jira] [Commented] (HDDS-3755) Storage-class support for Ozone
[ https://issues.apache.org/jira/browse/HDDS-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135179#comment-17135179 ] Yiqun Lin commented on HDDS-3755: - A very interesting design! The storage-class abstraction makes Ozone data storage smarter. One comment on this: {quote}Transfer rule: We can define some rule to describe when(condition or timer) invoke a convert action. For example, convert files from a storage-class to another storage-class when files lives 7 days. {quote} For the details of the transfer rule between different storage-classes, I think we can have a rule store and an independent service, called the smart storage manager. This service can manage the storage policy rules passed in by admin users. The smart storage service will read the given transfer rules and send SCM requests to do the corresponding actions. > Storage-class support for Ozone > --- > > Key: HDDS-3755 > URL: https://issues.apache.org/jira/browse/HDDS-3755 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Marton Elek >Assignee: Marton Elek >Priority: Major > > Use a storage-class as an abstraction which combines replication > configuration, container states and transitions. > See this thread for the detailed design doc: > > [https://lists.apache.org/thread.html/r1e2a5d5581abe9dd09834305ca65a6807f37bd229a07b8b31bda32ad%40%3Cozone-dev.hadoop.apache.org%3E] > which is also uploaded to here: > https://hackmd.io/4kxufJBOQNaKn7PKFK_6OQ?edit
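The age-based transfer rule quoted in the comment above could be modeled by such a hypothetical smart storage manager roughly like this; all class and field names are assumptions, not from the proposal:

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical sketch of an age-based storage-class transfer rule: files of
 *  fromClass older than minAge become candidates for conversion to toClass. */
public class TransferRule {
  final String fromClass;
  final String toClass;
  final Duration minAge; // e.g. 7 days before conversion

  TransferRule(String fromClass, String toClass, Duration minAge) {
    this.fromClass = fromClass;
    this.toClass = toClass;
    this.minAge = minAge;
  }

  /** True if a file of the given class and creation time should be converted. */
  boolean matches(String storageClass, Instant createdAt, Instant now) {
    return fromClass.equals(storageClass)
        && Duration.between(createdAt, now).compareTo(minAge) >= 0;
  }
}
```

A background service could periodically evaluate each stored rule against file metadata and, for every match, issue the corresponding conversion request to SCM.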
[jira] [Commented] (HDDS-3698) Ozone Non-Rolling upgrades.
[ https://issues.apache.org/jira/browse/HDDS-3698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129495#comment-17129495 ] Yiqun Lin commented on HDDS-3698: - Hi [~avijayan], a very good design for this. The design talks about many upgrade challenges in the different components, but I am interested in the details of what happens when an upgrade fails and we have to downgrade. Will we ensure 100% consistency compared with the state before the downgrade was triggered? I see we introduce the new table TransactionInfoTable and let new transactions be written into this table. So from my understanding, this is a temporary table to store transactions during the upgrade procedure (not finalized). Since new-feature operations are not allowed before finalization, this table won't store incompatible transaction types. Then, during a downgrade, OM won't throw errors when replaying the transaction table, right? To protect our data before the cluster is finalized, another thing I am thinking is that we had better not directly delete file data. We can move deleted containers into a trash directory on the Datanode. If a downgrade happens, we restore the trash data back; if finalized, we delete it. This is the way HDFS currently handles its upgrades, and it also makes sense for Ozone upgrades. > Ozone Non-Rolling upgrades. > --- > > Key: HDDS-3698 > URL: https://issues.apache.org/jira/browse/HDDS-3698 > Project: Hadoop Distributed Data Store > Issue Type: New Feature >Reporter: Aravindan Vijayan >Assignee: Aravindan Vijayan >Priority: Major > Attachments: Ozone Non-Rolling Upgrades.pdf > > > Support for Non-rolling upgrades in Ozone.
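The trash idea in the comment above (move instead of delete, restore on downgrade, purge on finalize) is essentially a rename. A hypothetical sketch, not the actual Datanode layout or API:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch: move a container dir to trash instead of deleting it,
 *  so a downgrade can restore it; finalization would then purge the trash. */
public class ContainerTrash {
  /** Moves containerDir under trashRoot, preserving its name. */
  static Path moveToTrash(Path containerDir, Path trashRoot) {
    try {
      Files.createDirectories(trashRoot);
      Path target = trashRoot.resolve(containerDir.getFileName());
      // Atomic move keeps the container recoverable until finalization.
      return Files.move(containerDir, target, StandardCopyOption.ATOMIC_MOVE);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Restores a trashed container back to its original parent directory. */
  static Path restore(Path trashed, Path originalParent) {
    try {
      return Files.move(trashed, originalParent.resolve(trashed.getFileName()),
          StandardCopyOption.ATOMIC_MOVE);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Self-contained demo: create a temp container dir, trash it, restore it. */
  static boolean demo() {
    try {
      Path tmp = Files.createTempDirectory("dn");
      Path container = Files.createDirectory(tmp.resolve("container-1"));
      Path trashed = moveToTrash(container, tmp.resolve("trash"));
      boolean gone = !Files.exists(container) && Files.exists(trashed);
      Path restored = restore(trashed, tmp);
      return gone && Files.exists(restored);
    } catch (IOException e) {
      return false;
    }
  }
}
```

Since a rename within the same filesystem is metadata-only, the "delete" stays cheap while the data remains recoverable until the cluster is finalized.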
[jira] [Comment Edited] (HDDS-2665) Implement new Ozone Filesystem scheme ofs://
[ https://issues.apache.org/jira/browse/HDDS-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099614#comment-17099614 ] Yiqun Lin edited comment on HDDS-2665 at 5/5/20, 7:49 AM: -- Hi [~smeng], I went through the implementation details of the new ofs scheme. It's very similar to the o3fs we currently implement, but should perform better than the single-bucket approach of o3fs. So I wonder: will we replace o3fs with ofs once the ofs scheme is fully implemented? Or will we keep both schemes and let users choose whichever they prefer? was (Author: linyiqun): Hi [~smeng], I go though the implementation details of new ofs scheme. It's very similar to o3fs we currently implement but will have a better performance than single bucket way in o3fs. So I wonder if we will replace o3fs to use ofs once ofs schema is fully implemented? Or both keeping these two schemas and letting users to choose which one they prefer to use? > Implement new Ozone Filesystem scheme ofs:// > > > Key: HDDS-2665 > URL: https://issues.apache.org/jira/browse/HDDS-2665 > Project: Hadoop Distributed Data Store > Issue Type: New Feature >Reporter: Siyao Meng >Assignee: Siyao Meng >Priority: Major > Attachments: Design ofs v1.pdf > > Time Spent: 40m > Remaining Estimate: 0h > > Implement a new scheme for Ozone Filesystem where all volumes (and buckets) > can be accessed from a single root. > Alias: Rooted Ozone Filesystem.
[jira] [Commented] (HDDS-2665) Implement new Ozone Filesystem scheme ofs://
[ https://issues.apache.org/jira/browse/HDDS-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099614#comment-17099614 ] Yiqun Lin commented on HDDS-2665: - Hi [~smeng], I went through the implementation details of the new ofs scheme. It's very similar to the o3fs we currently implement, but should perform better than the single-bucket approach of o3fs. So I wonder: will we replace o3fs with ofs once the ofs scheme is fully implemented? Or will we keep both schemes and let users choose whichever they prefer? > Implement new Ozone Filesystem scheme ofs:// > > > Key: HDDS-2665 > URL: https://issues.apache.org/jira/browse/HDDS-2665 > Project: Hadoop Distributed Data Store > Issue Type: New Feature >Reporter: Siyao Meng >Assignee: Siyao Meng >Priority: Major > Attachments: Design ofs v1.pdf > > Time Spent: 40m > Remaining Estimate: 0h > > Implement a new scheme for Ozone Filesystem where all volumes (and buckets) > can be accessed from a single root. > Alias: Rooted Ozone Filesystem.
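The rooted-namespace difference between the two schemes can be illustrated with a toy path parser (this is not the actual ofs implementation; the OfsPath class below is invented): under ofs:// the volume and bucket are the first two components of the path under a single root, whereas under o3fs:// they are fixed per filesystem instance by the authority.

```java
// Toy illustration of the ofs:// idea: volume and bucket are the first two
// path components under a single root, instead of being fixed per filesystem
// instance as in o3fs://bucket.volume/. Not the real Ozone parsing code.
public final class OfsPath {
  public final String volume;
  public final String bucket;
  public final String key;

  private OfsPath(String volume, String bucket, String key) {
    this.volume = volume;
    this.bucket = bucket;
    this.key = key;
  }

  public static OfsPath parse(String path) {
    // Expect "/volume/bucket/key/..." relative to the ofs root.
    String[] parts = path.replaceAll("^/+", "").split("/", 3);
    String vol = parts.length > 0 ? parts[0] : "";
    String bkt = parts.length > 1 ? parts[1] : "";
    String key = parts.length > 2 ? parts[2] : "";
    return new OfsPath(vol, bkt, key);
  }
}
```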
[jira] [Commented] (HDDS-3241) Invalid container reported to SCM should be deleted
[ https://issues.apache.org/jira/browse/HDDS-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072660#comment-17072660 ] Yiqun Lin commented on HDDS-3241: - {quote} Fix me if I am wrong, but in this case the containers are not unknown but additional replicas are detected (unless the full container is deleted in the mean time). {quote} I mean that sometimes a DN still contains stale containers that SCM has already deleted. {quote} I am not sure if I understood, if some of the containers are valid, but some others are invalid, containers can be deleted. {quote} If we start up a completely wrong SCM, I think it almost certainly cannot exit safemode. So I assume the deletion behavior for unknown containers is safe. But as you mentioned, if only some of the containers are invalid, they can still be deleted. > Invalid container reported to SCM should be deleted > --- > > Key: HDDS-3241 > URL: https://issues.apache.org/jira/browse/HDDS-3241 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Affects Versions: 0.4.1 >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > For the invalid or out-updated container reported by Datanode, > ContainerReportHandler in SCM only prints error log and doesn't > take any action. > {noformat} > 2020-03-15 05:19:41,072 ERROR > org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received > container report for an unknown container 37 from datanode > 0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, > networkLocation: /dc2/rack1, certSerialId: null}. > org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container > with id #37 not found. 
> at > org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542) > at > org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188) > at > org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484) > at > org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204) > at > org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85) > at > org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126) > at > org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97) > at > org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46) > at > org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 2020-03-15 05:19:41,073 ERROR > org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received > container report for an unknown container 38 from datanode > 0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, > networkLocation: /dc2/rack1, certSerialId: null}. > org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container > with id #38 not found. 
> at > org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542) > at > org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188) > at > org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484) > at > org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204) > at > org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85) > at > org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126) > at > org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97) > at > org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46) > at > org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at >
[jira] [Comment Edited] (HDDS-3241) Invalid container reported to SCM should be deleted
[ https://issues.apache.org/jira/browse/HDDS-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069198#comment-17069198 ] Yiqun Lin edited comment on HDDS-3241 at 3/28/20, 2:55 AM: --- Thanks for the comments, [~elek] / [~msingh]. Actually, the current SCM safemode also ensures this behavior is safe enough in the case where we start SCM with wrong container/pipeline db files, which could then lead to large numbers of containers being deleted. That should not happen, because SCM won't exit safemode in the first place: the containers reported by the DNs will not reach the safemode threshold. I also mentioned another case: in large clusters, a node is sent out for repair and later comes back to the cluster. The SCM deletion behavior can help automatically clean up the stale container data on such Datanodes, which is also a common case. I have updated the PR to make this configurable and disabled by default. Please help take a look, thanks. was (Author: linyiqun): Thanks for the comments, [~elek] / [~msingh]. Actually current SCM safemode can also protect this behavior once we startup SCM with wrong container/pipeline db files. And then leads large containers deleted. This should not happen because SCM won't exit safemode firstly since DN containers reported will not reach the safemode threshold anyway. Also I have mentioned another case that in large clusters, the node sent to repair and come back to cluster again. SCM deletion behavior can help automation cleanup Datanode stale container datas. This is also one common cases. I have updated the PR to make this configurable and disabled by default. Please help have a look, thanks. 
> Invalid container reported to SCM should be deleted > --- > > Key: HDDS-3241 > URL: https://issues.apache.org/jira/browse/HDDS-3241 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Affects Versions: 0.4.1 >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > For the invalid or out-updated container reported by Datanode, > ContainerReportHandler in SCM only prints error log and doesn't > take any action. > {noformat} > 2020-03-15 05:19:41,072 ERROR > org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received > container report for an unknown container 37 from datanode > 0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, > networkLocation: /dc2/rack1, certSerialId: null}. > org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container > with id #37 not found. > at > org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542) > at > org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188) > at > org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484) > at > org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204) > at > org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85) > at > org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126) > at > org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97) > at > org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46) > at > org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 2020-03-15 05:19:41,073 ERROR > org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received > container report for an unknown container 38 from datanode > 0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, > networkLocation: /dc2/rack1, certSerialId: null}. > org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container > with id #38 not found. > at > org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542) > at > org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188) > at > org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484) > at >
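The safemode argument above can be made concrete with a rough sketch of a container-report threshold check (the class name and default threshold here are invented, not SCM's actual safemode rule code): an SCM started with a wrong container DB would see few of its known containers reported by Datanodes, so the ratio stays below the threshold and SCM stays in safemode, preventing any large-scale deletion.

```java
// Rough sketch of a safemode exit check: SCM leaves safemode only after
// Datanodes have reported at least `threshold` (e.g. 0.99) of the containers
// SCM knows about. An SCM started with a wrong/empty container DB keeps this
// ratio low and stays in safemode. Invented names, not real SCM code.
public final class ContainerSafeModeCheck {
  private final double threshold;

  public ContainerSafeModeCheck(double threshold) {
    this.threshold = threshold;
  }

  public boolean canExitSafeMode(long containersInScmDb, long containersReportedByDns) {
    if (containersInScmDb == 0) {
      return true; // nothing to wait for on a fresh cluster
    }
    return (double) containersReportedByDns / containersInScmDb >= threshold;
  }
}
```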
[jira] [Commented] (HDDS-3241) Invalid container reported to SCM should be deleted
[ https://issues.apache.org/jira/browse/HDDS-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069198#comment-17069198 ] Yiqun Lin commented on HDDS-3241: - Thanks for the comments, [~elek] / [~msingh]. Actually, the current SCM safemode also protects against this in the case where we start SCM with wrong container/pipeline db files, which could then lead to large numbers of containers being deleted. That should not happen, because SCM won't exit safemode in the first place: the containers reported by the DNs will not reach the safemode threshold. I also mentioned another case: in large clusters, a node is sent out for repair and later comes back to the cluster. The SCM deletion behavior can help automatically clean up the stale container data on such Datanodes, which is also a common case. I have updated the PR to make this configurable and disabled by default. Please help take a look, thanks. > Invalid container reported to SCM should be deleted > --- > > Key: HDDS-3241 > URL: https://issues.apache.org/jira/browse/HDDS-3241 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Affects Versions: 0.4.1 >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > For the invalid or out-updated container reported by Datanode, > ContainerReportHandler in SCM only prints error log and doesn't > take any action. > {noformat} > 2020-03-15 05:19:41,072 ERROR > org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received > container report for an unknown container 37 from datanode > 0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, > networkLocation: /dc2/rack1, certSerialId: null}. > org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container > with id #37 not found. 
> at > org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542) > at > org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188) > at > org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484) > at > org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204) > at > org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85) > at > org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126) > at > org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97) > at > org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46) > at > org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 2020-03-15 05:19:41,073 ERROR > org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received > container report for an unknown container 38 from datanode > 0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, > networkLocation: /dc2/rack1, certSerialId: null}. > org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container > with id #38 not found. 
> at > org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542) > at > org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188) > at > org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484) > at > org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204) > at > org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85) > at > org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126) > at > org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97) > at > org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46) > at > org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81) > at >
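The configurable, disabled-by-default behavior described in the comment might be wired roughly as below. The property name and the Action enum are invented for this sketch and are not the names used in the actual PR:

```java
// Sketch of the proposed handling for an unknown container report:
// by default keep today's behavior (log only); when the (hypothetical)
// switch is enabled, queue a delete-container command for the Datanode.
public final class UnknownContainerPolicy {
  // Invented property name, for illustration only.
  public static final String DELETE_UNKNOWN_CONTAINER_KEY =
      "hdds.scm.unknown-container.delete.enabled";

  public enum Action { LOG_ONLY, DELETE }

  private final boolean deleteEnabled;

  public UnknownContainerPolicy(boolean deleteEnabled) {
    this.deleteEnabled = deleteEnabled;
  }

  /** Decide what to do when a Datanode reports a container SCM does not know. */
  public Action onUnknownContainer(long containerId) {
    return deleteEnabled ? Action.DELETE : Action.LOG_ONLY;
  }
}
```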
[jira] [Comment Edited] (HDDS-3241) Invalid container reported to SCM should be deleted
[ https://issues.apache.org/jira/browse/HDDS-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063440#comment-17063440 ] Yiqun Lin edited comment on HDDS-3241 at 3/20/20, 3:02 PM: --- Hi [~elek], in the HDFS world, invalid blocks reported to the NameNode will be deleted: the NameNode replies to the DataNode with block deletion commands. So I think the same should apply to Ozone, although deleting a container may be more expensive since it stores more data. What I want to say is that we could have a setting to control this action. By default, we keep the current logic and just log an error; it then depends on the users how they want SCM to behave. For example, I rebuilt my test cluster and still use the same Datanodes, and these Datanodes keep reporting stale containers. Yes, I can delete these containers manually, but it would be better if SCM could help send container deletion commands to these Datanodes. was (Author: linyiqun): Hi [~elek], In HDFS world, invalid blocks reported to NameNode will be deleted. NameNode will reply DataNode with block deletion commands. So I think this should be same for Ozone. But maybe deleting container will be a more expensive way since it stores more data. But I want to say, we could have a setting to control this action. By default, we keep current logic and just log an error. This just depends on the users that how they want SCM to do. For example, I rebuild my test cluster and still uses previous Datanode. And these Datanodes keep reporting stale containers. Maybe I can deletion these containers manually. This would be better if SCM can help send deletion containers to these Datanodes. 
> Invalid container reported to SCM should be deleted > --- > > Key: HDDS-3241 > URL: https://issues.apache.org/jira/browse/HDDS-3241 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Affects Versions: 0.4.1 >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > For the invalid or out-updated container reported by Datanode, > ContainerReportHandler in SCM only prints error log and doesn't > take any action. > {noformat} > 2020-03-15 05:19:41,072 ERROR > org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received > container report for an unknown container 37 from datanode > 0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, > networkLocation: /dc2/rack1, certSerialId: null}. > org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container > with id #37 not found. > at > org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542) > at > org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188) > at > org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484) > at > org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204) > at > org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85) > at > org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126) > at > org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97) > at > org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46) > at > org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 2020-03-15 05:19:41,073 ERROR > org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received > container report for an unknown container 38 from datanode > 0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, > networkLocation: /dc2/rack1, certSerialId: null}. > org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container > with id #38 not found. > at > org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542) > at > org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188) > at > org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484) > at >
[jira] [Commented] (HDDS-3241) Invalid container reported to SCM should be deleted
[ https://issues.apache.org/jira/browse/HDDS-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063440#comment-17063440 ] Yiqun Lin commented on HDDS-3241: - Hi [~elek], in the HDFS world, invalid blocks reported to the NameNode will be deleted: the NameNode replies to the DataNode with block deletion commands. So I think the same should apply to Ozone, although deleting a container may be more expensive since it stores more data. What I want to say is that we could have a setting to control this action. By default, we keep the current logic and just log an error; it then depends on the users how they want SCM to behave. For example, I rebuilt my test cluster and still use the previous Datanodes, and these Datanodes keep reporting stale containers. Maybe I can delete these containers manually, but it would be better if SCM could help send container deletion commands to these Datanodes. > Invalid container reported to SCM should be deleted > --- > > Key: HDDS-3241 > URL: https://issues.apache.org/jira/browse/HDDS-3241 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Affects Versions: 0.4.1 >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > For the invalid or out-updated container reported by Datanode, > ContainerReportHandler in SCM only prints error log and doesn't > take any action. > {noformat} > 2020-03-15 05:19:41,072 ERROR > org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received > container report for an unknown container 37 from datanode > 0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, > networkLocation: /dc2/rack1, certSerialId: null}. > org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container > with id #37 not found. 
> at > org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542) > at > org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188) > at > org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484) > at > org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204) > at > org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85) > at > org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126) > at > org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97) > at > org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46) > at > org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 2020-03-15 05:19:41,073 ERROR > org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received > container report for an unknown container 38 from datanode > 0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, > networkLocation: /dc2/rack1, certSerialId: null}. > org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container > with id #38 not found. 
> at > org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542) > at > org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188) > at > org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484) > at > org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204) > at > org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85) > at > org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126) > at > org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97) > at > org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46) > at > org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81) > at >
[jira] [Updated] (HDDS-3241) Invalid container reported to SCM should be deleted
[ https://issues.apache.org/jira/browse/HDDS-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDDS-3241: Description: For the invalid or out-updated container reported by Datanode, ContainerReportHandler in SCM only prints error log and doesn't take any action. {noformat} 2020-03-15 05:19:41,072 ERROR org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received container report for an unknown container 37 from datanode 0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, networkLocation: /dc2/rack1, certSerialId: null}. org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container with id #37 not found. at org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542) at org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188) at org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484) at org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204) at org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85) at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126) at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97) at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46) at org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2020-03-15 05:19:41,073 ERROR 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received container report for an unknown container 38 from datanode 0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, networkLocation: /dc2/rack1, certSerialId: null}. org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container with id #38 not found. at org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542) at org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188) at org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484) at org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204) at org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85) at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126) at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97) at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46) at org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) {noformat} Actually SCM should inform the Datanode to delete its outdated containers. Otherwise, the Datanode will keep reporting these invalid containers and the dirty container data will be kept on the Datanode forever. Sometimes we bring back a node that has been repaired and may still store stale data, so we should have a way to clean it up automatically. We could have a setting to control this auto-deletion behavior if it is considered a somewhat risky approach. 
was: For the invalid or out-updated container reported by Datanode, ContainerReportHandler in SCM only prints error log and doesn't take any action. {noformat} 2020-03-15 05:19:41,072 ERROR org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received container report for an unknown container 37 from datanode 0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, networkLocation: /dc2/rack1, certSerialId: null}. org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container with id #37 not found. at org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542) at org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188) at
[jira] [Updated] (HDDS-3241) Invalid container reported to SCM should be deleted
[ https://issues.apache.org/jira/browse/HDDS-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDDS-3241: Description: For the invalid or out-updated container reported by Datanode, ContainerReportHandler in SCM only prints error log and doesn't take any action. {noformat} 2020-03-15 05:19:41,072 ERROR org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received container report for an unknown container 37 from datanode 0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, networkLocation: /dc2/rack1, certSerialId: null}. org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container with id #37 not found. at org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542) at org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188) at org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484) at org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204) at org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85) at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126) at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97) at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46) at org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2020-03-15 05:19:41,073 ERROR 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received container report for an unknown container 38 from datanode 0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, networkLocation: /dc2/rack1, certSerialId: null}. org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container with id #38 not found. at org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542) at org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188) at org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484) at org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204) at org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85) at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126) at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97) at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46) at org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) {noformat} SCM should actually instruct the Datanode to delete its outdated container. Otherwise, the Datanode will keep reporting this invalid container and the dirty container data will remain on the Datanode forever. Sometimes we bring a repaired node back online and it may still store stale data. We could add a setting to control this auto-deletion behavior if the approach is considered a little risky. 
was: For the invalid or out-updated container reported by Datanode, ContainerReportHandler in SCM only print error log and doesn't any action. (previous stack trace unchanged, truncated here)
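The proposal above can be sketched roughly as follows. This is an illustrative sketch, not Ozone's actual API: the class, method, and the config flag are hypothetical. The idea is that when SCM receives a report containing container IDs it does not know, it collects them and, only if an operator-controlled setting allows, would queue delete-container commands for the reporting Datanode instead of merely logging an error.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the proposed handling of unknown containers in a
// datanode report. In the real ContainerReportHandler the unknown ID is the
// one that raised ContainerNotFoundException; here we model the decision only.
class UnknownContainerSketch {
    static List<Long> containersToDelete(Set<Long> knownContainers,
                                         List<Long> reportedContainers,
                                         boolean autoDeleteEnabled) {
        if (!autoDeleteEnabled) {
            // Current behavior: only log, take no action.
            return Collections.emptyList();
        }
        List<Long> toDelete = new ArrayList<>();
        for (long id : reportedContainers) {
            if (!knownContainers.contains(id)) {
                // SCM would send a delete-container command for this ID.
                toDelete.add(id);
            }
        }
        return toDelete;
    }
}
```

Guarding the deletion behind a flag matches the suggestion that auto-deletion may be risky, e.g. when a repaired node rejoins with containers SCM has since forgotten.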
[jira] [Updated] (HDDS-3241) Invalid container reported to SCM should be deleted
[ https://issues.apache.org/jira/browse/HDDS-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDDS-3241: Status: Patch Available (was: Open) > Invalid container reported to SCM should be deleted > --- > > Key: HDDS-3241 > URL: https://issues.apache.org/jira/browse/HDDS-3241 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Affects Versions: 0.4.1 >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h >
[jira] [Updated] (HDDS-3241) Invalid container reported to SCM should be deleted
[ https://issues.apache.org/jira/browse/HDDS-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDDS-3241: Affects Version/s: 0.4.1 > Invalid container reported to SCM should be deleted > --- > > Key: HDDS-3241 > URL: https://issues.apache.org/jira/browse/HDDS-3241 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Affects Versions: 0.4.1 >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org
[jira] [Created] (HDDS-3241) Invalid container reported to SCM should be deleted
Yiqun Lin created HDDS-3241: --- Summary: Invalid container reported to SCM should be deleted Key: HDDS-3241 URL: https://issues.apache.org/jira/browse/HDDS-3241 Project: Hadoop Distributed Data Store Issue Type: Bug Reporter: Yiqun Lin Assignee: Yiqun Lin 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org
[jira] [Commented] (HDDS-3180) Datanode fails to start due to confused inconsistent volume state
[ https://issues.apache.org/jira/browse/HDDS-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060552#comment-17060552 ] Yiqun Lin commented on HDDS-3180: - Thanks [~xyao] for the review and merge. > Datanode fails to start due to confused inconsistent volume state > - > > Key: HDDS-3180 > URL: https://issues.apache.org/jira/browse/HDDS-3180 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Affects Versions: 0.4.1 >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Major > Labels: pull-request-available > Fix For: 0.6.0 > > Time Spent: 20m > Remaining Estimate: 0h > > I met an error in my test Ozone cluster when I restarted a Datanode. The > log reports an inconsistent volume state but gives no other helpful > detail: > {noformat} > 2020-03-14 02:31:46,204 [main] INFO (LogAdapter.java:51) - registered > UNIX signal handlers for [TERM, HUP, INT] > 2020-03-14 02:31:46,736 [main] INFO (HddsDatanodeService.java:204) - > HddsDatanodeService host:lyq-xx.xx.xx.xx ip:xx.xx.xx.xx > 2020-03-14 02:31:46,784 [main] INFO (HddsVolume.java:177) - Creating > Volume: /tmp/hadoop-hdfs/dfs/data/hdds of storage type : DISK and capacity : > 20063645696 > 2020-03-14 02:31:46,786 [main] ERROR (MutableVolumeSet.java:202) - Failed > to parse the storage location: file:///tmp/hadoop-hdfs/dfs/data > java.io.IOException: Volume is in an INCONSISTENT state. 
Skipped loading > volume: /tmp/hadoop-hdfs/dfs/data/hdds > at > org.apache.hadoop.ozone.container.common.volume.HddsVolume.initialize(HddsVolume.java:226) > at > org.apache.hadoop.ozone.container.common.volume.HddsVolume.(HddsVolume.java:180) > at > org.apache.hadoop.ozone.container.common.volume.HddsVolume.(HddsVolume.java:71) > at > org.apache.hadoop.ozone.container.common.volume.HddsVolume$Builder.build(HddsVolume.java:158) > at > org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.createVolume(MutableVolumeSet.java:336) > at > org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.initializeVolumeSet(MutableVolumeSet.java:183) > at > org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.(MutableVolumeSet.java:139) > at > org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.(MutableVolumeSet.java:111) > at > org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.(OzoneContainer.java:97) > at > org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.(DatanodeStateMachine.java:128) > at > org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:235) > at > org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:179) > at > org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:154) > at > org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:78) > at picocli.CommandLine.execute(CommandLine.java:1173) > at picocli.CommandLine.access$800(CommandLine.java:141) > at picocli.CommandLine$RunLast.handle(CommandLine.java:1367) > at picocli.CommandLine$RunLast.handle(CommandLine.java:1335) > at > picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:1243) > at picocli.CommandLine.parseWithHandlers(CommandLine.java:1526) > at picocli.CommandLine.parseWithHandler(CommandLine.java:1465) > at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:65) > at 
org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:56) > at > org.apache.hadoop.ozone.HddsDatanodeService.main(HddsDatanodeService.java:137) > 2020-03-14 02:31:46,795 [shutdown-hook-0] INFO (LogAdapter.java:51) - > SHUTDOWN_MSG: > {noformat} > Then I looked into the code; the root cause is that the VERSION file was > lost on that node. > We need to log the key message as well to help users quickly identify the root cause > of this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org
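The fix described here, surfacing the missing VERSION file instead of only the generic INCONSISTENT error, can be sketched like this. The class and method names are illustrative assumptions, not the actual HddsVolume code; only the message wording mirrors the log shown in the thread.

```java
import java.io.File;

// Illustrative sketch: check for the VERSION file in the volume root and
// report that specific cause before failing with the generic
// "Volume is in an INCONSISTENT state" error.
class VolumeStateSketch {
    static String describeVolumeState(File hddsRootDir) {
        File versionFile = new File(hddsRootDir, "VERSION");
        if (!versionFile.exists()) {
            // This is the detail the patch adds as a WARN log line.
            return "VERSION file does not exist in volume " + hddsRootDir
                + ", current volume state: INCONSISTENT.";
        }
        return "Volume " + hddsRootDir + " state: NORMAL";
    }
}
```

Logging this one extra line turns a puzzling startup failure into an immediately actionable one: the operator sees that the VERSION file is missing rather than having to read the source to find out.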
[jira] [Updated] (HDDS-3180) Datanode fails to start due to confused inconsistent volume state
[ https://issues.apache.org/jira/browse/HDDS-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDDS-3180: Status: Patch Available (was: Open) > Datanode fails to start due to confused inconsistent volume state > - > > Key: HDDS-3180 > URL: https://issues.apache.org/jira/browse/HDDS-3180 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Affects Versions: 0.4.1 >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDDS-3180) Datanode fails to start due to confused inconsistent volume state
[ https://issues.apache.org/jira/browse/HDDS-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059327#comment-17059327 ] Yiqun Lin edited comment on HDDS-3180 at 3/14/20, 12:29 PM: We should additionally log the cause of the inconsistent state because this state causes the Datanode to fail to start. A friendlier message, tested locally: {noformat} 2020-03-14 04:41:27,249 [main] INFO (HddsVolume.java:177) - Creating Volume: /tmp/hadoop-hdfs/dfs/data/hdds of storage type : DISK and capacity : 9997713408 2020-03-14 04:41:27,250 [main] WARN (HddsVolume.java:252) - VERSION file does not exist in volume /tmp/hadoop-hdfs/dfs/data/hdds, current volume state: INCONSISTENT. 2020-03-14 04:41:27,257 [main] ERROR (MutableVolumeSet.java:202) - Failed to parse the storage location: file:///tmp/hadoop-hdfs/dfs/data java.io.IOException: Volume is in an INCONSISTENT state. Skipped loading volume: /tmp/hadoop-hdfs/dfs/data/hdds at org.apache.hadoop.ozone.container.common.volume.HddsVolume.initialize(HddsVolume.java:226) at org.apache.hadoop.ozone.container.common.volume.HddsVolume.(HddsVolume.java:180) at org.apache.hadoop.ozone.container.common.volume.HddsVolume.(HddsVolume.java:71) at org.apache.hadoop.ozone.container.common.volume.HddsVolume$Builder.build(HddsVolume.java:158) at org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.createVolume(MutableVolumeSet.java:336) {noformat} was (Author: linyiqun): We need to additionally add log for the inconsistent state because this state will lead Datanode failed to start. 
> Datanode fails to start due to confused inconsistent volume state > - > > Key: HDDS-3180 > URL: https://issues.apache.org/jira/browse/HDDS-3180 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Affects Versions: 0.4.1 >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h >
[jira] [Commented] (HDDS-3180) Datanode fails to start due to confused inconsistent volume state
[ https://issues.apache.org/jira/browse/HDDS-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059327#comment-17059327 ] Yiqun Lin commented on HDDS-3180: - We need to additionally add log for the inconsistent state because this state will lead Datanode failed to start. > Datanode fails to start due to confused inconsistent volume state > - > > Key: HDDS-3180 > URL: https://issues.apache.org/jira/browse/HDDS-3180 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Affects Versions: 0.4.1 >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > I meet an error in my testing ozone cluster when I restart datanode. From the > log, it throws inconsistent volume state but without other detailed helpful > info: > {noformat} > 2020-03-14 02:31:46,204 [main] INFO (LogAdapter.java:51) - registered > UNIX signal handlers for [TERM, HUP, INT] > 2020-03-14 02:31:46,736 [main] INFO (HddsDatanodeService.java:204) - > HddsDatanodeService host:lyq-xx.xx.xx.xx ip:xx.xx.xx.xx > 2020-03-14 02:31:46,784 [main] INFO (HddsVolume.java:177) - Creating > Volume: /tmp/hadoop-hdfs/dfs/data/hdds of storage type : DISK and capacity : > 20063645696 > 2020-03-14 02:31:46,786 [main] ERROR (MutableVolumeSet.java:202) - Failed > to parse the storage location: file:///tmp/hadoop-hdfs/dfs/data > java.io.IOException: Volume is in an INCONSISTENT state. 
Skipped loading > volume: /tmp/hadoop-hdfs/dfs/data/hdds > at > org.apache.hadoop.ozone.container.common.volume.HddsVolume.initialize(HddsVolume.java:226) > at > org.apache.hadoop.ozone.container.common.volume.HddsVolume.(HddsVolume.java:180) > at > org.apache.hadoop.ozone.container.common.volume.HddsVolume.(HddsVolume.java:71) > at > org.apache.hadoop.ozone.container.common.volume.HddsVolume$Builder.build(HddsVolume.java:158) > at > org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.createVolume(MutableVolumeSet.java:336) > at > org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.initializeVolumeSet(MutableVolumeSet.java:183) > at > org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.(MutableVolumeSet.java:139) > at > org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.(MutableVolumeSet.java:111) > at > org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.(OzoneContainer.java:97) > at > org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.(DatanodeStateMachine.java:128) > at > org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:235) > at > org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:179) > at > org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:154) > at > org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:78) > at picocli.CommandLine.execute(CommandLine.java:1173) > at picocli.CommandLine.access$800(CommandLine.java:141) > at picocli.CommandLine$RunLast.handle(CommandLine.java:1367) > at picocli.CommandLine$RunLast.handle(CommandLine.java:1335) > at > picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:1243) > at picocli.CommandLine.parseWithHandlers(CommandLine.java:1526) > at picocli.CommandLine.parseWithHandler(CommandLine.java:1465) > at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:65) > at 
org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:56) > at > org.apache.hadoop.ozone.HddsDatanodeService.main(HddsDatanodeService.java:137) > 2020-03-14 02:31:46,795 [shutdown-hook-0] INFO (LogAdapter.java:51) - > SHUTDOWN_MSG: > {noformat} > Then I looked into the code, and the root cause is that the VERSION file was > lost on that node. > We need to log a key message as well to help users quickly know the root cause > of this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org
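The proposed improvement boils down to naming the likely root cause (here, a missing VERSION file) in the error instead of only saying "INCONSISTENT". A minimal hypothetical sketch of that idea — the method and class names below are illustrative, not the real HddsVolume API:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical sketch: report *why* a volume is inconsistent instead of
// only throwing "Volume is in an INCONSISTENT state".
public class VolumeStateCheck {

  // Returns a human-readable reason, or null when the volume looks fine.
  static String describeInconsistency(File volumeRoot) {
    File versionFile = new File(volumeRoot, "VERSION");
    if (!versionFile.exists()) {
      return "Volume is in an INCONSISTENT state: missing VERSION file "
          + versionFile.getPath();
    }
    return null;
  }

  public static void main(String[] args) throws IOException {
    // A fresh temp dir has no VERSION file, so the check names the cause.
    File root = Files.createTempDirectory("hdds-volume-demo").toFile();
    String reason = describeInconsistency(root);
    System.out.println(reason == null ? "OK" : "INCONSISTENT: missing VERSION");
  }
}
```

With a message like this, an operator would immediately see which file to restore rather than having to read the source code.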
[jira] [Updated] (HDDS-3180) Datanode fails to start due to confused inconsistent volume state
[ https://issues.apache.org/jira/browse/HDDS-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDDS-3180: Summary: Datanode fails to start due to confused inconsistent volume state (was: Datanode fails to start due to inconsistent volume state without helpful error message)
[jira] [Updated] (HDDS-3180) Datanode fails to start due to inconsistent volume state without helpful error message
[ https://issues.apache.org/jira/browse/HDDS-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDDS-3180: Summary: Datanode fails to start due to inconsistent volume state without helpful error message (was: Datanode shutdown due to inconsistent volume state without helpful error message)
[jira] [Created] (HDDS-3180) Datanode shutdown due to inconsistent volume state without helpful error message
Yiqun Lin created HDDS-3180: --- Summary: Datanode shutdown due to inconsistent volume state without helpful error message Key: HDDS-3180 URL: https://issues.apache.org/jira/browse/HDDS-3180 Project: Hadoop Distributed Data Store Issue Type: Improvement Affects Versions: 0.4.1 Reporter: Yiqun Lin Assignee: Yiqun Lin
[jira] [Updated] (HDDS-3111) Add unit test for container replication behavior under different container placement policy
[ https://issues.apache.org/jira/browse/HDDS-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDDS-3111: Status: Patch Available (was: Open) > Add unit test for container replication behavior under different container > placement policy > --- > > Key: HDDS-3111 > URL: https://issues.apache.org/jira/browse/HDDS-3111 > Project: Hadoop Distributed Data Store > Issue Type: Test >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Currently, the unit tests for ReplicationManager only cover container > state changes, and the container placement policy tests only focus on the > policy algorithm itself. > We lack an integration-style unit test for container replication > behavior under different container placement policies, including corner > cases such as not enough candidate nodes and fallback cases in the rack > awareness policy.
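One of the corner cases called out above — fallback in the rack awareness policy when there are not enough distinct racks — can be modeled with a toy placement function. This is an illustrative simplification under assumed semantics, not the real SCM placement policy code:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of rack-aware replica placement with fallback (illustrative only).
public class RackAwareFallbackDemo {

  // Prefer one node per rack; fall back to reusing racks when there are
  // not enough distinct racks for the requested replica count.
  static List<String> choose(Map<String, String> nodeToRack, int needed) {
    List<String> chosen = new ArrayList<>();
    Set<String> usedRacks = new HashSet<>();
    // First pass: one node per distinct rack.
    for (Map.Entry<String, String> e : nodeToRack.entrySet()) {
      if (chosen.size() == needed) break;
      if (usedRacks.add(e.getValue())) chosen.add(e.getKey());
    }
    // Fallback pass: allow repeated racks if still short.
    for (String node : nodeToRack.keySet()) {
      if (chosen.size() == needed) break;
      if (!chosen.contains(node)) chosen.add(node);
    }
    return chosen;
  }

  public static void main(String[] args) {
    Map<String, String> cluster = new LinkedHashMap<>();
    cluster.put("dn1", "rack1");
    cluster.put("dn2", "rack1");
    cluster.put("dn3", "rack2");
    // Three replicas requested but only two racks: fallback must kick in.
    List<String> replicas = choose(cluster, 3);
    Set<String> racks = new HashSet<>();
    for (String dn : replicas) racks.add(cluster.get(dn));
    System.out.println(replicas.size() + " replicas on " + racks.size() + " racks");
  }
}
```

A test of the kind proposed in this issue would assert that placement still succeeds (with repeated racks) rather than failing outright when candidate nodes are scarce.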
[jira] [Created] (HDDS-3111) Add unit test for container replication behavior under different container placement policy
Yiqun Lin created HDDS-3111: --- Summary: Add unit test for container replication behavior under different container placement policy Key: HDDS-3111 URL: https://issues.apache.org/jira/browse/HDDS-3111 Project: Hadoop Distributed Data Store Issue Type: Test Reporter: Yiqun Lin Assignee: Yiqun Lin
[jira] [Comment Edited] (HDDS-3058) OzoneFileSystem should override unsupported set type FileSystem API
[ https://issues.apache.org/jira/browse/HDDS-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045595#comment-17045595 ] Yiqun Lin edited comment on HDDS-3058 at 2/26/20 2:49 PM: -- I verified the change locally; it will break the current ozone fs put command. The above methods are triggered during the put command. I'd like to close this JIRA as invalid since I cannot find a better way to solve this for now, :D. {quote} Yiqun Lin, thanks for reporting this. We plan to improve the FS API under an umbrella JIRA, HDDS-3048. Feel free to join us if you have interest. {quote} I will take a look at that, thanks for the reference, sammi! was (Author: linyiqun): I verified the change locally; it will break the current ozone fs put command. The above methods are triggered during the put command. I'd like to close this JIRA as invalid since I cannot find a better way to solve this, :D. {quote} Yiqun Lin, thanks for reporting this. We plan to improve the FS API under an umbrella JIRA, HDDS-3048. Feel free to join us if you have interest. {quote} I will take a look at that, thanks for the reference, sammi! > OzoneFileSystem should override unsupported set type FileSystem API > --- > > Key: HDDS-3058 > URL: https://issues.apache.org/jira/browse/HDDS-3058 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Filesystem >Affects Versions: 0.4.1 >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Currently, OzoneFileSystem only implements some commonly used FileSystem APIs; > most other APIs are not supported and are inherited from the parent class > FileSystem by default. However, FileSystem does nothing in some setter > methods, like setReplication and setOwner. > {code:java} > public void setVerifyChecksum(boolean verifyChecksum) { > //doesn't do anything > } > public void setWriteChecksum(boolean writeChecksum) { > //doesn't do anything > } > public boolean setReplication(Path src, short replication) > throws IOException { > return true; > } > public void setPermission(Path p, FsPermission permission > ) throws IOException { > } > public void setOwner(Path p, String username, String groupname > ) throws IOException { > } > public void setTimes(Path p, long mtime, long atime > ) throws IOException { > } > {code} > These setter methods depend on the sub-filesystem implementation. We need > to throw an unsupported-operation exception if the sub-filesystem cannot > support them. Otherwise, users who run the hadoop fs -setrep command or call > the setReplication API are confused: they see no exception and the command/API > appears to execute fine. This happened when I tested OzoneFileSystem via the > hadoop fs command.
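The change this issue originally asked for — overriding the silent no-op setters so unsupported operations fail loudly — can be sketched with a standalone class. The real fix would override these methods in OzoneFileSystem itself; the class below is only a hedged, self-contained illustration:

```java
import java.io.IOException;

// Illustrative sketch, not the actual OzoneFileSystem source: make an
// unsupported setter throw instead of silently succeeding.
public class StrictSettersDemo {

  public boolean setReplication(String src, short replication)
      throws IOException {
    throw new UnsupportedOperationException(
        "setReplication is not supported by this FileSystem");
  }

  public static void main(String[] args) {
    StrictSettersDemo fs = new StrictSettersDemo();
    try {
      fs.setReplication("/volume/bucket/key", (short) 3);
    } catch (UnsupportedOperationException | IOException e) {
      // A caller of hadoop fs -setrep would now see an error
      // instead of a misleading success.
      System.out.println("rejected");
    }
  }
}
```

As the comment above notes, this approach was ultimately abandoned because throwing from these methods breaks the ozone fs put path, which calls them internally.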
[jira] [Updated] (HDDS-3058) OzoneFileSystem should override unsupported set type FileSystem API
[ https://issues.apache.org/jira/browse/HDDS-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDDS-3058: Resolution: Invalid Status: Resolved (was: Patch Available) I verified the change locally; it will break the current ozone fs put command. The above methods are triggered during the put command. I'd like to close this JIRA as invalid since I cannot find a better way to solve this, :D. {quote} Yiqun Lin, thanks for reporting this. We plan to improve the FS API under an umbrella JIRA, HDDS-3048. Feel free to join us if you have interest. {quote} I will take a look at that, thanks for the reference, sammi!
[jira] [Updated] (HDDS-3070) NPE when stop recon server while recon server was not really started before
[ https://issues.apache.org/jira/browse/HDDS-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDDS-3070: Status: Patch Available (was: Open) > NPE when stop recon server while recon server was not really started before > --- > > Key: HDDS-3070 > URL: https://issues.apache.org/jira/browse/HDDS-3070 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Recon >Affects Versions: 0.4.1 >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > I met an NPE while testing Ozone. It seems the root cause is that the Recon > server was not really started, yet we still try to stop it. > {noformat} > 2020-02-25 20:22:44,296 [Thread-0] ERROR ozone.MiniOzoneClusterImpl > (MiniOzoneClusterImpl.java:build(525)) - Exception while shutting down the > Recon. > java.lang.NullPointerException > at > org.apache.hadoop.ozone.recon.tasks.ReconTaskControllerImpl.stop(ReconTaskControllerImpl.java:237) > at > org.apache.hadoop.ozone.recon.spi.impl.OzoneManagerServiceProviderImpl.stop(OzoneManagerServiceProviderImpl.java:229) > at org.apache.hadoop.ozone.recon.ReconServer.stop(ReconServer.java:132) > at > org.apache.hadoop.ozone.MiniOzoneClusterImpl.stopRecon(MiniOzoneClusterImpl.java:470) > at > org.apache.hadoop.ozone.MiniOzoneClusterImpl.access$200(MiniOzoneClusterImpl.java:87) > at > org.apache.hadoop.ozone.MiniOzoneClusterImpl$Builder.build(MiniOzoneClusterImpl.java:523) > at > org.apache.hadoop.fs.ozone.TestOzoneFileSystem.testFileSystem(TestOzoneFileSystem.java:72) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > {noformat}
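The stack trace points at a stop() path dereferencing a field that is only initialized when the server is actually started. A hedged sketch of the defensive fix — the field and class names here are hypothetical, not the actual ReconTaskControllerImpl members:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch: guard stop() so shutting down a component that was
// never started does not throw a NullPointerException.
public class SafeStopDemo {

  private ExecutorService executorService; // null until start() is called

  void start() {
    executorService = Executors.newSingleThreadExecutor();
  }

  void stop() {
    if (executorService != null) {   // the missing null check
      executorService.shutdownNow();
    }
  }

  public static void main(String[] args) {
    SafeStopDemo controller = new SafeStopDemo();
    controller.stop();               // never started: previously would NPE
    System.out.println("stopped-safely");
  }
}
```

An equivalent alternative is to track a started flag and make stop() a no-op when it was never set.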
[jira] [Updated] (HDDS-3070) NPE when stop recon server while recon server was not really started before
[ https://issues.apache.org/jira/browse/HDDS-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDDS-3070: Summary: NPE when stop recon server while recon server was not really started before (was: NPE when stop recon server while recon server was not really started)
[jira] [Created] (HDDS-3070) NPE when stop recon server while recon server was not really started
Yiqun Lin created HDDS-3070: --- Summary: NPE when stop recon server while recon server was not really started Key: HDDS-3070 URL: https://issues.apache.org/jira/browse/HDDS-3070 Project: Hadoop Distributed Data Store Issue Type: Bug Components: Ozone Recon Affects Versions: 0.4.1 Reporter: Yiqun Lin Assignee: Yiqun Lin
[jira] [Updated] (HDDS-3058) OzoneFileSystem should override unsupported set type FileSystem API
[ https://issues.apache.org/jira/browse/HDDS-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDDS-3058: Status: Patch Available (was: Open)
[jira] [Created] (HDDS-3058) OzoneFileSystem should override unsupported set type FileSystem API
Yiqun Lin created HDDS-3058: --- Summary: OzoneFileSystem should override unsupported set type FileSystem API Key: HDDS-3058 URL: https://issues.apache.org/jira/browse/HDDS-3058 Project: Hadoop Distributed Data Store Issue Type: Improvement Components: Ozone Filesystem Affects Versions: 0.4.1 Reporter: Yiqun Lin Assignee: Yiqun Lin
[jira] [Updated] (HDDS-3058) OzoneFileSystem should override unsupported set type FileSystem API
[ https://issues.apache.org/jira/browse/HDDS-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDDS-3058: Description: Currently, OzoneFileSystem only implements some commonly used FileSystem APIs; most other APIs are not supported and are inherited from the parent class FileSystem by default. However, FileSystem does nothing in some setter methods, like setReplication and setOwner. {code:java} public void setVerifyChecksum(boolean verifyChecksum) { //doesn't do anything } public void setWriteChecksum(boolean writeChecksum) { //doesn't do anything } public boolean setReplication(Path src, short replication) throws IOException { return true; } public void setPermission(Path p, FsPermission permission ) throws IOException { } public void setOwner(Path p, String username, String groupname ) throws IOException { } public void setTimes(Path p, long mtime, long atime ) throws IOException { } {code} These setter methods depend on the sub-filesystem implementation. We need to throw an unsupported-operation exception if the sub-filesystem cannot support them. Otherwise, users who run the hadoop fs -setrep command or call the setReplication API are confused: they see no exception and the command/API appears to execute fine. This happened when I tested OzoneFileSystem via the hadoop fs command.
[jira] [Updated] (HDDS-3058) OzoneFileSystem should override unsupported set type FileSystem API
[ https://issues.apache.org/jira/browse/HDDS-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDDS-3058: Issue Type: Bug (was: Improvement) > OzoneFileSystem should override unsupported set type FileSystem API > --- > > Key: HDDS-3058 > URL: https://issues.apache.org/jira/browse/HDDS-3058 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Filesystem >Affects Versions: 0.4.1 >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Major > > Currently, OzoneFileSystem only implements some common useful FileSystem APIs > and most of other API are not supported and inherited from parent class > FileSystem by default. However, FileSystem do nothing in some set type > method, like setReplication, setOwner. > {code} > public void setVerifyChecksum(boolean verifyChecksum) { > //doesn't do anything > } > public void setWriteChecksum(boolean writeChecksum) { > //doesn't do anything > } > public boolean setReplication(Path src, short replication) > throws IOException { > return true; > } > public void setPermission(Path p, FsPermission permission > ) throws IOException { > } > public void setOwner(Path p, String username, String groupname > ) throws IOException { > } > public void setTimes(Path p, long mtime, long atime > ) throws IOException { > } > {code} > This set type functions depend on the sub-filesystem implementation. We need > to to throw unsupported exception if sub-filesystem cannot support this. > Otherwise, it will make users confused to use hadoop fs -setrep command or > call setReplication api. Users will not see any exception but the command can > execute fine. This is happened when I tested for the OzoneFileSystem via > hadoop fs command way. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org
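The override pattern proposed in HDDS-3058 can be sketched in plain Java. The class names below (BaseFs, StrictFs) are hypothetical stand-ins for FileSystem and OzoneFileSystem, used only so the sketch compiles without Hadoop on the classpath; the real fix would override the same methods on org.apache.hadoop.fs.FileSystem:

```java
// Hypothetical stand-in for the FileSystem parent class: setters are silent no-ops.
class BaseFs {
    public boolean setReplication(String src, short replication) {
        return true; // silently claims success, which is the confusing behavior
    }
    public void setOwner(String p, String user, String group) {
        // doesn't do anything
    }
}

// Hypothetical stand-in for OzoneFileSystem: override unsupported setters to fail fast.
class StrictFs extends BaseFs {
    @Override
    public boolean setReplication(String src, short replication) {
        throw new UnsupportedOperationException("setReplication is not supported");
    }
    @Override
    public void setOwner(String p, String user, String group) {
        throw new UnsupportedOperationException("setOwner is not supported");
    }
}

public class Main {
    // Helper: returns true if the operation raises UnsupportedOperationException.
    public static boolean throwsUnsupported(Runnable r) {
        try { r.run(); return false; }
        catch (UnsupportedOperationException e) { return true; }
    }

    public static void main(String[] args) {
        BaseFs quiet = new BaseFs();
        BaseFs strict = new StrictFs();
        // Parent silently "succeeds"; the subclass surfaces the unsupported call.
        System.out.println(quiet.setReplication("/a", (short) 3));                     // true
        System.out.println(throwsUnsupported(() -> strict.setOwner("/a", "u", "g"))); // true
    }
}
```

With this in place, hadoop fs -setrep against the strict filesystem would report an error instead of pretending to succeed.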
[jira] [Updated] (HDDS-2972) Any container replication error can terminate SCM service
[ https://issues.apache.org/jira/browse/HDDS-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDDS-2972: Description: I found that any container replication error thrown in ReplicationManager can terminate the SCM service. Terminating the SCM just because of one container replication error is very expensive behavior; it is not worth shutting down the SCM for this. We can handle it more gracefully: catch the exception and log a warning with the thrown exception. The shutdown info: {noformat} 2020-01-30 08:16:04,705 ERROR org.apache.hadoop.hdds.scm.container.ReplicationManager: Exception in Replication Monitor Thread. java.lang.IllegalArgumentException: Affinity node /dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology at org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.checkAffinityNode(NetworkTopologyImpl.java:789) at org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.chooseRandom(NetworkTopologyImpl.java:399) at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseNode(SCMContainerPlacementRackAware.java:249) at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodes(SCMContainerPlacementRackAware.java:173) at org.apache.hadoop.hdds.scm.container.ReplicationManager.handleUnderReplicatedContainer(ReplicationManager.java:515) at org.apache.hadoop.hdds.scm.container.ReplicationManager.processContainer(ReplicationManager.java:311) at java.util.concurrent.ConcurrentHashMap$KeySetView.forEach(ConcurrentHashMap.java:4649) at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080) at org.apache.hadoop.hdds.scm.container.ReplicationManager.run(ReplicationManager.java:223) at java.lang.Thread.run(Thread.java:745) 2020-01-30 08:16:04,730 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1: java.lang.IllegalArgumentException: Affinity node 
/dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology 2020-01-30 08:16:04,734 INFO org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SHUTDOWN_MSG: {noformat} was: I found there any container replication error running in ReplicationManager can terminates SCM service. It's a very expensive behavior to terminate the SCM service just because of one container replication error. It's not worth to shutdown the SCM. We can be friendly to deal with this, catch the exception and print the warn message with thrown exception. The shutdown info: {noformat} 2020-01-30 08:16:04,705 ERROR org.apache.hadoop.hdds.scm.container.ReplicationManager: Exception in Replication Monitor Thread. java.lang.IllegalArgumentException: Affinity node /dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology at org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.checkAffinityNode(NetworkTopologyImpl.java:789) at org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.chooseRandom(NetworkTopologyImpl.java:399) at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseNode(SCMContainerPlacementRackAware.java:249) at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodes(SCMContainerPlacementRackAware.java:173) at org.apache.hadoop.hdds.scm.container.ReplicationManager.handleUnderReplicatedContainer(ReplicationManager.java:515) at org.apache.hadoop.hdds.scm.container.ReplicationManager.processContainer(ReplicationManager.java:311) at java.util.concurrent.ConcurrentHashMap$KeySetView.forEach(ConcurrentHashMap.java:4649) at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080) at org.apache.hadoop.hdds.scm.container.ReplicationManager.run(ReplicationManager.java:223) at java.lang.Thread.run(Thread.java:745) 2020-01-30 08:16:04,730 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1: java.lang.IllegalArgumentException: Affinity node 
/dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology 2020-01-30 08:16:04,734 INFO org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SHUTDOWN_MSG: {noformat} > Any container replication error can terminate SCM service > - > > Key: HDDS-2972 > URL: https://issues.apache.org/jira/browse/HDDS-2972 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: SCM >Affects Versions: 0.4.1 >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > I found there any container replication error
[jira] [Updated] (HDDS-2972) Any container replication error can terminate SCM service
[ https://issues.apache.org/jira/browse/HDDS-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDDS-2972: Status: Patch Available (was: Open) > Any container replication error can terminate SCM service > - > > Key: HDDS-2972 > URL: https://issues.apache.org/jira/browse/HDDS-2972 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: SCM >Affects Versions: 0.4.1 >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > I found there any container replication error running in ReplicationManager > can terminates SCM service. It's a very expensive behavior to terminate the > SCM service just because of one container replication error. > It's not worth to shutdown the SCM. We can be friendly to deal with this, > catch the exception and print the warn message with thrown exception. > The shutdown info: > {noformat} > 2020-01-30 08:16:04,705 ERROR > org.apache.hadoop.hdds.scm.container.ReplicationManager: Exception in > Replication Monitor Thread. 
> java.lang.IllegalArgumentException: Affinity node > /dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology > at > org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.checkAffinityNode(NetworkTopologyImpl.java:789) > at > org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.chooseRandom(NetworkTopologyImpl.java:399) > at > org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseNode(SCMContainerPlacementRackAware.java:249) > at > org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodes(SCMContainerPlacementRackAware.java:173) > at > org.apache.hadoop.hdds.scm.container.ReplicationManager.handleUnderReplicatedContainer(ReplicationManager.java:515) > at > org.apache.hadoop.hdds.scm.container.ReplicationManager.processContainer(ReplicationManager.java:311) > at > java.util.concurrent.ConcurrentHashMap$KeySetView.forEach(ConcurrentHashMap.java:4649) > at > java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080) > at > org.apache.hadoop.hdds.scm.container.ReplicationManager.run(ReplicationManager.java:223) > at java.lang.Thread.run(Thread.java:745) > 2020-01-30 08:16:04,730 INFO org.apache.hadoop.util.ExitUtil: Exiting with > status 1: java.lang.IllegalArgumentException: Affinity node > /dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology > 2020-01-30 08:16:04,734 INFO > org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: > SHUTDOWN_MSG: > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org
[jira] [Updated] (HDDS-2972) Any container replication error can terminate SCM service
[ https://issues.apache.org/jira/browse/HDDS-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDDS-2972: Summary: Any container replication error can terminate SCM service (was: Any container replication error can terminates SCM service) > Any container replication error can terminate SCM service > - > > Key: HDDS-2972 > URL: https://issues.apache.org/jira/browse/HDDS-2972 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: SCM >Affects Versions: 0.4.1 >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Major > > I found there any container replication error running in ReplicationManager > can terminates SCM service. It's a very expensive behavior to terminate the > SCM service just because of one container replication error. > It's not worth to shutdown the SCM. We can be friendly to deal with this, > catch the exception and print the warn message with thrown exception. > The shutdown info: > {noformat} > 2020-01-30 08:16:04,705 ERROR > org.apache.hadoop.hdds.scm.container.ReplicationManager: Exception in > Replication Monitor Thread. 
> java.lang.IllegalArgumentException: Affinity node > /dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology > at > org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.checkAffinityNode(NetworkTopologyImpl.java:789) > at > org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.chooseRandom(NetworkTopologyImpl.java:399) > at > org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseNode(SCMContainerPlacementRackAware.java:249) > at > org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodes(SCMContainerPlacementRackAware.java:173) > at > org.apache.hadoop.hdds.scm.container.ReplicationManager.handleUnderReplicatedContainer(ReplicationManager.java:515) > at > org.apache.hadoop.hdds.scm.container.ReplicationManager.processContainer(ReplicationManager.java:311) > at > java.util.concurrent.ConcurrentHashMap$KeySetView.forEach(ConcurrentHashMap.java:4649) > at > java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080) > at > org.apache.hadoop.hdds.scm.container.ReplicationManager.run(ReplicationManager.java:223) > at java.lang.Thread.run(Thread.java:745) > 2020-01-30 08:16:04,730 INFO org.apache.hadoop.util.ExitUtil: Exiting with > status 1: java.lang.IllegalArgumentException: Affinity node > /dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology > 2020-01-30 08:16:04,734 INFO > org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: > SHUTDOWN_MSG: > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org
[jira] [Created] (HDDS-2972) Any container replication error can terminates SCM service
Yiqun Lin created HDDS-2972: --- Summary: Any container replication error can terminates SCM service Key: HDDS-2972 URL: https://issues.apache.org/jira/browse/HDDS-2972 Project: Hadoop Distributed Data Store Issue Type: Improvement Components: SCM Affects Versions: 0.4.1 Reporter: Yiqun Lin Assignee: Yiqun Lin I found there any container replication error running in ReplicationManager can terminates SCM service. It's a very expensive behavior to terminate the SCM service just because of one container replication error. It's not worth to shutdown the SCM. We can be friendly to deal with this, catch the exception and print the warn message with thrown exception. The shutdown info: {noformat} 2020-01-30 08:16:04,705 ERROR org.apache.hadoop.hdds.scm.container.ReplicationManager: Exception in Replication Monitor Thread. java.lang.IllegalArgumentException: Affinity node /dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology at org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.checkAffinityNode(NetworkTopologyImpl.java:789) at org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.chooseRandom(NetworkTopologyImpl.java:399) at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseNode(SCMContainerPlacementRackAware.java:249) at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodes(SCMContainerPlacementRackAware.java:173) at org.apache.hadoop.hdds.scm.container.ReplicationManager.handleUnderReplicatedContainer(ReplicationManager.java:515) at org.apache.hadoop.hdds.scm.container.ReplicationManager.processContainer(ReplicationManager.java:311) at java.util.concurrent.ConcurrentHashMap$KeySetView.forEach(ConcurrentHashMap.java:4649) at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080) at org.apache.hadoop.hdds.scm.container.ReplicationManager.run(ReplicationManager.java:223) at java.lang.Thread.run(Thread.java:745) 2020-01-30 08:16:04,730 INFO 
org.apache.hadoop.util.ExitUtil: Exiting with status 1: java.lang.IllegalArgumentException: Affinity node /dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology 2020-01-30 08:16:04,734 INFO org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SHUTDOWN_MSG: {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org
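The fix described in HDDS-2972 amounts to wrapping each container's processing in a try/catch so that one bad container logs a warning instead of letting the exception escape the monitor thread and terminate SCM. A minimal sketch, where processContainer and its failure condition are hypothetical stand-ins for the ReplicationManager internals:

```java
import java.util.List;

public class ReplicationLoopSketch {
    // Hypothetical stand-in for ReplicationManager#processContainer:
    // fails for "bad" container ids, mimicking the topology error above.
    static void processContainer(String id) {
        if (id.startsWith("bad")) {
            throw new IllegalArgumentException(
                "Affinity node is not a member of topology");
        }
    }

    // One pass of the replication monitor loop. Before the fix the exception
    // escaped this loop and ExitUtil shut SCM down; here each failure is
    // caught, logged as a warning, and skipped. Returns {processed, skipped}.
    static int[] runOnce(List<String> containers) {
        int ok = 0, failed = 0;
        for (String id : containers) {
            try {
                processContainer(id);
                ok++;
            } catch (RuntimeException e) {
                failed++;
                System.out.println("WARN: skipping container " + id
                    + ": " + e.getMessage());
            }
        }
        return new int[] {ok, failed};
    }

    public static void main(String[] args) {
        int[] r = runOnce(List.of("c1", "bad-c2", "c3"));
        System.out.println(r[0] + " processed, " + r[1] + " skipped"); // 2 processed, 1 skipped
    }
}
```

The key property is that a failure on one container leaves the loop, the thread, and the SCM process alive for the remaining containers.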
[jira] [Comment Edited] (HDDS-2939) Ozone FS namespace
[ https://issues.apache.org/jira/browse/HDDS-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025842#comment-17025842 ] Yiqun Lin edited comment on HDDS-2939 at 1/29/20 12:46 PM: --- Hi [~sdeka], I am reading this design doc; some comments from me: For the Filesystem Namespace Operations, ls (list files/folders) will also be a common operation. But under the current implementation, to list a directory, for example, we have to traverse the whole directory/file table to look up the child files/sub-folders. This is an ineffective way. I know the lookup approach can greatly reduce the memory used, but it is not friendly to the ls operation. Do we have any other improvement for this? Can we additionally store the child IDs for each record in the directory table? That would help us quickly find the child files or child folders. {quote}Associating a lock with each parent prefix being accessed by an operation in the OM, is sufficient to control concurrent operations on the same prefix. When the OM starts to process create “/a/b/c/1.txt”, a prefix lock is taken for “/a/b/c”... {quote} For the concurrency control, we create a lock for each parent prefix level. There will be a large number of lock instances to maintain in OM memory once there are millions of directories. The current approach is very fine-grained locking; have we considered partitioning the namespace instead? Divide the whole namespace into logical sub-namespaces by the prefix key; then each sub-namespace has its own lock. This is a compromise between having a single global exclusive lock and having an uncontrollable number of locks that depends on the number of parent prefixes. Is there a future plan for a way (API or command tool) to convert object keys to the Ozone FS namespace? Because the object store is now the major use case for users, maybe users want to access their data through a filesystem interface without moving the data. 
was (Author: linyiqun): Hi [~sdeka], I am reading for this design doc, some comments from me: For the Filesystem Namespace Operations, the ls(list files/folders) operation will also a common operation. But under current implementation, for example, list a directory, we have to traverse the whole directory/file table to lookup the child file/sub-folders.This is an ineffective way. Do we have any other improvement for this? Can we additionally store the child ID for each record in directory table? That can help us quickly find the child file or child folder. {quote} Associating a lock with each parent prefix being accessed by an operation in the OM, is sufficient to control concurrent operations on the same prefix. When the OM starts to process create “/a/b/c/1.txt”, a prefix lock is taken for “/a/b/c”... {quote} For the concurrency control, we create the lock for each parent prefix level. There will be large number of lock instances to be maintained in OM memory once there are millions of directory folders. Current way is so fine-grained lock way, have we considered about the partition namespace way? Divided the whole namespace into logic sub-namespaces by the prefix key. Then each sub-namespace will have its lock. This is a compromise approach than just having a global exclusive lock or having uncontrollable number of locks that depended on parent prefix's number. Is there a future plan to have a way(API or command Tool) to convert object key to Ozone FS namespace? Because object store is now the major use case for the users. Maybe users want to use a filesystem way to access the data without moving their data. 
> Ozone FS namespace > -- > > Key: HDDS-2939 > URL: https://issues.apache.org/jira/browse/HDDS-2939 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: Ozone Manager >Reporter: Supratim Deka >Assignee: Supratim Deka >Priority: Major > Attachments: Ozone FS Namespace Proposal v1.0.docx > > > Create the structures and metadata layout required to support efficient FS > namespace operations in Ozone - operations involving folders/directories > required to support the Hadoop compatible Filesystem interface. > The details are described in the attached document. The work is divided up > into sub-tasks as per the task list in the document. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org
[jira] [Commented] (HDDS-2939) Ozone FS namespace
[ https://issues.apache.org/jira/browse/HDDS-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025842#comment-17025842 ] Yiqun Lin commented on HDDS-2939: - Hi [~sdeka], I am reading this design doc; some comments from me: For the Filesystem Namespace Operations, ls (list files/folders) will also be a common operation. But under the current implementation, to list a directory, for example, we have to traverse the whole directory/file table to look up the child files/sub-folders. This is an ineffective way. Do we have any other improvement for this? Can we additionally store the child IDs for each record in the directory table? That would help us quickly find the child files or child folders. {quote} Associating a lock with each parent prefix being accessed by an operation in the OM, is sufficient to control concurrent operations on the same prefix. When the OM starts to process create “/a/b/c/1.txt”, a prefix lock is taken for “/a/b/c”... {quote} For the concurrency control, we create a lock for each parent prefix level. There will be a large number of lock instances to maintain in OM memory once there are millions of directories. The current approach is very fine-grained locking; have we considered partitioning the namespace instead? Divide the whole namespace into logical sub-namespaces by the prefix key; then each sub-namespace has its own lock. This is a compromise between having a single global exclusive lock and having an uncontrollable number of locks that depends on the number of parent prefixes. Is there a future plan for a way (API or command tool) to convert object keys to the Ozone FS namespace? Because the object store is now the major use case for users, maybe users want to access their data through a filesystem interface without moving the data. 
> Ozone FS namespace > -- > > Key: HDDS-2939 > URL: https://issues.apache.org/jira/browse/HDDS-2939 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: Ozone Manager >Reporter: Supratim Deka >Assignee: Supratim Deka >Priority: Major > Attachments: Ozone FS Namespace Proposal v1.0.docx > > > Create the structures and metadata layout required to support efficient FS > namespace operations in Ozone - operations involving folders/directories > required to support the Hadoop compatible Filesystem interface. > The details are described in the attached document. The work is divided up > into sub-tasks as per the task list in the document. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org
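The partitioning idea raised in the comment above is essentially lock striping: hash the parent prefix into a fixed number of stripes so the lock count stays bounded no matter how many directories exist, while unrelated prefixes can still proceed concurrently. A minimal sketch; the stripe count and key scheme are illustrative assumptions, not the design doc's choices:

```java
import java.util.concurrent.locks.ReentrantLock;

public class PrefixLockStripes {
    private final ReentrantLock[] stripes;

    public PrefixLockStripes(int nStripes) {
        stripes = new ReentrantLock[nStripes];
        for (int i = 0; i < nStripes; i++) {
            stripes[i] = new ReentrantLock();
        }
    }

    // Map a parent prefix like "/a/b/c" to one of the fixed stripes.
    // The same prefix always maps to the same lock; different prefixes
    // may share a stripe, which is the memory/concurrency trade-off.
    public ReentrantLock lockFor(String parentPrefix) {
        int idx = Math.floorMod(parentPrefix.hashCode(), stripes.length);
        return stripes[idx];
    }

    // Run an operation (e.g. create "/a/b/c/1.txt") under its parent's stripe lock.
    public void withPrefixLock(String parentPrefix, Runnable op) {
        ReentrantLock l = lockFor(parentPrefix);
        l.lock();
        try {
            op.run();
        } finally {
            l.unlock();
        }
    }

    public static void main(String[] args) {
        PrefixLockStripes stripes = new PrefixLockStripes(16);
        System.out.println(stripes.lockFor("/a/b/c") == stripes.lockFor("/a/b/c")); // true
        stripes.withPrefixLock("/a/b/c",
            () -> System.out.println("create /a/b/c/1.txt under prefix lock"));
    }
}
```

With a fixed stripe array, OM memory for locks is O(nStripes) rather than growing with the number of parent prefixes, at the cost of occasional false sharing when two prefixes hash to the same stripe.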
[jira] [Updated] (HDDS-2927) Cache EndPoint tasks instead of creating them all the time in RunningDatanodeState
[ https://issues.apache.org/jira/browse/HDDS-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDDS-2927: Summary: Cache EndPoint tasks instead of creating them all the time in RunningDatanodeState (was: Cache EndPoint tasks instead of creating them all the time) > Cache EndPoint tasks instead of creating them all the time in > RunningDatanodeState > -- > > Key: HDDS-2927 > URL: https://issues.apache.org/jira/browse/HDDS-2927 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Currently, we create EndPoint tasks all the time. This is an inefficient way, > we could cache these task as TODO comment suggested. > {code} > //TODO : Cache some of these tasks instead of creating them > //all the time. > private Callable > getEndPointTask(EndpointStateMachine endpoint) { > switch (endpoint.getState()) { > case GETVERSION: > return new VersionEndpointTask(endpoint, conf, context.getParent() > .getContainer()); > case REGISTER: > return RegisterEndpointTask.newBuilder() > .setConfig(conf) > .setEndpointStateMachine(endpoint) > .setContext(context) > .setDatanodeDetails(context.getParent().getDatanodeDetails()) > .setOzoneContainer(context.getParent().getContainer()) > .build(); > case HEARTBEAT: > return HeartbeatEndpointTask.newBuilder() > .setConfig(conf) > .setEndpointStateMachine(endpoint) > .setDatanodeDetails(context.getParent().getDatanodeDetails()) > .setContext(context) > .build(); > case SHUTDOWN: > break; > default: > throw new IllegalArgumentException("Illegal Argument."); > } > return null; >} > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org
[jira] [Updated] (HDDS-2927) Cache EndPoint tasks instead of creating them all the time
[ https://issues.apache.org/jira/browse/HDDS-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDDS-2927: Status: Patch Available (was: Open) > Cache EndPoint tasks instead of creating them all the time > -- > > Key: HDDS-2927 > URL: https://issues.apache.org/jira/browse/HDDS-2927 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Currently, we create EndPoint tasks all the time. This is an inefficient way, > we could cache these task as TODO comment suggested. > {code} > //TODO : Cache some of these tasks instead of creating them > //all the time. > private Callable > getEndPointTask(EndpointStateMachine endpoint) { > switch (endpoint.getState()) { > case GETVERSION: > return new VersionEndpointTask(endpoint, conf, context.getParent() > .getContainer()); > case REGISTER: > return RegisterEndpointTask.newBuilder() > .setConfig(conf) > .setEndpointStateMachine(endpoint) > .setContext(context) > .setDatanodeDetails(context.getParent().getDatanodeDetails()) > .setOzoneContainer(context.getParent().getContainer()) > .build(); > case HEARTBEAT: > return HeartbeatEndpointTask.newBuilder() > .setConfig(conf) > .setEndpointStateMachine(endpoint) > .setDatanodeDetails(context.getParent().getDatanodeDetails()) > .setContext(context) > .build(); > case SHUTDOWN: > break; > default: > throw new IllegalArgumentException("Illegal Argument."); > } > return null; >} > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org
[jira] [Created] (HDDS-2927) Cache EndPoint tasks instead of creating them all the time
Yiqun Lin created HDDS-2927: --- Summary: Cache EndPoint tasks instead of creating them all the time Key: HDDS-2927 URL: https://issues.apache.org/jira/browse/HDDS-2927 Project: Hadoop Distributed Data Store Issue Type: Improvement Reporter: Yiqun Lin Assignee: Yiqun Lin Currently, we create EndPoint tasks all the time. This is inefficient; we could cache these tasks, as the TODO comment suggests.
{code}
//TODO : Cache some of these tasks instead of creating them
//all the time.
private Callable getEndPointTask(EndpointStateMachine endpoint) {
  switch (endpoint.getState()) {
  case GETVERSION:
    return new VersionEndpointTask(endpoint, conf, context.getParent()
        .getContainer());
  case REGISTER:
    return RegisterEndpointTask.newBuilder()
        .setConfig(conf)
        .setEndpointStateMachine(endpoint)
        .setContext(context)
        .setDatanodeDetails(context.getParent().getDatanodeDetails())
        .setOzoneContainer(context.getParent().getContainer())
        .build();
  case HEARTBEAT:
    return HeartbeatEndpointTask.newBuilder()
        .setConfig(conf)
        .setEndpointStateMachine(endpoint)
        .setDatanodeDetails(context.getParent().getDatanodeDetails())
        .setContext(context)
        .build();
  case SHUTDOWN:
    break;
  default:
    throw new IllegalArgumentException("Illegal Argument.");
  }
  return null;
}
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org
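The caching suggested by that TODO can be sketched with a map keyed by endpoint state, built lazily with computeIfAbsent so each task is constructed once and reused on later calls. The State enum and the trivial task bodies below are simplified stand-ins for EndpointStateMachine's states and the Version/Register/Heartbeat task classes:

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.Callable;

public class EndpointTaskCache {
    // Simplified stand-in for the endpoint state machine's states.
    enum State { GETVERSION, REGISTER, HEARTBEAT }

    // One cached task per state; EnumMap keeps this compact and allocation-free per lookup.
    private final Map<State, Callable<State>> cache = new EnumMap<>(State.class);
    private int constructed = 0; // counts task constructions, to demonstrate reuse

    // Build the task on first request for a state; return the cached instance afterwards.
    // synchronized because EnumMap itself is not thread-safe.
    public synchronized Callable<State> getEndPointTask(State state) {
        return cache.computeIfAbsent(state, s -> {
            constructed++;
            return () -> s; // stand-in for VersionEndpointTask / RegisterEndpointTask / ...
        });
    }

    public synchronized int constructedCount() {
        return constructed;
    }

    public static void main(String[] args) {
        EndpointTaskCache c = new EndpointTaskCache();
        Callable<State> a = c.getEndPointTask(State.HEARTBEAT);
        Callable<State> b = c.getEndPointTask(State.HEARTBEAT);
        System.out.println(a == b);               // cached: same instance both times
        System.out.println(c.constructedCount()); // built exactly once
    }
}
```

One design caveat worth noting for the real change: caching is only safe if the task objects are stateless or reset themselves between runs, since the same instance would now be submitted repeatedly.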
[jira] [Updated] (HDDS-2910) OzoneManager startup failure with throwing unhelpful exception message
[ https://issues.apache.org/jira/browse/HDDS-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDDS-2910: Status: Patch Available (was: Open) > OzoneManager startup failure with throwing unhelpful exception message > -- > > Key: HDDS-2910 > URL: https://issues.apache.org/jira/browse/HDDS-2910 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Affects Versions: 0.4.1 >Reporter: Yiqun Lin >Assignee: Yiqun Lin >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Testing for the OM HA feature, I update the HA specific configurations and > then start up the OM service. But I find OM is not startup succeed and I > check the log, I get this error info and no any other helpful message. > {noformat} > ... > 2020-01-17 08:57:55,210 [main] INFO - registered UNIX signal handlers > for [TERM, HUP, INT] > 2020-01-17 08:57:55,846 [main] INFO - ozone.om.internal.service.id is > not defined, falling back to ozone.om.service.ids to find serviceID for > OzoneManager if it is HA enabled cluster > 2020-01-17 08:57:55,872 [main] INFO - Found matching OM address with > OMServiceId: om-service-test, OMNodeId: omNode-1, RPC Address: > lyq-m1-xx.xx.xx.xx:9862 and Ratis port: 9872 > 2020-01-17 08:57:55,872 [main] INFO - Setting configuration key > ozone.om.http-address with value of key ozone.om.http-address.omNode-1: > lyq-m1-xx.xx.xx.xx:9874 > 2020-01-17 08:57:55,872 [main] INFO - Setting configuration key > ozone.om.https-address with value of key ozone.om.https-address.omNode-1: > lyq-m1-xx.xx.xx.xx:9875 > 2020-01-17 08:57:55,872 [main] INFO - Setting configuration key > ozone.om.address with value of key ozone.om.address.omNode-1: > lyq-m1-xx.xx.xx.xx:9862 > OM not initialized. > 2020-01-17 08:57:55,887 [shutdown-hook-0] INFO - SHUTDOWN_MSG: > {noformat} > "OM not initialized" doesn't give me enough info, then I have to check the > related logic code. 
> Finally, I found I had made a mistake: I forgot to run the om --init command before starting up the OM.
> We can additionally add a suggestion here that will help users quickly understand the error and how to resolve it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org
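The improvement the reporter asks for could look like the following hedged sketch. This is not the actual OzoneManager code; the class, method, and version-file check are illustrative stand-ins for however the real OM detects an uninitialized state — the point is only that the shutdown message should name the `ozone om --init` step.

```java
// Hypothetical sketch (not the real OzoneManager startup path): when the
// OM storage has not been initialized, fail with a message that tells the
// user to run "ozone om --init" first, instead of a bare "OM not initialized.".
import java.io.File;

public class OmStartupCheck {

  // Builds the message logged before shutdown. The version-file existence
  // check is a stand-in for the real initialization check.
  static String startupMessage(File omVersionFile) {
    if (!omVersionFile.exists()) {
      return "OM not initialized. Please run 'ozone om --init' to initialize "
          + "the OM storage, then start the OM again.";
    }
    return "OM initialized.";
  }

  public static void main(String[] args) {
    // A path that does not exist simulates a cluster where --init was skipped.
    System.out.println(startupMessage(new File("/nonexistent/om/VERSION")));
  }
}
```

With a check like this, the log would point directly at the missing `om --init` step rather than leaving the user to read the source.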
[jira] [Commented] (HDDS-2031) Choose datanode for pipeline creation based on network topology
[ https://issues.apache.org/jira/browse/HDDS-2031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997406#comment-16997406 ] Yiqun Lin commented on HDDS-2031: - I am learning about topology awareness in Ozone. What's the status of this JIRA? I think this is a useful change for that feature.
> Choose datanode for pipeline creation based on network topology
> --
>
> Key: HDDS-2031
> URL: https://issues.apache.org/jira/browse/HDDS-2031
> Project: Hadoop Distributed Data Store
> Issue Type: Sub-task
> Reporter: Sammi Chen
> Assignee: Sammi Chen
> Priority: Major
>
> There are regular heartbeats between the datanodes in a pipeline. Choose datanodes based on network topology, to guarantee data reliability and to reduce the latency of heartbeat network traffic. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org
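The placement idea in this JIRA can be sketched as below. This is not the actual SCM placement policy — the class, node names, and rack names are illustrative assumptions. The sketch balances the two goals in the description: one member on a different rack for failure-domain reliability, and the remaining members on the anchor's rack so the pipeline's regular heartbeats stay rack-local.

```java
// Hypothetical sketch of topology-aware pipeline placement (not Ozone's
// real PlacementPolicy): pick an anchor node, one node on a different rack
// for reliability, then fill from the anchor's rack for low heartbeat latency.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;

public class TopologyAwarePicker {

  // nodeToRack maps datanode name -> network location, e.g. "/rack1".
  // A SortedMap keeps the selection deterministic for this sketch.
  static List<String> pickPipeline(SortedMap<String, String> nodeToRack) {
    List<String> picked = new ArrayList<>();
    String anchor = nodeToRack.firstKey();
    String anchorRack = nodeToRack.get(anchor);
    picked.add(anchor);
    // One node from a different rack, so a single rack failure cannot
    // take down the whole pipeline.
    for (Map.Entry<String, String> e : nodeToRack.entrySet()) {
      if (!e.getValue().equals(anchorRack)) {
        picked.add(e.getKey());
        break;
      }
    }
    // Fill up to 3 members from the anchor's rack, keeping the regular
    // heartbeats between most pipeline members rack-local.
    for (Map.Entry<String, String> e : nodeToRack.entrySet()) {
      if (picked.size() == 3) {
        break;
      }
      if (e.getValue().equals(anchorRack) && !picked.contains(e.getKey())) {
        picked.add(e.getKey());
      }
    }
    return picked;
  }
}
```

For example, with dn1/dn2 on /rack1 and dn3/dn4 on /rack2, the sketch anchors on dn1, adds dn3 for rack diversity, and completes the pipeline with dn2 from the anchor's rack.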
[jira] [Created] (HDDS-2690) Improve the command usage of audit parser tool
Yiqun Lin created HDDS-2690: --- Summary: Improve the command usage of audit parser tool Key: HDDS-2690 URL: https://issues.apache.org/jira/browse/HDDS-2690 Project: Hadoop Distributed Data Store Issue Type: Improvement Components: Tools Affects Versions: 0.4.0 Reporter: Yiqun Lin
I tested ozone-0.4 and found that the audit parser tool is not very user-friendly. When I pass the -h option to get the usage, only these few messages are printed:
{noformat}
Usage: ozone auditparser [-hV] [--verbose] [-D=]... [COMMAND]
Shell parser for Ozone Audit Logs
        Existing or new .db file
      --verbose    More verbose output. Show the stack trace of the errors.
  -D, --set=
  -h, --help       Show this help message and exit.
  -V, --version    Print version information and exit.
Commands:
  load, l      Load ozone audit log files
  template, t  Execute template query
  query, q     Execute custom query
{noformat}
Although it shows three subcommands, I still don't know the complete commands to execute: how do I load audit logs into the db, and which templates are available? I have to look up the detailed usage in the tool's documentation and then come back to run the command, which is not efficient. It would be better to add the necessary usage information (table structure, available templates, ...) to the command help. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org
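The kind of richer help text the report asks for could be sketched as below. This is a hypothetical example, not the audit parser's real help: the example file paths, the `top5cmds` template name, and the `audit` table name are illustrative assumptions shown only to demonstrate what complete, runnable example commands in the `-h` output might look like.

```java
// Hypothetical sketch of an extended help text for the audit parser
// (paths, template name, and table name below are illustrative examples,
// not necessarily the tool's real ones).
public class AuditParserHelp {

  static String extendedHelp() {
    return String.join(System.lineSeparator(),
        "Usage: ozone auditparser <db-file> [COMMAND]",
        "",
        "Examples:",
        "  # load an audit log file into a new or existing db file",
        "  ozone auditparser audit.db load /var/log/ozone/om-audit.log",
        "  # run a predefined template query (template name is an example)",
        "  ozone auditparser audit.db template top5cmds",
        "  # run a custom SQL query against the audit table",
        "  ozone auditparser audit.db query \"select count(*) from audit\"");
  }

  public static void main(String[] args) {
    System.out.println(extendedHelp());
  }
}
```

Embedding examples like these directly in the `-h` output would let users run the tool without first consulting the external documentation, which is exactly the gap this JIRA describes.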