[jira] [Commented] (HDDS-4308) Fix issue with quota update

2020-10-15 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-4308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215148#comment-17215148
 ] 

Yiqun Lin commented on HDDS-4308:
-

{quote}This might not be complete i believe, If 2 threads acquire copy object 
and if they update outside lock we have issue again. I think the whole 
operation should be performed under volume lock. (As we update in-memory it 
should be quick) But i agree that it might have performance impact across 
buckets when key writes happen.
{quote}
Using the volume lock during bucket operations makes the logic a little complex. 
 As the current PR change does:
 1) acquire bucket lock
 2) release bucket lock
 3) acquire volume lock
    update volume usedBytes usage
 4) release volume lock
 5) acquire bucket lock again (to finish the remaining operation)
 6) release bucket lock

Can we just make the method OMKeyRequest#getVolumeInfo thread safe so that it 
returns a copied object? That should be okay for the current issue and keeps the 
logic simpler.
 Like:
 Like:
{code:java}
  public static synchronized OmVolumeArgs getVolumeInfo(
      OMMetadataManager omMetadataManager, String volume) {
    return omMetadataManager.getVolumeTable()
        .getCacheValue(new CacheKey<>(omMetadataManager.getVolumeKey(volume)))
        .getCacheValue()
        .copyObject();
  }
{code}
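As a rough model of why returning a copy helps (illustrative names only, not the real Ozone classes; the metadata manager and table are simplified away into a plain map):

```java
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the copy-on-read idea: the getter hands out a copy,
// so later mutations by request handlers never leak into the shared cache entry.
public class VolumeCacheSketch {
    static class VolumeArgs {
        long usedBytes;
        VolumeArgs(long usedBytes) { this.usedBytes = usedBytes; }
        VolumeArgs copyObject() { return new VolumeArgs(usedBytes); }
    }

    private final ConcurrentHashMap<String, VolumeArgs> cache = new ConcurrentHashMap<>();

    public synchronized VolumeArgs getVolumeInfo(String volume) {
        return cache.get(volume).copyObject();  // copy, not the cached instance
    }

    // Returns the cached usedBytes after a caller mutates its copy:
    // with copyObject() the cached value is untouched.
    public static long demo() {
        VolumeCacheSketch s = new VolumeCacheSketch();
        s.cache.put("vol1", new VolumeArgs(10000));
        VolumeArgs copy = s.getVolumeInfo("vol1");
        copy.usedBytes -= 1000;                 // caller's update happens on the copy
        return s.cache.get("vol1").usedBytes;
    }
}
```

The synchronized keyword alone is not what protects the cache here; the copy is, because no caller ever holds a reference to the shared instance.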

> Fix issue with quota update
> ---
>
> Key: HDDS-4308
> URL: https://issues.apache.org/jira/browse/HDDS-4308
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Bharat Viswanadham
>Assignee: mingchao zhao
>Priority: Blocker
>  Labels: pull-request-available
>
> Currently volumeArgs uses getCacheValue and puts the same object in the 
> doubleBuffer; this might cause an issue.
> Let's take the below scenario:
> InitialVolumeArgs quotaBytes -> 10000
> 1. T1 -> updates VolumeArgs, subtracting 1000, and puts this updated 
> volumeArgs into the DoubleBuffer.
> 2. T2 -> updates VolumeArgs, subtracting 2000, and has not yet been flushed to 
> the double buffer.
> *Now at the end of flushing these transactions, our DB should have 7000 as 
> bytes used.*
> Now T1 is picked up by the double buffer, and when it commits, since it uses the 
> cached object put into the doubleBuffer, it flushes the value already updated by 
> T2 (as it is a cached object) and writes bytesUsed as 7000 to the DB.
> Now OM has restarted, and the DB only has transactions up to T1. (We get this 
> info from the TransactionInfo 
> Table, https://issues.apache.org/jira/browse/HDDS-3685)
> Now T2 is replayed again; as it was not committed to the DB, 2000 will be 
> subtracted again, and the DB will have 5000.
> But after T2, the value should be 7000, so the DB is in an incorrect state.
> Issue here:
> 1. As we use a cached object and put the same cached object into the double 
> buffer, this kind of issue can occur. 
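The quoted scenario can be modeled in a few lines (an illustrative sketch, not Ozone code): the cache and the double buffer share one mutable object, so T1's flush already carries T2's change, and the replay then double-counts it.

```java
import java.util.ArrayDeque;

// Sketch of the replay bug: one shared mutable object stands in for the
// cached volume args that is also placed into the double buffer.
public class SharedObjectReplay {
    static class Volume {
        long usedBytes;
        Volume(long usedBytes) { this.usedBytes = usedBytes; }
    }

    public static long replayBug() {
        Volume cached = new Volume(10000);
        ArrayDeque<Volume> doubleBuffer = new ArrayDeque<>();

        cached.usedBytes -= 1000;     // T1 updates the cached object...
        doubleBuffer.add(cached);     // ...and enqueues the SAME object
        cached.usedBytes -= 2000;     // T2 updates before T1 is flushed

        long db = doubleBuffer.poll().usedBytes;  // T1 flush writes 7000, not 9000

        // OM restarts: DB only has T1, so T2 is replayed against the DB value.
        db -= 2000;                   // 7000 - 2000 = 5000, but 7000 is correct
        return db;
    }
}
```

Enqueuing `cached.copyObject()` instead of `cached` would make the T1 flush write 9000, and the replayed T2 would land on the correct 7000.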



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4280) Document notable configurations for Recon

2020-10-05 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-4280:

Fix Version/s: 1.1.0

> Document notable configurations for Recon 
> --
>
> Key: HDDS-4280
> URL: https://issues.apache.org/jira/browse/HDDS-4280
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: Ozone Recon
>Affects Versions: 1.0.0
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.1.0
>
>
> In [Recon doc 
> link|https://hadoop.apache.org/ozone/docs/1.0.0/feature/recon.html], there is 
> no helpful description of how to quickly set up the Recon server. As Recon 
> is one major feature of the Ozone 1.0 release, we need to complete this document.






[jira] [Comment Edited] (HDDS-4285) Read is slow due to the frequent usage of UGI.getCurrentUserCall()

2020-09-28 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203321#comment-17203321
 ] 

Yiqun Lin edited comment on HDDS-4285 at 9/28/20, 3:58 PM:
---

Looking into this, I am thinking of two approaches:

1. Initialize a UGI instance in ChunkInputStream (or other invoking places), then 
set the UGI in XceiverClientSpi, and extract the UGI and get the token string in 
the ContainerProtocolCalls method.

2. Make UGI a thread-local field in ContainerProtocolCalls, and then reset 
ContainerProtocolCalls#UGI in ChunkInputStream or other places.

#1 is a more generic approach; the UGI stored in XceiverClientSpi can be reused 
in other places.
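Approach #2 can be sketched roughly as below. This is a hand-rolled stand-in: UserContext, setUser, and tokenFor are hypothetical names, not the real UGI or ContainerProtocolCalls API.

```java
// Sketch of a thread-local user cache: the hot read path reuses the
// per-thread context instead of doing an expensive lookup on every call.
public class ThreadLocalUserCache {
    public static class UserContext {
        public final String name;
        public UserContext(String name) { this.name = name; }
    }

    private static final ThreadLocal<UserContext> CURRENT = new ThreadLocal<>();

    // Call sites (e.g. stream setup) install/reset the user once per thread.
    public static void setUser(UserContext ctx) { CURRENT.set(ctx); }

    // Hot path: a plain thread-local read, no per-call user lookup.
    public static String tokenFor(String blockId) {
        return CURRENT.get().name + ":" + blockId;
    }
}
```

The trade-off versus approach #1 is visible here: the cache lives in a static field of one class, so other components cannot reuse it the way a field on XceiverClientSpi could be reused.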
  


was (Author: linyiqun):
Looking into this, I am thinking of two approaches for this:

1. Initialize UGI instance in ChunkInputStream (or other invoke places), then 
set UGI in XceiverClientSpi,  extract UGI and get token string in 
ContainerProtocolCalls method.

2. Make UGI as a thread local field in ContainerProtocolCalls, and then set UGI 
in ChunkInputStream or other similar places.

#1 is a more generic approach, UGI stored in XceiverClientSpi can be reused in 
other places.
  

> Read is slow due to the frequent usage of UGI.getCurrentUserCall()
> --
>
> Key: HDDS-4285
> URL: https://issues.apache.org/jira/browse/HDDS-4285
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Marton Elek
>Assignee: Marton Elek
>Priority: Major
> Attachments: image-2020-09-28-16-19-17-581.png, 
> profile-20200928-161631-180518.svg
>
>
> Ozone read operation turned out to be slow mainly because we do a new 
> UGI.getCurrentUser for block token for each of the calls.
> We need to cache the block token / UGI.getCurrentUserCall() to make it faster.
>  !image-2020-09-28-16-19-17-581.png! 
> To reproduce:
> Checkout: https://github.com/elek/hadoop-ozone/tree/mocked-read
> {code}
> cd hadoop-ozone/client
> export 
> MAVEN_OPTS=-agentpath:/home/elek/prog/async-profiler/build/libasyncProfiler.so=start,file=/tmp/profile-%t-%p.svg
> mvn compile exec:java 
> -Dexec.mainClass=org.apache.hadoop.ozone.client.io.TestKeyOutputStreamUnit 
> -Dexec.classpathScope=test
> {code}






[jira] [Comment Edited] (HDDS-4285) Read is slow due to the frequent usage of UGI.getCurrentUserCall()

2020-09-28 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203321#comment-17203321
 ] 

Yiqun Lin edited comment on HDDS-4285 at 9/28/20, 3:56 PM:
---

Looking into this, I am thinking of two approaches:

1. Initialize a UGI instance in ChunkInputStream (or other invoking places), then 
set the UGI in XceiverClientSpi, and extract the UGI and get the token string in 
the ContainerProtocolCalls method.

2. Make UGI a thread-local field in ContainerProtocolCalls, and then set the UGI 
in ChunkInputStream or other similar places.

#1 is a more generic approach; the UGI stored in XceiverClientSpi can be reused 
in other places.
  


was (Author: linyiqun):
Looking into this, I am thinking of two approaches for this:

1. Get UGI instance in ChunkInputStream (or other invoke places), then set UGI 
in XceiverClientSpi,  extract UGI and get token string in 
ContainerProtocolCalls method.

2. Make UGI as a thread local field in ContainerProtocolCalls, and then set UGI 
in ChunkInputStream or other similar places.

#1 is a more generic approach, UGI stored in XceiverClientSpi can be reused in 
other places.
  

> Read is slow due to the frequent usage of UGI.getCurrentUserCall()
> --
>
> Key: HDDS-4285
> URL: https://issues.apache.org/jira/browse/HDDS-4285
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Marton Elek
>Assignee: Marton Elek
>Priority: Major
> Attachments: image-2020-09-28-16-19-17-581.png, 
> profile-20200928-161631-180518.svg
>
>
> Ozone read operation turned out to be slow mainly because we do a new 
> UGI.getCurrentUser for block token for each of the calls.
> We need to cache the block token / UGI.getCurrentUserCall() to make it faster.
>  !image-2020-09-28-16-19-17-581.png! 
> To reproduce:
> Checkout: https://github.com/elek/hadoop-ozone/tree/mocked-read
> {code}
> cd hadoop-ozone/client
> export 
> MAVEN_OPTS=-agentpath:/home/elek/prog/async-profiler/build/libasyncProfiler.so=start,file=/tmp/profile-%t-%p.svg
> mvn compile exec:java 
> -Dexec.mainClass=org.apache.hadoop.ozone.client.io.TestKeyOutputStreamUnit 
> -Dexec.classpathScope=test
> {code}






[jira] [Commented] (HDDS-4285) Read is slow due to the frequent usage of UGI.getCurrentUserCall()

2020-09-28 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203321#comment-17203321
 ] 

Yiqun Lin commented on HDDS-4285:
-

Looking into this, I am thinking of two approaches:

1. Get a UGI instance in ChunkInputStream (or other invoking places), then set 
the UGI in XceiverClientSpi, and extract the UGI and get the token string in 
the ContainerProtocolCalls method.

2. Make UGI a thread-local field in ContainerProtocolCalls, and then set the UGI 
in ChunkInputStream or other similar places.

#1 is a more generic approach; the UGI stored in XceiverClientSpi can be reused 
in other places.
  

> Read is slow due to the frequent usage of UGI.getCurrentUserCall()
> --
>
> Key: HDDS-4285
> URL: https://issues.apache.org/jira/browse/HDDS-4285
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Marton Elek
>Assignee: Marton Elek
>Priority: Major
> Attachments: image-2020-09-28-16-19-17-581.png, 
> profile-20200928-161631-180518.svg
>
>
> Ozone read operation turned out to be slow mainly because we do a new 
> UGI.getCurrentUser for block token for each of the calls.
> We need to cache the block token / UGI.getCurrentUserCall() to make it faster.
>  !image-2020-09-28-16-19-17-581.png! 
> To reproduce:
> Checkout: https://github.com/elek/hadoop-ozone/tree/mocked-read
> {code}
> cd hadoop-ozone/client
> export 
> MAVEN_OPTS=-agentpath:/home/elek/prog/async-profiler/build/libasyncProfiler.so=start,file=/tmp/profile-%t-%p.svg
> mvn compile exec:java 
> -Dexec.mainClass=org.apache.hadoop.ozone.client.io.TestKeyOutputStreamUnit 
> -Dexec.classpathScope=test
> {code}






[jira] [Commented] (HDDS-4283) Remove unsupported upgrade command in ozone cli

2020-09-28 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203145#comment-17203145
 ] 

Yiqun Lin commented on HDDS-4283:
-

Thanks [~adoroszlai] for the reference; closing this JIRA.

> Remove unsupported upgrade command in ozone cli
> ---
>
> Key: HDDS-4283
> URL: https://issues.apache.org/jira/browse/HDDS-4283
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Minor
>
> In HDDS-1383, we introduced a new upgrade command to support in-place 
> upgrade from HDFS to Ozone.
> {noformat}
> upgrade   HDFS to Ozone in-place upgrade tool
> 
> Usage: ozone upgrade [-hV] [--verbose] [-conf=]
>  [-D=]... [COMMAND]
> Convert raw HDFS data to Ozone data without data movement.
>   --verbose   More verbose output. Show the stack trace of the errors.
>   -conf=
>   -D, --set=
>   -h, --help  Show this help message and exit.
>   -V, --version   Print version information and exit.
> Commands:
>   plan Plan existing HDFS block distribution and give.estimation.
>   balance  Move the HDFS blocks for a better distribution usage.
>   execute  Start/restart upgrade from HDFS to Ozone cluster.
> [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade plan
> [In-Place upgrade : plan] is not yet supported.
> [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade balance
> [In-Place upgrade : balance] is not yet supported.
> [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade execute
> In-Place upgrade : execute] is not yet supported
> {noformat}
> But this feature has not been implemented yet, and it is a very big feature. 
>  I don't think it's good to expose a CLI command that is not supported and 
> cannot be quickly implemented in the short term.






[jira] [Resolved] (HDDS-4283) Remove unsupported upgrade command in ozone cli

2020-09-28 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin resolved HDDS-4283.
-
Resolution: Duplicate

> Remove unsupported upgrade command in ozone cli
> ---
>
> Key: HDDS-4283
> URL: https://issues.apache.org/jira/browse/HDDS-4283
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Minor
>
> In HDDS-1383, we introduced a new upgrade command to support in-place 
> upgrade from HDFS to Ozone.
> {noformat}
> upgrade   HDFS to Ozone in-place upgrade tool
> 
> Usage: ozone upgrade [-hV] [--verbose] [-conf=]
>  [-D=]... [COMMAND]
> Convert raw HDFS data to Ozone data without data movement.
>   --verbose   More verbose output. Show the stack trace of the errors.
>   -conf=
>   -D, --set=
>   -h, --help  Show this help message and exit.
>   -V, --version   Print version information and exit.
> Commands:
>   plan Plan existing HDFS block distribution and give.estimation.
>   balance  Move the HDFS blocks for a better distribution usage.
>   execute  Start/restart upgrade from HDFS to Ozone cluster.
> [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade plan
> [In-Place upgrade : plan] is not yet supported.
> [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade balance
> [In-Place upgrade : balance] is not yet supported.
> [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade execute
> In-Place upgrade : execute] is not yet supported
> {noformat}
> But this feature has not been implemented yet, and it is a very big feature. 
>  I don't think it's good to expose a CLI command that is not supported and 
> cannot be quickly implemented in the short term.






[jira] [Updated] (HDDS-4283) Remove unsupported upgrade command in ozone cli

2020-09-27 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-4283:

Description: 
In HDDS-1383, we introduced a new upgrade command to support in-place 
upgrade from HDFS to Ozone.
{noformat}
upgrade   HDFS to Ozone in-place upgrade tool

Usage: ozone upgrade [-hV] [--verbose] [-conf=]
 [-D=]... [COMMAND]
Convert raw HDFS data to Ozone data without data movement.
  --verbose   More verbose output. Show the stack trace of the errors.
  -conf=

  -D, --set=

  -h, --help  Show this help message and exit.
  -V, --version   Print version information and exit.
Commands:
  plan Plan existing HDFS block distribution and give.estimation.
  balance  Move the HDFS blocks for a better distribution usage.
  execute  Start/restart upgrade from HDFS to Ozone cluster.

[hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade plan
[In-Place upgrade : plan] is not yet supported.
[hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade balance
[In-Place upgrade : balance] is not yet supported.
[hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade execute
In-Place upgrade : execute] is not yet supported
{noformat}
But this feature has not been implemented yet, and it is a very big feature. 
 I don't think it's good to expose a CLI command that is not supported and 
cannot be quickly implemented in the short term.

  was:
In HDDS-1383, we introduce a new upgrade command for supporting to in-place 
upgrade from HDFS to Ozone.
{noformat}
upgrade   HDFS to Ozone in-place upgrade tool

Usage: ozone upgrade [-hV] [--verbose] [-conf=]
 [-D=]... [COMMAND]
Convert raw HDFS data to Ozone data without data movement.
  --verbose   More verbose output. Show the stack trace of the errors.
  -conf=

  -D, --set=

  -h, --help  Show this help message and exit.
  -V, --version   Print version information and exit.
Commands:
  plan Plan existing HDFS block distribution and give.estimation.
  balance  Move the HDFS blocks for a better distribution usage.
  execute  Start/restart upgrade from HDFS to Ozone cluster.

[hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade plan
[In-Place upgrade : plan] is not yet supported.
[hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade balance
[In-Place upgrade : balance] is not yet supported.
[hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade execute
In-Place upgrade : execute] is not yet supported
{noformat}
But this feature has not been implemented yet and is a very big feature. 
 I don't think it's good to expose a cli command that is not supported and 
meanwhile cannot be quickly implemented in the short term.


> Remove unsupported upgrade command in ozone cli
> ---
>
> Key: HDDS-4283
> URL: https://issues.apache.org/jira/browse/HDDS-4283
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Minor
>
> In HDDS-1383, we introduced a new upgrade command to support in-place 
> upgrade from HDFS to Ozone.
> {noformat}
> upgrade   HDFS to Ozone in-place upgrade tool
> 
> Usage: ozone upgrade [-hV] [--verbose] [-conf=]
>  [-D=]... [COMMAND]
> Convert raw HDFS data to Ozone data without data movement.
>   --verbose   More verbose output. Show the stack trace of the errors.
>   -conf=
>   -D, --set=
>   -h, --help  Show this help message and exit.
>   -V, --version   Print version information and exit.
> Commands:
>   plan Plan existing HDFS block distribution and give.estimation.
>   balance  Move the HDFS blocks for a better distribution usage.
>   execute  Start/restart upgrade from HDFS to Ozone cluster.
> [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade plan
> [In-Place upgrade : plan] is not yet supported.
> [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade balance
> [In-Place upgrade : balance] is not yet supported.
> [hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade execute
> In-Place upgrade : execute] is not yet supported
> {noformat}
> But this feature has not been implemented yet, and it is a very big feature. 
>  I don't think it's good to expose a CLI command that is not supported and 
> cannot be quickly implemented in the short term.






[jira] [Created] (HDDS-4283) Remove unsupported upgrade command in ozone cli

2020-09-27 Thread Yiqun Lin (Jira)
Yiqun Lin created HDDS-4283:
---

 Summary: Remove unsupported upgrade command in ozone cli
 Key: HDDS-4283
 URL: https://issues.apache.org/jira/browse/HDDS-4283
 Project: Hadoop Distributed Data Store
  Issue Type: Bug
Reporter: Yiqun Lin
Assignee: Yiqun Lin


In HDDS-1383, we introduced a new upgrade command to support in-place 
upgrade from HDFS to Ozone.
{noformat}
upgrade   HDFS to Ozone in-place upgrade tool

Usage: ozone upgrade [-hV] [--verbose] [-conf=]
 [-D=]... [COMMAND]
Convert raw HDFS data to Ozone data without data movement.
  --verbose   More verbose output. Show the stack trace of the errors.
  -conf=

  -D, --set=

  -h, --help  Show this help message and exit.
  -V, --version   Print version information and exit.
Commands:
  plan Plan existing HDFS block distribution and give.estimation.
  balance  Move the HDFS blocks for a better distribution usage.
  execute  Start/restart upgrade from HDFS to Ozone cluster.

[hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade plan
[In-Place upgrade : plan] is not yet supported.
[hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade balance
[In-Place upgrade : balance] is not yet supported.
[hdfs@lyq ~]$ ~/ozone/bin/ozone upgrade execute
In-Place upgrade : execute] is not yet supported
{noformat}
But this feature has not been implemented yet, and it is a very big feature. 
 I don't think it's good to expose a CLI command that is not supported and 
cannot be quickly implemented in the short term.






[jira] [Updated] (HDDS-4280) Document notable configurations for Recon

2020-09-26 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-4280:

Status: Patch Available  (was: Open)

> Document notable configurations for Recon 
> --
>
> Key: HDDS-4280
> URL: https://issues.apache.org/jira/browse/HDDS-4280
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: Ozone Recon
>Affects Versions: 1.0.0
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Minor
>  Labels: pull-request-available
>
> In [Recon doc 
> link|https://hadoop.apache.org/ozone/docs/1.0.0/feature/recon.html], there is 
> no helpful description of how to quickly set up the Recon server. As Recon 
> is one major feature of the Ozone 1.0 release, we need to complete this document.






[jira] [Updated] (HDDS-4280) Document notable configurations for Recon

2020-09-26 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-4280:

Summary: Document notable configurations for Recon   (was: Document notable 
configuration for Recon )

> Document notable configurations for Recon 
> --
>
> Key: HDDS-4280
> URL: https://issues.apache.org/jira/browse/HDDS-4280
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: Ozone Recon
>Affects Versions: 1.0.0
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Minor
>  Labels: pull-request-available
>
> In [Recon doc 
> link|https://hadoop.apache.org/ozone/docs/1.0.0/feature/recon.html], there is 
> no helpful description of how to quickly set up the Recon server. As Recon 
> is one major feature of the Ozone 1.0 release, we need to complete this document.






[jira] [Created] (HDDS-4280) Document notable configuration for Recon

2020-09-26 Thread Yiqun Lin (Jira)
Yiqun Lin created HDDS-4280:
---

 Summary: Document notable configuration for Recon 
 Key: HDDS-4280
 URL: https://issues.apache.org/jira/browse/HDDS-4280
 Project: Hadoop Distributed Data Store
  Issue Type: Improvement
  Components: Ozone Recon
Affects Versions: 1.0.0
Reporter: Yiqun Lin
Assignee: Yiqun Lin


In [Recon doc 
link|https://hadoop.apache.org/ozone/docs/1.0.0/feature/recon.html], there is 
no helpful description of how to quickly set up the Recon server. As Recon is 
one major feature of the Ozone 1.0 release, we need to complete this document.






[jira] [Created] (HDDS-4267) Ozone command always print warn message before execution

2020-09-22 Thread Yiqun Lin (Jira)
Yiqun Lin created HDDS-4267:
---

 Summary: Ozone command always print warn message before execution
 Key: HDDS-4267
 URL: https://issues.apache.org/jira/browse/HDDS-4267
 Project: Hadoop Distributed Data Store
  Issue Type: Improvement
  Components: Ozone CLI
Reporter: Yiqun Lin


Ozone commands always print a warning message before execution:
{noformat}
[hdfs@lyq yiqlin]$ ~/ozone/bin/ozone version
/home/hdfs/releases/ozone-1.0.0/etc/hadoop/hadoop-env.sh: line 34: ulimit: core 
file size: cannot modify limit: Operation not permitted
{noformat}
{noformat}
[hdfs@ yiqlin]$ ~/ozone/bin/ozone sh volume list
/home/hdfs/releases/ozone-1.0.0/etc/hadoop/hadoop-env.sh: line 34: ulimit: core 
file size: cannot modify limit: Operation not permitted
{noformat}
This is because the hdfs user in my cluster cannot execute the below command in 
hadoop-env.sh:
{noformat}
# # Enable core dump when crash in C++
ulimit -c unlimited
{noformat}
ulimit -c was introduced in HDDS-3941. The root cause seems to be that ulimit -c 
requires the root user to execute, but the hdfs user in my environment is a 
non-root user.
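One possible mitigation (a suggestion only, not the fix adopted by the project) is to guard the call in hadoop-env.sh so non-privileged users do not see the warning on every command:

```shell
# Try to raise the core dump limit, but stay quiet when the user
# lacks the privilege (e.g. a non-root hdfs user).
if ulimit -c unlimited 2>/dev/null; then
  echo "core dumps enabled"
else
  echo "core dump limit unchanged (insufficient privilege)"
fi
```

Either branch leaves the command chain successful, so the rest of hadoop-env.sh runs cleanly regardless of privilege.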






[jira] [Commented] (HDDS-4222) [OzoneFS optimization] Provide a mechanism for efficient path lookup

2020-09-11 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194239#comment-17194239
 ] 

Yiqun Lin commented on HDDS-4222:
-

[~rakeshr], thanks for the explanation, it's very clearly explained. I'm also +1 
for letting the KeyDeletingService help remove deleted cache entries here.

> [OzoneFS optimization] Provide a mechanism for efficient path lookup
> 
>
> Key: HDDS-4222
> URL: https://issues.apache.org/jira/browse/HDDS-4222
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>Reporter: Rakesh Radhakrishnan
>Assignee: Rakesh Radhakrishnan
>Priority: Major
> Attachments: Ozone FS Optimizations - Efficient Lookup using cache.pdf
>
>
> With the new file system HDDS-2939 like semantics design it requires multiple 
> DB lookups to traverse the path component in top-down fashion. This task to 
> discuss use cases and proposals to reduce the performance penalties during 
> path lookups.






[jira] [Comment Edited] (HDDS-4222) [OzoneFS optimization] Provide a mechanism for efficient path lookup

2020-09-09 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192947#comment-17192947
 ] 

Yiqun Lin edited comment on HDDS-4222 at 9/9/20, 3:29 PM:
--

Thanks for attaching the dir cache design, [~rakeshr]!
 I agree with most of the current design details.

>For the consistency part, this is a very good point and will take care during 
>the implementation phase. I was thinking to update the cache during write and 
>read paths to avoid additional cache refresh cycle.
 I'm +1 for this way as the current initial implementation.

>Rename and Delete ops will require only one entry update as it maintains 
>similar structure in the DB Directory Table.
 Delete ops are also not friendly to Approach-3.

Example:

*DirTable:*
|CacheKey(PathElement)|{color:#ff8b00}*ObjectID*{color}|
|512/a|1025|
|1025/b|1026|
|1026/c|1027|
|1027/d|1028|
|1025/e|1029|

If we delete dir 512/a, we have to look up the whole dir cache to find the keys 
whose parent objectID is 1025 and then delete them. So the delete op here still 
seems like a very expensive op.

>Delete ops will require only one entry update
 If we use the sync delete way, it will not update only one entry (as explained 
in the above example).
 So does this mean the async delete way here, like the bucket key deletion mechanism?
 1) Mark the deleted key so it cannot be accessed (e.g. add a prefix to the key).
 2) Asynchronously remove the keys that need to be deleted.
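The scan cost described above can be sketched as below (illustrative Java, assuming the flat "parentObjectID/name" -> objectID layout from the table; not real Ozone code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// With a flat (parentObjectID + "/" + name) -> objectID cache, finding the
// children of a directory being deleted means scanning every cache key.
public class DirCacheScan {
    public static List<String> childrenOf(Map<String, Long> dirTable, long parentId) {
        String prefix = parentId + "/";
        List<String> hits = new ArrayList<>();
        for (String key : dirTable.keySet()) {   // O(cache size): the expensive part
            if (key.startsWith(prefix)) {
                hits.add(key);
            }
        }
        return hits;
    }

    // The DirTable from the example above (sorted keys for stable iteration).
    public static Map<String, Long> exampleTable() {
        Map<String, Long> t = new TreeMap<>();
        t.put("512/a", 1025L);
        t.put("1025/b", 1026L);
        t.put("1026/c", 1027L);
        t.put("1027/d", 1028L);
        t.put("1025/e", 1029L);
        return t;
    }
}
```

Deleting 512/a (objectID 1025) must visit all five entries just to discover 1025/b and 1025/e, which is why a sync delete cannot be a single-entry update in this layout.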


was (Author: linyiqun):
Thanks for attaching the dir cache design, [~rakeshr]!
 I agree most of the current design details.

>For the consistency part, this is a very good point and will take care during 
>the implementation phase. I was thinking to update the cache during write and 
>read paths to avoid additional cache refresh cycle.
 I'm +1 for this way as current initial implementation.

>Rename and Delete ops will require only one entry update as it maintains 
>similar structure in the DB Directory Table.
 Delete ops is also not friendly for the Approach-3.

Example:

*DirTable:*
|CacheKey(PathElement)|{color:#ff8b00}*ObjectID*{color}|
|512/a|1025|
|1025/b|1026|
|1026/c|1027|
|1027/d|1028|
|1025/e|1029|

If we delete dir 512/a, it should lookup the whole dir cache and find the key 
which parent objectID is 1025 and then be deleted. So delete ops here seems 
still the very expensive ops.

>Delete ops will require only one entry update
 If we use the sync delete way, it will not update for only one entry (as 
explained in above example). 
 So is this mean the async delete way here, like bucket key deletion mechanism?
 1) Mark the delete key,(add prefix in key to let it not be accessed)
 2) Async to remove these keys that needed to be deleted.

> [OzoneFS optimization] Provide a mechanism for efficient path lookup
> 
>
> Key: HDDS-4222
> URL: https://issues.apache.org/jira/browse/HDDS-4222
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>Reporter: Rakesh Radhakrishnan
>Assignee: Rakesh Radhakrishnan
>Priority: Major
> Attachments: Ozone FS Optimizations - Efficient Lookup using cache.pdf
>
>
> With the new file system HDDS-2939 like semantics design it requires multiple 
> DB lookups to traverse the path component in top-down fashion. This task to 
> discuss use cases and proposals to reduce the performance penalties during 
> path lookups.






[jira] [Commented] (HDDS-4222) [OzoneFS optimization] Provide a mechanism for efficient path lookup

2020-09-09 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192947#comment-17192947
 ] 

Yiqun Lin commented on HDDS-4222:
-

Thanks for attaching the dir cache design, [~rakeshr]!
 I agree with most of the current design details.

>For the consistency part, this is a very good point and will take care during 
>the implementation phase. I was thinking to update the cache during write and 
>read paths to avoid additional cache refresh cycle.
 I'm +1 for this way as the current initial implementation.

>Rename and Delete ops will require only one entry update as it maintains 
>similar structure in the DB Directory Table.
 Delete ops are also not friendly to Approach-3.

Example:

*DirTable:*
|CacheKey(PathElement)|{color:#ff8b00}*ObjectID*{color}|
|512/a|1025|
|1025/b|1026|
|1026/c|1027|
|1027/d|1028|
|1025/e|1029|

If we delete dir 512/a, we have to scan the whole dir cache to find every key 
whose parent objectID is 1025 and delete those entries as well, recursively. So 
the delete op here still seems very expensive.

>Delete ops will require only one entry update
 If we use a synchronous delete, it will not be a single-entry update (as 
explained in the above example).
 So does this mean an async delete approach here, like the bucket key deletion mechanism?
 1) Mark the key as deleted (add a prefix to the key so it cannot be accessed).
 2) Asynchronously remove the keys that need to be deleted.
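The cost of the recursive delete described above can be sketched in a few lines. This is a toy in-memory model (a hypothetical `DirTableDeleteSketch`, not the actual OM code), assuming entries are keyed as `<parentObjectID>/<name>` as in the DirTable example:

```java
import java.util.*;

public class DirTableDeleteSketch {
  /** Deletes dir `key` plus its whole subtree; returns how many entries were scanned. */
  public static int deleteSubtree(Map<String, Long> dirTable, String key) {
    Deque<Long> pending = new ArrayDeque<>();
    pending.add(dirTable.remove(key));          // objectID of the deleted dir
    int scans = 0;
    while (!pending.isEmpty()) {
      String prefix = pending.removeFirst() + "/";
      // One pass over the remaining table per deleted directory: the expensive part.
      for (Iterator<Map.Entry<String, Long>> it = dirTable.entrySet().iterator(); it.hasNext(); ) {
        Map.Entry<String, Long> e = it.next();
        scans++;
        if (e.getKey().startsWith(prefix)) {    // child of the dir being deleted
          pending.add(e.getValue());
          it.remove();
        }
      }
    }
    return scans;
  }

  public static void main(String[] args) {
    // The DirTable from the example above.
    Map<String, Long> dirTable = new LinkedHashMap<>();
    dirTable.put("512/a", 1025L);
    dirTable.put("1025/b", 1026L);
    dirTable.put("1026/c", 1027L);
    dirTable.put("1027/d", 1028L);
    dirTable.put("1025/e", 1029L);
    int scans = deleteSubtree(dirTable, "512/a");
    // Deleting one 5-entry subtree already costs 8 entry scans here.
    System.out.println("scanned=" + scans + " remaining=" + dirTable.size());
  }
}
```

Even this tiny 5-entry table needs one full pass per deleted directory, which is why a synchronous delete is far from a single-entry update.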

> [OzoneFS optimization] Provide a mechanism for efficient path lookup
> 
>
> Key: HDDS-4222
> URL: https://issues.apache.org/jira/browse/HDDS-4222
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>Reporter: Rakesh Radhakrishnan
>Assignee: Rakesh Radhakrishnan
>Priority: Major
> Attachments: Ozone FS Optimizations - Efficient Lookup using cache.pdf
>
>
> With the new file system HDDS-2939 like semantics design it requires multiple 
> DB lookups to traverse the path component in top-down fashion. This task to 
> discuss use cases and proposals to reduce the performance penalties during 
> path lookups.






[jira] [Comment Edited] (HDDS-2939) Ozone FS namespace

2020-09-03 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190253#comment-17190253
 ] 

Yiqun Lin edited comment on HDDS-2939 at 9/3/20, 4:19 PM:
--

Some discussion about the latest status of HDDS-2939 that I asked about on the 
mailing list.
 From [~rakeshr]:
{quote}Presently, I am working on the directory cache design and upgrade design.
 These two tasks are very important as the first one would help to *reduce
 the performance penalties on the path traversal*. Later one is to provide
 an efficient way to make a smooth upgrade experience to the users.
{quote}
Here the directory cache is used to avoid the additional lookup overhead. The 
latest design of the directory cache hasn't been attached yet, but some 
thoughts from me:

I think two types of mapping cache will be useful:
 * , like , so that we can skip the traverse search from the dir table to the 
key table.
 * >, this is used for the listStatus scenario; a list-files call can be very 
expensive under the Ozone FS namespace.

The cache introduced here can speed up metadata access, but there are also two 
aspects we need to consider:
 * Cache entry eviction policy: we cannot cache all the dir/file entries.
 * Consistency between the dir cache and the underlying store. A cache entry 
becomes stale when the DB store is updated but the corresponding cache entry is 
not synced. A cache refresh interval can be introduced here: only when a cache 
entry has not been updated for longer than the given refresh interval do we 
refresh it by querying the DB table. Users can set different refresh intervals 
to ensure cache freshness based on their scenarios. They can also disable the 
cache by setting the interval to 0, which means every query goes directly to 
the DB.

The current OM table cache seems not very helpful for the dir cache, so I came 
up with the above thoughts.
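The refresh-interval idea can be sketched as follows. The names are hypothetical (this is not the OM table cache API), and the DB table query is stubbed out with a function; an entry older than the configured interval is reloaded, and an interval of 0 disables caching entirely:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Sketch of a dir cache with a freshness interval; interval 0 means always hit the DB. */
public class RefreshIntervalCache<K, V> {
  private static final class Entry<V> {
    final V value;
    final long loadedAtMillis;
    Entry(V value, long loadedAtMillis) {
      this.value = value;
      this.loadedAtMillis = loadedAtMillis;
    }
  }

  private final Map<K, Entry<V>> cache = new ConcurrentHashMap<>();
  private final long refreshIntervalMillis;   // 0 disables the cache
  private final Function<K, V> dbLookup;      // stand-in for the DB table query

  public RefreshIntervalCache(long refreshIntervalMillis, Function<K, V> dbLookup) {
    this.refreshIntervalMillis = refreshIntervalMillis;
    this.dbLookup = dbLookup;
  }

  public V get(K key) {
    if (refreshIntervalMillis == 0) {
      return dbLookup.apply(key);             // cache disabled: go straight to the DB
    }
    long now = System.currentTimeMillis();
    Entry<V> e = cache.get(key);
    if (e == null || now - e.loadedAtMillis > refreshIntervalMillis) {
      // Entry missing or possibly stale: refresh it from the DB table.
      e = new Entry<>(dbLookup.apply(key), now);
      cache.put(key, e);
    }
    return e.value;
  }
}
```

The trade-off is exactly the one described above: a larger interval means fewer DB lookups but a longer window in which an entry can be stale.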


was (Author: linyiqun):
Some discussion about the latest status of HDDS-2939 that I asked about on the 
mailing list.
 From [~rakeshr]:
{quote}Presently, I am working on the directory cache design and upgrade design.
 These two tasks are very important as the first one would help to *reduce
 the performance penalties on the path traversal*. Later one is to provide
 an efficient way to make a smooth upgrade experience to the users.
{quote}
Here the directory cache is used to avoid the additional lookup overhead. The 
latest design of the directory cache hasn't been attached yet, but some 
thoughts from me:

I think two types of mapping cache will be useful:
 * , like , so that we can skip the traverse search from the dir table to the 
key table.
 * >, this is used for the listStatus scenario; a list-files call can be very 
expensive under the Ozone FS namespace.

The cache introduced here can speed up metadata access, but there are also two 
aspects we need to consider:
 * Cache entry eviction policy: we cannot cache all the dir/file entries.
 * Consistency between the dir cache and the underlying store. A cache entry 
becomes stale when the DB store is updated but the corresponding cache entry is 
not synced. A cache refresh interval can be introduced here: only when a cache 
entry has not been updated for longer than the given refresh interval do we 
refresh it by querying the DB table. Users can set different refresh intervals 
to ensure cache freshness based on their scenarios. They can also disable the 
cache by setting the interval to 0, which means every query goes directly to 
the DB.

> Ozone FS namespace
> --
>
> Key: HDDS-2939
> URL: https://issues.apache.org/jira/browse/HDDS-2939
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>  Components: Ozone Manager
>Reporter: Supratim Deka
>Assignee: Rakesh Radhakrishnan
>Priority: Major
>  Labels: Triaged
> Attachments: Ozone FS Namespace Proposal v1.0.docx
>
>
> Create the structures and metadata layout required to support efficient FS 
> namespace operations in Ozone - operations involving folders/directories 
> required to support the Hadoop compatible Filesystem interface.
> The details are described in the attached document. The work is divided up 
> into sub-tasks as per the task list in the document.






[jira] [Comment Edited] (HDDS-2939) Ozone FS namespace

2020-09-03 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190253#comment-17190253
 ] 

Yiqun Lin edited comment on HDDS-2939 at 9/3/20, 4:04 PM:
--

Some discussion about the latest status of HDDS-2939 that I asked about on the 
mailing list.
 From [~rakeshr]:
{quote}Presently, I am working on the directory cache design and upgrade design.
 These two tasks are very important as the first one would help to *reduce
 the performance penalties on the path traversal*. Later one is to provide
 an efficient way to make a smooth upgrade experience to the users.
{quote}
Here the directory cache is used to avoid the additional lookup overhead. The 
latest design of the directory cache hasn't been attached yet, but some 
thoughts from me:

I think two types of mapping cache will be useful:
 * , like , so that we can skip the traverse search from the dir table to the 
key table.
 * >, this is used for the listStatus scenario; a list-files call can be very 
expensive under the Ozone FS namespace.

The cache introduced here can speed up metadata access, but there are also two 
aspects we need to consider:
 * Cache entry eviction policy: we cannot cache all the dir/file entries.
 * Consistency between the dir cache and the underlying store. A cache entry 
becomes stale when the DB store is updated but the corresponding cache entry is 
not synced. A cache refresh interval can be introduced here: only when a cache 
entry has not been updated for longer than the given refresh interval do we 
refresh it by querying the DB table. Users can set different refresh intervals 
to ensure cache freshness based on their scenarios. They can also disable the 
cache by setting the interval to 0, which means every query goes directly to 
the DB.


was (Author: linyiqun):
Some discussion about the latest status of HDDS-2939 that I asked about on the 
mailing list.
 From [~rakeshr]:
{quote}Presently, I am working on the directory cache design and upgrade design.
 These two tasks are very important as the first one would help to *reduce
 the performance penalties on the path traversal*. Later one is to provide
 an efficient way to make a smooth upgrade experience to the users.
{quote}
Here the directory cache is used to avoid the additional lookup overhead. The 
latest design of the directory cache hasn't been attached yet, but some 
thoughts from me:

I think two types of mapping cache will be useful:
 * , like , so that we can skip the traverse search from the dir table to the 
key table.
 * >, this is used for the listStatus scenario; a list-files call can be very 
expensive under the Ozone FS namespace.

The cache introduced here can speed up metadata access, but there are also two 
aspects we need to consider:
 * Cache entry eviction policy: we cannot cache all the dir/file entries.
 * Consistency between the dir cache and the underlying store. A cache refresh 
interval can be introduced here: only when a cache entry has not been updated 
for longer than the given refresh interval do we refresh it by querying the DB 
table. Users can set different refresh intervals to ensure cache freshness 
based on their scenarios. They can also disable the cache by setting the 
interval to 0, which means every query goes directly to the DB.

> Ozone FS namespace
> --
>
> Key: HDDS-2939
> URL: https://issues.apache.org/jira/browse/HDDS-2939
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>  Components: Ozone Manager
>Reporter: Supratim Deka
>Assignee: Rakesh Radhakrishnan
>Priority: Major
>  Labels: Triaged
> Attachments: Ozone FS Namespace Proposal v1.0.docx
>
>
> Create the structures and metadata layout required to support efficient FS 
> namespace operations in Ozone - operations involving folders/directories 
> required to support the Hadoop compatible Filesystem interface.
> The details are described in the attached document. The work is divided up 
> into sub-tasks as per the task list in the document.






[jira] [Commented] (HDDS-2939) Ozone FS namespace

2020-09-03 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190253#comment-17190253
 ] 

Yiqun Lin commented on HDDS-2939:
-

Some discussion about the latest status of HDDS-2939 that I asked about on the 
mailing list.
 From [~rakeshr]:
{quote}Presently, I am working on the directory cache design and upgrade design.
 These two tasks are very important as the first one would help to *reduce
 the performance penalties on the path traversal*. Later one is to provide
 an efficient way to make a smooth upgrade experience to the users.
{quote}
Here the directory cache is used to avoid the additional lookup overhead. The 
latest design of the directory cache hasn't been attached yet, but some 
thoughts from me:

I think two types of mapping cache will be useful:
 * , like , so that we can skip the traverse search from the dir table to the 
key table.
 * >, this is used for the listStatus scenario; a list-files call can be very 
expensive under the Ozone FS namespace.

The cache introduced here can speed up metadata access, but there are also two 
aspects we need to consider:
 * Cache entry eviction policy: we cannot cache all the dir/file entries.
 * Consistency between the dir cache and the underlying store. A cache refresh 
interval can be introduced here: only when a cache entry has not been updated 
for longer than the given refresh interval do we refresh it by querying the DB 
table. Users can set different refresh intervals to ensure cache freshness 
based on their scenarios. They can also disable the cache by setting the 
interval to 0, which means every query goes directly to the DB.

> Ozone FS namespace
> --
>
> Key: HDDS-2939
> URL: https://issues.apache.org/jira/browse/HDDS-2939
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>  Components: Ozone Manager
>Reporter: Supratim Deka
>Assignee: Rakesh Radhakrishnan
>Priority: Major
>  Labels: Triaged
> Attachments: Ozone FS Namespace Proposal v1.0.docx
>
>
> Create the structures and metadata layout required to support efficient FS 
> namespace operations in Ozone - operations involving folders/directories 
> required to support the Hadoop compatible Filesystem interface.
> The details are described in the attached document. The work is divided up 
> into sub-tasks as per the task list in the document.






[jira] [Created] (HDDS-4166) Documentation index page redirects to the wrong address

2020-08-28 Thread Yiqun Lin (Jira)
Yiqun Lin created HDDS-4166:
---

 Summary: Documentation index page redirects to the wrong address
 Key: HDDS-4166
 URL: https://issues.apache.org/jira/browse/HDDS-4166
 Project: Hadoop Distributed Data Store
  Issue Type: Bug
  Components: documentation
Reporter: Yiqun Lin
 Attachments: image-2020-08-29-10-35-34-633.png

While reading the Chinese doc of Ozone introduced in HDDS-2708, I found a page 
error: the index page redirects to a wrong page.

The steps:
 1. Jump into 
[https://ci-hadoop.apache.org/view/Hadoop%20Ozone/job/ozone-doc-master/lastSuccessfulBuild/artifact/hadoop-hdds/docs/public/index.html]
 2. Click the page switch button.
 3. The wrong page we are redirected to:
 
[https://ci-hadoop.apache.org/view/Hadoop%20Ozone/job/ozone-doc-master/lastSuccessfulBuild/artifact/hadoop-hdds/docs/public/zh/]
 !image-2020-08-29-10-35-34-633.png!

The 'index.html' is missing at the end of the address; this address is actually 
expected to be 
[https://ci-hadoop.apache.org/view/Hadoop%20Ozone/job/ozone-doc-master/lastSuccessfulBuild/artifact/hadoop-hdds/docs/public/zh/index.html]

The same error happens when I switch the index page from the Chinese doc to the 
English doc.
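The fix amounts to appending `index.html` whenever a language-switch link targets a bare directory path. A minimal sketch with a hypothetical helper (the real fix belongs in the doc site's link template, and the URL below is illustrative):

```java
/** Hypothetical helper showing the expected fix: a language-switch link that
 *  targets a directory should end with index.html, not the bare directory. */
public class DocLinkSketch {
  static String normalizeDocLink(String url) {
    // Append index.html only when the link ends at a directory.
    return url.endsWith("/") ? url + "index.html" : url;
  }

  public static void main(String[] args) {
    System.out.println(normalizeDocLink("https://example.org/docs/public/zh/"));
    // -> https://example.org/docs/public/zh/index.html
  }
}
```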







[jira] [Commented] (HDDS-2939) Ozone FS namespace

2020-07-29 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17167289#comment-17167289
 ] 

Yiqun Lin commented on HDDS-2939:
-

{quote}
Along with this feature, one task is planned to provide a migration/conversion 
tool to migrate existing 'KeyTable' data content into the new data format, 
ofcourse we need to add a key generation logic here.
{quote}
Yes, that makes sense to me.

> Ozone FS namespace
> --
>
> Key: HDDS-2939
> URL: https://issues.apache.org/jira/browse/HDDS-2939
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>  Components: Ozone Manager
>Reporter: Supratim Deka
>Assignee: Rakesh Radhakrishnan
>Priority: Major
>  Labels: Triaged
> Attachments: Ozone FS Namespace Proposal v1.0.docx
>
>
> Create the structures and metadata layout required to support efficient FS 
> namespace operations in Ozone - operations involving folders/directories 
> required to support the Hadoop compatible Filesystem interface.
> The details are described in the attached document. The work is divided up 
> into sub-tasks as per the task list in the document.






[jira] [Comment Edited] (HDDS-2939) Ozone FS namespace

2020-07-25 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165155#comment-17165155
 ] 

Yiqun Lin edited comment on HDDS-2939 at 7/26/20, 4:06 AM:
---

Hi [~rakeshr], I took a quick look at the first PR, PR-1230. It reuses the 
PrefixTable to store directory-like info. I have a question: won't this break 
the original key lookup behavior? We don't have an object id assigned for keys 
in the prefix table, but the current logic assumes that each entry key has its 
own objectId in the table. Or maybe we will have a step to generate this object 
id first when users enable this feature on their existing Ozone system.


was (Author: linyiqun):
Hi [~rakeshr], I took a quick look at the first PR, PR-1230. It reuses the 
PrefixTable to store directory-like info. I have a question: won't this break 
the original key lookup behavior? We don't have an object id assigned for keys 
in the prefix table, but the current logic assumes that each entry key has its 
own objectId in the table.

> Ozone FS namespace
> --
>
> Key: HDDS-2939
> URL: https://issues.apache.org/jira/browse/HDDS-2939
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>  Components: Ozone Manager
>Reporter: Supratim Deka
>Assignee: Rakesh Radhakrishnan
>Priority: Major
>  Labels: Triaged
> Attachments: Ozone FS Namespace Proposal v1.0.docx
>
>
> Create the structures and metadata layout required to support efficient FS 
> namespace operations in Ozone - operations involving folders/directories 
> required to support the Hadoop compatible Filesystem interface.
> The details are described in the attached document. The work is divided up 
> into sub-tasks as per the task list in the document.






[jira] [Commented] (HDDS-2939) Ozone FS namespace

2020-07-25 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165155#comment-17165155
 ] 

Yiqun Lin commented on HDDS-2939:
-

Hi [~rakeshr], I took a quick look at the first PR, PR-1230. It reuses the 
PrefixTable to store directory-like info. I have a question: won't this break 
the original key lookup behavior? We don't have an object id assigned for keys 
in the prefix table, but the current logic assumes that each entry key has its 
own objectId in the table.

> Ozone FS namespace
> --
>
> Key: HDDS-2939
> URL: https://issues.apache.org/jira/browse/HDDS-2939
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>  Components: Ozone Manager
>Reporter: Supratim Deka
>Assignee: Rakesh Radhakrishnan
>Priority: Major
>  Labels: Triaged
> Attachments: Ozone FS Namespace Proposal v1.0.docx
>
>
> Create the structures and metadata layout required to support efficient FS 
> namespace operations in Ozone - operations involving folders/directories 
> required to support the Hadoop compatible Filesystem interface.
> The details are described in the attached document. The work is divided up 
> into sub-tasks as per the task list in the document.






[jira] [Commented] (HDDS-3816) Erasure Coding in Apache Hadoop Ozone

2020-07-11 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17156177#comment-17156177
 ] 

Yiqun Lin commented on HDDS-3816:
-

Sorry for the delayed response. Thanks [~umamaheswararao] for the detailed 
explanation. :)

> Erasure Coding in Apache Hadoop Ozone
> -
>
> Key: HDDS-3816
> URL: https://issues.apache.org/jira/browse/HDDS-3816
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>  Components: SCM
>Reporter: Uma Maheswara Rao G
>Priority: Major
> Attachments: Erasure Coding in Apache Hadoop Ozone.pdf
>
>
> We propose to implement Erasure Coding in Apache Hadoop Ozone to provide 
> efficient storage. With EC in place, Ozone can provide same or better 
> tolerance by giving 50% or more  storage space savings. 
> In HDFS project, we already have native codecs(ISAL) and Java codecs 
> implemented, we can leverage the same or similar codec design.
> However, the critical part of EC data layout design is in-progress, we will 
> post the design doc soon.






[jira] [Commented] (HDDS-2939) Ozone FS namespace

2020-07-01 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149781#comment-17149781
 ] 

Yiqun Lin commented on HDDS-2939:
-

Thanks for the detailed explanation, [~maobaolong]. I also agree that a caching 
tier can speed up metadata access operations in Ozone.

> Ozone FS namespace
> --
>
> Key: HDDS-2939
> URL: https://issues.apache.org/jira/browse/HDDS-2939
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>  Components: Ozone Manager
>Reporter: Supratim Deka
>Assignee: Rakesh Radhakrishnan
>Priority: Major
>  Labels: Triaged
> Attachments: Ozone FS Namespace Proposal v1.0.docx
>
>
> Create the structures and metadata layout required to support efficient FS 
> namespace operations in Ozone - operations involving folders/directories 
> required to support the Hadoop compatible Filesystem interface.
> The details are described in the attached document. The work is divided up 
> into sub-tasks as per the task list in the document.






[jira] [Comment Edited] (HDDS-2939) Ozone FS namespace

2020-06-29 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147817#comment-17147817
 ] 

Yiqun Lin edited comment on HDDS-2939 at 6/29/20, 2:43 PM:
---

Hi [~maobaolong], I looked into the approach used in Alluxio, and I am curious 
about one thing: does Alluxio 2.0 still keep all metadata in its master service?

I saw the article 
[https://dzone.com/articles/store-1-billion-files-in-alluxio-20] that describes 
how to store metadata for 1 billion files in Alluxio. 
{quote}The metadata service in Alluxio 2.0 is designed to support at least 1 
billion files with a significantly reduced memory requirement. To achieve this, 
we added support for storing part of the namespace off-heap by RocksDB on disk. 
Recently-accessed file system metadata is stored in memory,...
{quote}
From my understanding of this, only part of the metadata is stored on disk, 
while memory caches recently-accessed data. Does it really store all metadata 
in its master service, and does that match the case of Ozone FS? In Ozone FS, 
we will store all metadata. Or can I understand that Alluxio 2.0 maintains only 
active metadata instead of the whole metadata, where the active metadata is 
updated (activated/deactivated) by users' file-access behavior, so that it can 
support billion-level metadata?

BTW, caching only hot metadata in memory is a good point that Ozone FS can also 
benefit from.


was (Author: linyiqun):
Hi [~maobaolong], I looked into the approach used in Alluxio, and I am curious 
about one thing: does Alluxio 2.0 still keep all metadata in its master service?

I saw the article 
[https://dzone.com/articles/store-1-billion-files-in-alluxio-20] that describes 
how to store metadata for 1 billion files in Alluxio. 
{quote}The metadata service in Alluxio 2.0 is designed to support at least 1 
billion files with a significantly reduced memory requirement. To achieve this, 
we added support for storing part of the namespace off-heap by RocksDB on disk. 
Recently-accessed file system metadata is stored in memory,...
{quote}
From my understanding of this, only part of the metadata is stored on disk, 
while memory caches recently-accessed data. Does it really store all metadata 
in its master service, and does that match the case of Ozone FS? In Ozone FS, 
we will store all metadata.

BTW, caching only hot metadata in memory is a good point that Ozone FS can also 
benefit from.

> Ozone FS namespace
> --
>
> Key: HDDS-2939
> URL: https://issues.apache.org/jira/browse/HDDS-2939
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>  Components: Ozone Manager
>Reporter: Supratim Deka
>Assignee: Rakesh Radhakrishnan
>Priority: Major
>  Labels: Triaged
> Attachments: Ozone FS Namespace Proposal v1.0.docx
>
>
> Create the structures and metadata layout required to support efficient FS 
> namespace operations in Ozone - operations involving folders/directories 
> required to support the Hadoop compatible Filesystem interface.
> The details are described in the attached document. The work is divided up 
> into sub-tasks as per the task list in the document.






[jira] [Commented] (HDDS-2939) Ozone FS namespace

2020-06-29 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147817#comment-17147817
 ] 

Yiqun Lin commented on HDDS-2939:
-

Hi [~maobaolong], I looked into the approach used in Alluxio, and I am curious 
about one thing: does Alluxio 2.0 still keep all metadata in its master service?

I saw the article 
[https://dzone.com/articles/store-1-billion-files-in-alluxio-20] that describes 
how to store metadata for 1 billion files in Alluxio. 
{quote}The metadata service in Alluxio 2.0 is designed to support at least 1 
billion files with a significantly reduced memory requirement. To achieve this, 
we added support for storing part of the namespace off-heap by RocksDB on disk. 
Recently-accessed file system metadata is stored in memory,...
{quote}
From my understanding of this, only part of the metadata is stored on disk, 
while memory caches recently-accessed data. Does it really store all metadata 
in its master service, and does that match the case of Ozone FS? In Ozone FS, 
we will store all metadata.

BTW, caching only hot metadata in memory is a good point that Ozone FS can also 
benefit from.

> Ozone FS namespace
> --
>
> Key: HDDS-2939
> URL: https://issues.apache.org/jira/browse/HDDS-2939
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>  Components: Ozone Manager
>Reporter: Supratim Deka
>Assignee: Rakesh Radhakrishnan
>Priority: Major
>  Labels: Triaged
> Attachments: Ozone FS Namespace Proposal v1.0.docx
>
>
> Create the structures and metadata layout required to support efficient FS 
> namespace operations in Ozone - operations involving folders/directories 
> required to support the Hadoop compatible Filesystem interface.
> The details are described in the attached document. The work is divided up 
> into sub-tasks as per the task list in the document.






[jira] [Commented] (HDDS-3816) Erasure Coding in Apache Hadoop Ozone

2020-06-24 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17143863#comment-17143863
 ] 

Yiqun Lin commented on HDDS-3816:
-

Hi [~umamaheswararao], the design doc looks great. I went through the whole 
design today; some comments from me.

The design doc introduces Container-level and Block-level EC implementations 
and their corresponding advantages/disadvantages, but it doesn't mention which 
is the final choice. Or does that mean we want to implement both of them and 
let users choose whichever way they prefer?

The Container-level option will be easier to implement than the Block-level 
option. But as the design doc also mentions, this option has more impact, for 
example the delete operation impact (we additionally need to implement 
small-container merge), the data recovery cost, and a higher risk of data loss 
when a node crashes. In my personal opinion, the Block-level option is a more 
complete and robust implementation. What do we think of this?

For the read/write performance comparison, Block-level EC will have better 
performance. A block is split across multiple nodes as striped storage, so we 
can read/write the data in parallel. At the Container level, the block data 
structure within one container is actually unchanged; it still stays contiguous 
and only takes a striped form at the container level. So the read/write rate is 
essentially unchanged under Container-level EC: we still need to find the one 
specific container node to read/write a specific block's data.
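The difference can be illustrated with a toy placement rule (a hypothetical RS(3,2)-style layout, not Ozone's actual EC design): under block-level striping, consecutive cells of one block land on different nodes so a reader can fan out, while a contiguous container-level layout keeps every cell of a block on the node holding that block.

```java
/** Toy cell placement for a striped RS(data=3, parity=2) layout. */
public class StripePlacementSketch {
  static final int DATA = 3;
  static final int PARITY = 2;

  /** Striped: cell i of a block goes to node (i mod width), so reads can fan out. */
  static int stripedNode(long cellIndex) {
    return (int) (cellIndex % (DATA + PARITY));
  }

  /** Contiguous: every cell of a block stays on the node holding that block. */
  static int contiguousNode(int nodeOfBlock, long cellIndex) {
    return nodeOfBlock;                 // cellIndex does not matter here
  }

  public static void main(String[] args) {
    for (long cell = 0; cell < 6; cell++) {
      System.out.println("cell " + cell
          + " -> striped node " + stripedNode(cell)
          + ", contiguous node " + contiguousNode(0, cell));
    }
  }
}
```

In the striped case the first five cells hit five different nodes, which is what enables parallel reads and writes; in the contiguous case every cell targets the same node.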
  
 What is the implementation complexity of these two options? For example, can 
we cleanly integrate the current HDFS EC algorithm implementation into Ozone? 
And to support EC, will there be a large refactor of the current read/write 
implementation? 

 

I see the current EC design depends on the abstraction of the storage-class 
implementation. I'm not sure this is an easy thing to do at the beginning of 
the Ozone EC implementation. Storage-class is itself a large feature, I think: 
we define data storage types, policies, and multiple rules to let the system do 
the data transformation automatically and transparently. This is similar to the 
HDFS SSM (smart storage management) feature design in HDFS-7343. I don't mean 
to disagree with storage-class, but I have a concern about treating it as the 
one thing we must implement first.

Please correct me if I am wrong, thanks.

> Erasure Coding in Apache Hadoop Ozone
> -
>
> Key: HDDS-3816
> URL: https://issues.apache.org/jira/browse/HDDS-3816
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>  Components: SCM
>Reporter: Uma Maheswara Rao G
>Priority: Major
> Attachments: Erasure Coding in Apache Hadoop Ozone.pdf
>
>
> We propose to implement Erasure Coding in Apache Hadoop Ozone to provide 
> efficient storage. With EC in place, Ozone can provide same or better 
> tolerance by giving 50% or more  storage space savings. 
> In HDFS project, we already have native codecs(ISAL) and Java codecs 
> implemented, we can leverage the same or similar codec design.
> However, the critical part of EC data layout design is in-progress, we will 
> post the design doc soon.






[jira] [Commented] (HDDS-3755) Storage-class support for Ozone

2020-06-14 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135179#comment-17135179
 ] 

Yiqun Lin commented on HDDS-3755:
-

A very interesting design! The storage-class abstraction makes Ozone data 
storage smarter.

One comment on this:
{quote}Transfer rule: We can define some rule to describe when(condition or 
timer) invoke a convert action. For example, convert files from a storage-class 
to another storage-class when files lives 7 days.
{quote}
For the details of the transfer rule between different storage-classes, I 
think we can have a rule store and an independent service, called a smart 
storage manager. This service can manage storage policy rules passed in by 
admin users. The smart storage service will then read a given transfer rule 
and send SCM requests to perform the corresponding actions.
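The age-based rule quoted from the design doc could be evaluated roughly like
this; the class and rule shape below are purely illustrative, not part of any
actual Ozone API:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical age-based transfer rule: move data from one storage-class
// to another once it has lived longer than a configured threshold.
public class TransferRule {
    final String fromClass;
    final String toClass;
    final Duration minAge;

    TransferRule(String fromClass, String toClass, Duration minAge) {
        this.fromClass = fromClass;
        this.toClass = toClass;
        this.minAge = minAge;
    }

    /** Returns the target storage-class if the rule fires, else null. */
    String evaluate(String currentClass, Instant createTime, Instant now) {
        if (fromClass.equals(currentClass)
            && Duration.between(createTime, now).compareTo(minAge) >= 0) {
            return toClass;
        }
        return null;
    }

    public static void main(String[] args) {
        // "Convert files after they have lived 7 days" (example from the doc).
        TransferRule rule =
            new TransferRule("STANDARD", "REDUCED", Duration.ofDays(7));
        Instant now = Instant.now();
        // File created 8 days ago: rule fires.
        System.out.println(rule.evaluate(
            "STANDARD", now.minus(Duration.ofDays(8)), now));
        // File created 1 day ago: rule does not fire.
        System.out.println(rule.evaluate(
            "STANDARD", now.minus(Duration.ofDays(1)), now));
    }
}
```

The proposed smart storage manager would then scan such rules periodically and
issue the conversion actions through SCM.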

> Storage-class support for Ozone
> ---
>
> Key: HDDS-3755
> URL: https://issues.apache.org/jira/browse/HDDS-3755
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Marton Elek
>Assignee: Marton Elek
>Priority: Major
>
> Use a storage-class as an abstraction which combines replication 
> configuration, container states and transitions. 
> See this thread for the detailed design doc:
>  
> [https://lists.apache.org/thread.html/r1e2a5d5581abe9dd09834305ca65a6807f37bd229a07b8b31bda32ad%40%3Cozone-dev.hadoop.apache.org%3E]
> which is also uploaded to here: 
> https://hackmd.io/4kxufJBOQNaKn7PKFK_6OQ?edit






[jira] [Commented] (HDDS-3698) Ozone Non-Rolling upgrades.

2020-06-09 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129495#comment-17129495
 ] 

Yiqun Lin commented on HDDS-3698:
-

Hi [~avijayan], a very good design. It discusses many upgrade challenges 
across the different components, but I am interested in the details of what 
happens when an upgrade fails and we have to downgrade.

Will we ensure 100% consistency with the state from before the downgrade was 
triggered? I see we introduce the new table TransactionInfoTable and let new 
transactions write into it. So, from my understanding, this is a temporary 
table that stores transactions during the upgrade procedure (before it is 
finalized). Since new-feature operations are not allowed before finalization, 
this table won't store incompatible transaction types. Then, during a 
downgrade, OM won't throw an error when replaying the transaction table, right?

To protect our data before the cluster is finalized, another thing I am 
thinking is that we had better not delete file data directly. We can move 
deleted containers into a trash directory on the Datanode: if a downgrade 
happens, we restore the trash data; once finalized, we delete it. This is the 
approach HDFS currently uses in its upgrades, and it also makes sense for 
Ozone upgrades.
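The trash idea above amounts to replacing a pre-finalize delete with a rename,
and a downgrade with the reverse rename. A minimal sketch, assuming a
hypothetical Datanode-side helper (the paths and class are illustrative, not
the actual Ozone container layout):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: before the upgrade is finalized, "deleting" a container moves its
// directory into a trash directory instead, so a downgrade can restore it.
public class ContainerTrash {
    final Path trashDir;

    ContainerTrash(Path trashDir) throws IOException {
        this.trashDir = Files.createDirectories(trashDir);
    }

    /** Pre-finalize "delete": move the container dir into trash. */
    Path moveToTrash(Path containerDir) throws IOException {
        return Files.move(containerDir,
            trashDir.resolve(containerDir.getFileName()));
    }

    /** On downgrade: restore the container dir to its original parent. */
    Path restore(String containerName, Path targetParent) throws IOException {
        return Files.move(trashDir.resolve(containerName),
            targetParent.resolve(containerName));
    }

    /** On finalize: the trash contents can be removed for real. */
    void purge(String containerName) throws IOException {
        Files.delete(trashDir.resolve(containerName));
    }
}
```

Because both moves are renames within the same volume, they are cheap compared
with copying container data, which is the property the HDFS upgrade mechanism
relies on as well.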

> Ozone Non-Rolling upgrades.
> ---
>
> Key: HDDS-3698
> URL: https://issues.apache.org/jira/browse/HDDS-3698
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>Reporter: Aravindan Vijayan
>Assignee: Aravindan Vijayan
>Priority: Major
> Attachments: Ozone Non-Rolling Upgrades.pdf
>
>
> Support for Non-rolling upgrades in Ozone.






[jira] [Comment Edited] (HDDS-2665) Implement new Ozone Filesystem scheme ofs://

2020-05-05 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099614#comment-17099614
 ] 

Yiqun Lin edited comment on HDDS-2665 at 5/5/20, 7:49 AM:
--

Hi [~smeng], I went through the implementation details of the new ofs scheme. 
It's very similar to the o3fs we currently implement, but it will have better 
performance than the single-bucket approach in o3fs. So I wonder: will we 
replace o3fs with ofs once the ofs scheme is fully implemented? Or will we 
keep both schemes and let users choose which one they prefer to use?


was (Author: linyiqun):
Hi [~smeng], I go though the implementation details of new ofs scheme. It's 
very similar to o3fs we currently implement but will have a better performance 
than single bucket way in o3fs. So I wonder if we will replace o3fs to use ofs 
once ofs schema is fully implemented? Or both keeping these two schemas and 
letting users to choose which one they prefer to use?

> Implement new Ozone Filesystem scheme ofs://
> 
>
> Key: HDDS-2665
> URL: https://issues.apache.org/jira/browse/HDDS-2665
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>Reporter: Siyao Meng
>Assignee: Siyao Meng
>Priority: Major
> Attachments: Design ofs v1.pdf
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Implement a new scheme for Ozone Filesystem where all volumes (and buckets) 
> can be access from a single root.
> Alias: Rooted Ozone Filesystem.






[jira] [Commented] (HDDS-2665) Implement new Ozone Filesystem scheme ofs://

2020-05-05 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099614#comment-17099614
 ] 

Yiqun Lin commented on HDDS-2665:
-

Hi [~smeng], I went through the implementation details of the new ofs scheme. 
It's very similar to the o3fs we currently implement, but it will have better 
performance than the single-bucket approach in o3fs. So I wonder: will we 
replace o3fs with ofs once the ofs scheme is fully implemented? Or will we 
keep both schemes and let users choose which one they prefer to use?

> Implement new Ozone Filesystem scheme ofs://
> 
>
> Key: HDDS-2665
> URL: https://issues.apache.org/jira/browse/HDDS-2665
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>Reporter: Siyao Meng
>Assignee: Siyao Meng
>Priority: Major
> Attachments: Design ofs v1.pdf
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Implement a new scheme for Ozone Filesystem where all volumes (and buckets) 
> can be access from a single root.
> Alias: Rooted Ozone Filesystem.






[jira] [Commented] (HDDS-3241) Invalid container reported to SCM should be deleted

2020-04-01 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072660#comment-17072660
 ] 

Yiqun Lin commented on HDDS-3241:
-

{quote}
Fix me if I am wrong, but in this case the containers are not unknown but 
additional replicas are detected (unless the full container is deleted in the 
mean time).
{quote}
I mean a DN can still contain stale containers that SCM has already deleted.

{quote}
I am not sure if I understood, if some of the containers are valid, but some 
others are invalid, containers can be deleted.
{quote}
If we start up an SCM with completely wrong metadata, I think it almost 
certainly cannot exit safemode. So I assume the deletion behavior for unknown 
containers is safe. But as you mentioned, if only some containers are invalid, 
they can still be deleted.
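The log-versus-delete decision being discussed could be sketched as follows;
the config key and class below are illustrative only and are not the actual
HDDS-3241 patch:

```java
// Illustrative sketch: when SCM receives a report for a container it does not
// know, either just log it (current behavior) or, if explicitly enabled,
// instruct the Datanode to delete the stale replica.
public class UnknownContainerPolicy {
    // Hypothetical config key; disabled by default, as proposed in the PR.
    static final String CONFIG_KEY = "hdds.scm.unknown-container.delete";

    final boolean deleteEnabled;

    UnknownContainerPolicy(boolean deleteEnabled) {
        this.deleteEnabled = deleteEnabled;
    }

    /** Returns the action SCM should take for an unknown container report. */
    String onUnknownContainer(long containerId, String datanode) {
        if (deleteEnabled) {
            return "DELETE container " + containerId + " on " + datanode;
        }
        return "LOG unknown container " + containerId + " from " + datanode;
    }
}
```

Gating the deletion on an off-by-default setting preserves today's behavior
while letting operators opt in to automatic cleanup of stale replicas.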


> Invalid container reported to SCM should be deleted
> ---
>
> Key: HDDS-3241
> URL: https://issues.apache.org/jira/browse/HDDS-3241
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.4.1
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> For an invalid or outdated container reported by a Datanode, 
> ContainerReportHandler in SCM only prints an error log and doesn't 
>  take any action.
> {noformat}
> 2020-03-15 05:19:41,072 ERROR 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received 
> container report for an unknown container 37 from datanode 
> 0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, 
> networkLocation: /dc2/rack1, certSerialId: null}.
> org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container 
> with id #37 not found.
> at 
> org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542)
> at 
> org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484)
> at 
> org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204)
> at 
> org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46)
> at 
> org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 2020-03-15 05:19:41,073 ERROR 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received 
> container report for an unknown container 38 from datanode 
> 0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, 
> networkLocation: /dc2/rack1, certSerialId: null}.
> org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container 
> with id #38 not found.
> at 
> org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542)
> at 
> org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484)
> at 
> org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204)
> at 
> org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46)
> at 
> org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> 

[jira] [Comment Edited] (HDDS-3241) Invalid container reported to SCM should be deleted

2020-03-27 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069198#comment-17069198
 ] 

Yiqun Lin edited comment on HDDS-3241 at 3/28/20, 2:55 AM:
---

Thanks for the comments, [~elek] / [~msingh].

Actually, the current SCM safemode also keeps this behavior safe if we start 
SCM with wrong container/pipeline DB files, which could otherwise lead to a 
large number of containers being deleted. That should not happen, because SCM 
won't exit safemode in the first place: the containers reported by DNs will 
not reach the safemode threshold.

I also mentioned another case: in large clusters, a node is sent out for 
repair and later comes back to the cluster. The SCM deletion behavior can help 
automatically clean up stale container data on that Datanode. This is also a 
common case.

I have updated the PR to make this configurable and disabled by default. 
Please help take a look, thanks.


was (Author: linyiqun):
Thanks for the comments, [~elek] / [~msingh].

Actually current SCM safemode can also protect this behavior once we startup 
SCM with wrong container/pipeline db files. And then leads large containers 
deleted.  This should not happen because SCM won't exit safemode firstly since 
DN containers reported will not reach the safemode threshold anyway.

Also I have mentioned another case that in large clusters, the node sent to 
repair and come back to cluster again. SCM deletion behavior can help 
automation cleanup Datanode stale container datas. This is also one common 
cases.

I have updated the PR to make this configurable and disabled by default. Please 
help have a look, thanks.


[jira] [Commented] (HDDS-3241) Invalid container reported to SCM should be deleted

2020-03-27 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069198#comment-17069198
 ] 

Yiqun Lin commented on HDDS-3241:
-

Thanks for the comments, [~elek] / [~msingh].

Actually, the current SCM safemode also keeps this behavior safe if we start 
SCM with wrong container/pipeline DB files, which could otherwise lead to a 
large number of containers being deleted. That should not happen, because SCM 
won't exit safemode in the first place: the containers reported by DNs will 
not reach the safemode threshold.

I also mentioned another case: in large clusters, a node is sent out for 
repair and later comes back to the cluster. The SCM deletion behavior can help 
automatically clean up stale container data on that Datanode. This is also a 
common case.

I have updated the PR to make this configurable and disabled by default. 
Please help take a look, thanks.


[jira] [Comment Edited] (HDDS-3241) Invalid container reported to SCM should be deleted

2020-03-20 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063440#comment-17063440
 ] 

Yiqun Lin edited comment on HDDS-3241 at 3/20/20, 3:02 PM:
---

Hi [~elek],
 In the HDFS world, invalid blocks reported to the NameNode will be deleted: 
the NameNode replies to the DataNode with block deletion commands. 
 So I think the same should apply to Ozone. Deleting a container may be more 
expensive, since it stores more data, but we could have a setting to control 
this action. By default, we keep the current logic and just log an error; it 
then depends on the users how they want SCM to behave.

For example, I rebuilt my test cluster and still used the same Datanodes, and 
those Datanodes kept reporting stale containers. Yes, I can delete these 
containers manually, but it would be better if SCM could send container 
deletion commands to these Datanodes.


was (Author: linyiqun):
Hi [~elek],
 In HDFS world, invalid blocks reported to NameNode will be deleted. NameNode 
will reply DataNode with block deletion commands. 
 So I think this should be same for Ozone. But maybe deleting container will be 
a more expensive way since it stores more data. But I want to say, we could 
have a setting to control this action. By default, we keep current logic and 
just log an error. This just depends on the users that how they want SCM to do.

For example, I rebuild my test cluster and still uses previous Datanode. And 
these Datanodes keep reporting stale containers. Maybe I can deletion these 
containers manually. This would be better if SCM can help send deletion 
containers to these Datanodes.


[jira] [Commented] (HDDS-3241) Invalid container reported to SCM should be deleted

2020-03-20 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063440#comment-17063440
 ] 

Yiqun Lin commented on HDDS-3241:
-

Hi [~elek],
 In the HDFS world, invalid blocks reported to the NameNode will be deleted: 
the NameNode replies to the DataNode with block deletion commands. 
 So I think the same should apply to Ozone. Deleting a container may be more 
expensive, since it stores more data, but we could have a setting to control 
this action. By default, we keep the current logic and just log an error; it 
then depends on the users how they want SCM to behave.

For example, I rebuilt my test cluster and still used the same Datanodes, and 
those Datanodes kept reporting stale containers. Yes, I can delete these 
containers manually, but it would be better if SCM could send container 
deletion commands to these Datanodes.


[jira] [Updated] (HDDS-3241) Invalid container reported to SCM should be deleted

2020-03-20 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-3241:

Description: 
For an invalid or outdated container reported by a Datanode, 
ContainerReportHandler in SCM only prints an error log and doesn't 
 take any action.

{noformat}
2020-03-15 05:19:41,072 ERROR 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received container 
report for an unknown container 37 from datanode 
0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, 
networkLocation: /dc2/rack1, certSerialId: null}.
org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container with 
id #37 not found.
at 
org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542)
at 
org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188)
at 
org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484)
at 
org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204)
at 
org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85)
at 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126)
at 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97)
at 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46)
at 
org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2020-03-15 05:19:41,073 ERROR 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received container 
report for an unknown container 38 from datanode 
0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, 
networkLocation: /dc2/rack1, certSerialId: null}.
org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container with 
id #38 not found.
at 
org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542)
at 
org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188)
at 
org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484)
at 
org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204)
at 
org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85)
at 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126)
at 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97)
at 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46)
at 
org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
{noformat}

Actually, SCM should inform the Datanode to delete its outdated container. 
Otherwise, the Datanode will keep reporting this invalid container, and the 
dirty container data will remain on the Datanode forever. Sometimes we bring 
back a repaired node that may still store stale data, and we should have a way 
to auto-clean it up.

We could add a setting to control this auto-deletion behavior in case the 
approach is considered a little risky.
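The guarded auto-deletion proposed above can be sketched as a small decision helper: given the set of container IDs known to SCM and an operator-controlled flag, decide whether a reported replica should be scheduled for deletion. This is a minimal illustration only; the class name `UnknownContainerPolicy` and the config flag are hypothetical, not actual Ozone APIs.

```java
import java.util.Set;

/**
 * Minimal sketch of the proposed auto-deletion decision, assuming a
 * hypothetical operator-controlled flag; not the actual Ozone implementation.
 */
public class UnknownContainerPolicy {

  private final boolean autoDeleteEnabled;

  public UnknownContainerPolicy(boolean autoDeleteEnabled) {
    this.autoDeleteEnabled = autoDeleteEnabled;
  }

  /**
   * Returns true when the reported container is unknown to SCM and the
   * operator has opted in to auto-deletion; otherwise the report would only
   * be logged, as today.
   */
  public boolean shouldDelete(long reportedContainerId,
      Set<Long> knownContainerIds) {
    return autoDeleteEnabled && !knownContainerIds.contains(reportedContainerId);
  }

  public static void main(String[] args) {
    Set<Long> known = Set.of(1L, 2L, 3L);
    UnknownContainerPolicy enabled = new UnknownContainerPolicy(true);
    UnknownContainerPolicy disabled = new UnknownContainerPolicy(false);
    // Container 37 is unknown to SCM: deleted only when the flag is on.
    System.out.println(enabled.shouldDelete(37L, known));   // true
    System.out.println(disabled.shouldDelete(37L, known));  // false
    // A known container is never deleted.
    System.out.println(enabled.shouldDelete(2L, known));    // false
  }
}
```

With the flag off, behavior stays exactly as it is now, which keeps the change safe to roll out.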
 

  was:
For an invalid or outdated container reported by a Datanode, 
ContainerReportHandler in SCM only prints an error log and doesn't take any 
action.

{noformat}
2020-03-15 05:19:41,072 ERROR 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received container 
report for an unknown container 37 from datanode 
0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, 
networkLocation: /dc2/rack1, certSerialId: null}.
org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container with 
id #37 not found.
at 
org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542)
at 
org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188)
at 

[jira] [Updated] (HDDS-3241) Invalid container reported to SCM should be deleted

2020-03-20 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-3241:

Description: 
For an invalid or outdated container reported by a Datanode, 
ContainerReportHandler in SCM only prints an error log and doesn't take any 
action.

{noformat}
2020-03-15 05:19:41,072 ERROR 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received container 
report for an unknown container 37 from datanode 
0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, 
networkLocation: /dc2/rack1, certSerialId: null}.
org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container with 
id #37 not found.
at 
org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542)
at 
org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188)
at 
org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484)
at 
org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204)
at 
org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85)
at 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126)
at 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97)
at 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46)
at 
org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2020-03-15 05:19:41,073 ERROR 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received container 
report for an unknown container 38 from datanode 
0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, 
networkLocation: /dc2/rack1, certSerialId: null}.
org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container with 
id #38 not found.
at 
org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542)
at 
org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188)
at 
org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484)
at 
org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204)
at 
org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85)
at 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126)
at 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97)
at 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46)
at 
org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
{noformat}

Actually, SCM should inform the Datanode to delete its outdated container. 
Otherwise, the Datanode will keep reporting this invalid container, and the 
dirty container data will remain on the Datanode forever. Sometimes we bring 
back a repaired node that may still store stale data.

We could add a setting to control this auto-deletion behavior in case the 
approach is considered a little risky.
 

  was:
For an invalid or outdated container reported by a Datanode, 
ContainerReportHandler in SCM only prints an error log and takes no action.

{noformat}
2020-03-15 05:19:41,072 ERROR 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received container 
report for an unknown container 37 from datanode 
0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, 
networkLocation: /dc2/rack1, certSerialId: null}.
org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container with 
id #37 not found.
at 
org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542)
at 
org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188)
at 

[jira] [Updated] (HDDS-3241) Invalid container reported to SCM should be deleted

2020-03-20 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-3241:

Status: Patch Available  (was: Open)

> Invalid container reported to SCM should be deleted
> ---
>
> Key: HDDS-3241
> URL: https://issues.apache.org/jira/browse/HDDS-3241
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.4.1
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> For an invalid or outdated container reported by a Datanode, 
> ContainerReportHandler in SCM only prints an error log and takes no action.
> {noformat}
> 2020-03-15 05:19:41,072 ERROR 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received 
> container report for an unknown container 37 from datanode 
> 0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, 
> networkLocation: /dc2/rack1, certSerialId: null}.
> org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container 
> with id #37 not found.
> at 
> org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542)
> at 
> org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484)
> at 
> org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204)
> at 
> org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46)
> at 
> org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 2020-03-15 05:19:41,073 ERROR 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received 
> container report for an unknown container 38 from datanode 
> 0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, 
> networkLocation: /dc2/rack1, certSerialId: null}.
> org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container 
> with id #38 not found.
> at 
> org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542)
> at 
> org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484)
> at 
> org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204)
> at 
> org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46)
> at 
> org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> {noformat}
> Actually, SCM should inform the Datanode to delete its outdated container. 
> Otherwise, the Datanode will keep reporting this invalid container, and the 
> dirty container data will remain on the Datanode forever. Sometimes we bring 
> back a repaired node that may still store stale data.
> We could add a setting to control this auto-deletion behavior in case the 
> approach is considered a little risky.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: 

[jira] [Updated] (HDDS-3241) Invalid container reported to SCM should be deleted

2020-03-20 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-3241:

Affects Version/s: 0.4.1

> Invalid container reported to SCM should be deleted
> ---
>
> Key: HDDS-3241
> URL: https://issues.apache.org/jira/browse/HDDS-3241
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.4.1
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Major
>
> For an invalid or outdated container reported by a Datanode, 
> ContainerReportHandler in SCM only prints an error log and takes no action.
> {noformat}
> 2020-03-15 05:19:41,072 ERROR 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received 
> container report for an unknown container 37 from datanode 
> 0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, 
> networkLocation: /dc2/rack1, certSerialId: null}.
> org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container 
> with id #37 not found.
> at 
> org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542)
> at 
> org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484)
> at 
> org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204)
> at 
> org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46)
> at 
> org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 2020-03-15 05:19:41,073 ERROR 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received 
> container report for an unknown container 38 from datanode 
> 0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, 
> networkLocation: /dc2/rack1, certSerialId: null}.
> org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container 
> with id #38 not found.
> at 
> org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542)
> at 
> org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484)
> at 
> org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204)
> at 
> org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46)
> at 
> org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> {noformat}
> Actually, SCM should inform the Datanode to delete its outdated container. 
> Otherwise, the Datanode will keep reporting this invalid container, and the 
> dirty container data will remain on the Datanode forever. Sometimes we bring 
> back a repaired node that may still store stale data.
> We could add a setting to control this auto-deletion behavior in case the 
> approach is considered a little risky.
>  






[jira] [Created] (HDDS-3241) Invalid container reported to SCM should be deleted

2020-03-20 Thread Yiqun Lin (Jira)
Yiqun Lin created HDDS-3241:
---

 Summary: Invalid container reported to SCM should be deleted
 Key: HDDS-3241
 URL: https://issues.apache.org/jira/browse/HDDS-3241
 Project: Hadoop Distributed Data Store
  Issue Type: Bug
Reporter: Yiqun Lin
Assignee: Yiqun Lin


For an invalid or outdated container reported by a Datanode, 
ContainerReportHandler in SCM only prints an error log and takes no action.

{noformat}
2020-03-15 05:19:41,072 ERROR 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received container 
report for an unknown container 37 from datanode 
0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, 
networkLocation: /dc2/rack1, certSerialId: null}.
org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container with 
id #37 not found.
at 
org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542)
at 
org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188)
at 
org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484)
at 
org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204)
at 
org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85)
at 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126)
at 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97)
at 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46)
at 
org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2020-03-15 05:19:41,073 ERROR 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Received container 
report for an unknown container 38 from datanode 
0d98dfab-9d34-46c3-93fd-6b64b65ff543{ip: xx.xx.xx.xx, host: lyq-xx.xx.xx.xx, 
networkLocation: /dc2/rack1, certSerialId: null}.
org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: Container with 
id #38 not found.
at 
org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:542)
at 
org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.getContainerInfo(ContainerStateMap.java:188)
at 
org.apache.hadoop.hdds.scm.container.ContainerStateManager.getContainer(ContainerStateManager.java:484)
at 
org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainer(SCMContainerManager.java:204)
at 
org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:85)
at 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:126)
at 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:97)
at 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:46)
at 
org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
{noformat}

Actually, SCM should inform the Datanode to delete its outdated container. 
Otherwise, the Datanode will keep reporting this invalid container, and the 
dirty container data will remain on the Datanode forever. Sometimes we bring 
back a repaired node that may still store stale data.

We could add a setting to control this auto-deletion behavior in case the 
approach is considered a little risky.
 






[jira] [Commented] (HDDS-3180) Datanode fails to start due to confused inconsistent volume state

2020-03-16 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060552#comment-17060552
 ] 

Yiqun Lin commented on HDDS-3180:
-

Thanks [~xyao] for the review and merge.

> Datanode fails to start due to confused inconsistent volume state
> -
>
> Key: HDDS-3180
> URL: https://issues.apache.org/jira/browse/HDDS-3180
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Affects Versions: 0.4.1
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I met an error in my testing Ozone cluster when I restarted a Datanode. From 
> the log, it throws an inconsistent volume state error but without other 
> detailed helpful info:
> {noformat}
> 2020-03-14 02:31:46,204 [main] INFO  (LogAdapter.java:51) - registered 
> UNIX signal handlers for [TERM, HUP, INT]
> 2020-03-14 02:31:46,736 [main] INFO  (HddsDatanodeService.java:204) - 
> HddsDatanodeService host:lyq-xx.xx.xx.xx ip:xx.xx.xx.xx
> 2020-03-14 02:31:46,784 [main] INFO  (HddsVolume.java:177) - Creating 
> Volume: /tmp/hadoop-hdfs/dfs/data/hdds of storage type : DISK and capacity : 
> 20063645696
> 2020-03-14 02:31:46,786 [main] ERROR (MutableVolumeSet.java:202) - Failed 
> to parse the storage location: file:///tmp/hadoop-hdfs/dfs/data
> java.io.IOException: Volume is in an INCONSISTENT state. Skipped loading 
> volume: /tmp/hadoop-hdfs/dfs/data/hdds
> at 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume.initialize(HddsVolume.java:226)
> at 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume.(HddsVolume.java:180)
> at 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume.(HddsVolume.java:71)
> at 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume$Builder.build(HddsVolume.java:158)
> at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.createVolume(MutableVolumeSet.java:336)
> at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.initializeVolumeSet(MutableVolumeSet.java:183)
> at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.(MutableVolumeSet.java:139)
> at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.(MutableVolumeSet.java:111)
> at 
> org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.(OzoneContainer.java:97)
> at 
> org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.(DatanodeStateMachine.java:128)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:235)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:179)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:154)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:78)
> at picocli.CommandLine.execute(CommandLine.java:1173)
> at picocli.CommandLine.access$800(CommandLine.java:141)
> at picocli.CommandLine$RunLast.handle(CommandLine.java:1367)
> at picocli.CommandLine$RunLast.handle(CommandLine.java:1335)
> at 
> picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:1243)
> at picocli.CommandLine.parseWithHandlers(CommandLine.java:1526)
> at picocli.CommandLine.parseWithHandler(CommandLine.java:1465)
> at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:65)
> at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:56)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.main(HddsDatanodeService.java:137)
> 2020-03-14 02:31:46,795 [shutdown-hook-0] INFO  (LogAdapter.java:51) - 
> SHUTDOWN_MSG:
> {noformat}
> Then I looked into the code, and the root cause is that the VERSION file was 
> lost on that node.
> We need to log a key message as well to help users quickly identify the root 
> cause of this.






[jira] [Updated] (HDDS-3180) Datanode fails to start due to confused inconsistent volume state

2020-03-14 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-3180:

Status: Patch Available  (was: Open)

> Datanode fails to start due to confused inconsistent volume state
> -
>
> Key: HDDS-3180
> URL: https://issues.apache.org/jira/browse/HDDS-3180
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Affects Versions: 0.4.1
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I met an error in my testing Ozone cluster when I restarted a Datanode. From 
> the log, it throws an inconsistent volume state error but without other 
> detailed helpful info:
> {noformat}
> 2020-03-14 02:31:46,204 [main] INFO  (LogAdapter.java:51) - registered 
> UNIX signal handlers for [TERM, HUP, INT]
> 2020-03-14 02:31:46,736 [main] INFO  (HddsDatanodeService.java:204) - 
> HddsDatanodeService host:lyq-xx.xx.xx.xx ip:xx.xx.xx.xx
> 2020-03-14 02:31:46,784 [main] INFO  (HddsVolume.java:177) - Creating 
> Volume: /tmp/hadoop-hdfs/dfs/data/hdds of storage type : DISK and capacity : 
> 20063645696
> 2020-03-14 02:31:46,786 [main] ERROR (MutableVolumeSet.java:202) - Failed 
> to parse the storage location: file:///tmp/hadoop-hdfs/dfs/data
> java.io.IOException: Volume is in an INCONSISTENT state. Skipped loading 
> volume: /tmp/hadoop-hdfs/dfs/data/hdds
> at 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume.initialize(HddsVolume.java:226)
> at 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume.(HddsVolume.java:180)
> at 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume.(HddsVolume.java:71)
> at 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume$Builder.build(HddsVolume.java:158)
> at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.createVolume(MutableVolumeSet.java:336)
> at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.initializeVolumeSet(MutableVolumeSet.java:183)
> at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.(MutableVolumeSet.java:139)
> at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.(MutableVolumeSet.java:111)
> at 
> org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.(OzoneContainer.java:97)
> at 
> org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.(DatanodeStateMachine.java:128)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:235)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:179)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:154)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:78)
> at picocli.CommandLine.execute(CommandLine.java:1173)
> at picocli.CommandLine.access$800(CommandLine.java:141)
> at picocli.CommandLine$RunLast.handle(CommandLine.java:1367)
> at picocli.CommandLine$RunLast.handle(CommandLine.java:1335)
> at 
> picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:1243)
> at picocli.CommandLine.parseWithHandlers(CommandLine.java:1526)
> at picocli.CommandLine.parseWithHandler(CommandLine.java:1465)
> at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:65)
> at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:56)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.main(HddsDatanodeService.java:137)
> 2020-03-14 02:31:46,795 [shutdown-hook-0] INFO  (LogAdapter.java:51) - 
> SHUTDOWN_MSG:
> {noformat}
> Then I looked into the code, and the root cause is that the VERSION file was 
> lost on that node.
> We need to log a key message as well to help users quickly identify the root 
> cause of this.






[jira] [Comment Edited] (HDDS-3180) Datanode fails to start due to confused inconsistent volume state

2020-03-14 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059327#comment-17059327
 ] 

Yiqun Lin edited comment on HDDS-3180 at 3/14/20, 12:29 PM:


We need to additionally log the inconsistent state because this state will 
cause the Datanode to fail to start.

A more friendly message, tested locally:
{noformat}
2020-03-14 04:41:27,249 [main] INFO  (HddsVolume.java:177) - Creating 
Volume: /tmp/hadoop-hdfs/dfs/data/hdds of storage type : DISK and capacity : 
9997713408
2020-03-14 04:41:27,250 [main] WARN  (HddsVolume.java:252) - VERSION file 
does not exist in volume /tmp/hadoop-hdfs/dfs/data/hdds, current volume state: 
INCONSISTENT.
2020-03-14 04:41:27,257 [main] ERROR (MutableVolumeSet.java:202) - Failed 
to parse the storage location: file:///tmp/hadoop-hdfs/dfs/data
java.io.IOException: Volume is in an INCONSISTENT state. Skipped loading 
volume: /tmp/hadoop-hdfs/dfs/data/hdds
at 
org.apache.hadoop.ozone.container.common.volume.HddsVolume.initialize(HddsVolume.java:226)
at 
org.apache.hadoop.ozone.container.common.volume.HddsVolume.(HddsVolume.java:180)
at 
org.apache.hadoop.ozone.container.common.volume.HddsVolume.(HddsVolume.java:71)
at 
org.apache.hadoop.ozone.container.common.volume.HddsVolume$Builder.build(HddsVolume.java:158)
at 
org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.createVolume(MutableVolumeSet.java:336)
{noformat}
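The check behind that warning can be sketched as follows: if the VERSION file is missing from the volume directory, mark the volume INCONSISTENT and emit a message naming both the volume and the missing file before initialization fails. The class `VolumeStateCheck` here is a simplified, self-contained stand-in, not the actual HddsVolume code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Simplified stand-in for the HddsVolume VERSION-file check; not the real class. */
public class VolumeStateCheck {

  enum VolumeState { NORMAL, INCONSISTENT }

  /**
   * A volume directory without a VERSION file is INCONSISTENT; the warning
   * names both the volume and the missing file so operators can quickly see
   * why the Datanode refused to load it.
   */
  static VolumeState analyze(Path volumeRoot) {
    if (!Files.exists(volumeRoot.resolve("VERSION"))) {
      System.out.println("WARN VERSION file does not exist in volume "
          + volumeRoot + ", current volume state: INCONSISTENT.");
      return VolumeState.INCONSISTENT;
    }
    return VolumeState.NORMAL;
  }

  public static void main(String[] args) throws IOException {
    Path volume = Files.createTempDirectory("hdds");
    // No VERSION file yet: the volume is reported INCONSISTENT with a warning.
    System.out.println(analyze(volume));
    Files.createFile(volume.resolve("VERSION"));
    // With the VERSION file present, the volume loads normally.
    System.out.println(analyze(volume));
  }
}
```

The point of the change is only the extra WARN line; the failure behavior itself stays the same.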


was (Author: linyiqun):
We need to additionally log the inconsistent state because this state will 
cause the Datanode to fail to start.

> Datanode fails to start due to confused inconsistent volume state
> -
>
> Key: HDDS-3180
> URL: https://issues.apache.org/jira/browse/HDDS-3180
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Affects Versions: 0.4.1
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I meet an error in my testing ozone cluster when I restart datanode. From the 
> log, it throws inconsistent volume state but without other detailed helpful 
> info:
> {noformat}
> 2020-03-14 02:31:46,204 [main] INFO  (LogAdapter.java:51) - registered 
> UNIX signal handlers for [TERM, HUP, INT]
> 2020-03-14 02:31:46,736 [main] INFO  (HddsDatanodeService.java:204) - 
> HddsDatanodeService host:lyq-xx.xx.xx.xx ip:xx.xx.xx.xx
> 2020-03-14 02:31:46,784 [main] INFO  (HddsVolume.java:177) - Creating 
> Volume: /tmp/hadoop-hdfs/dfs/data/hdds of storage type : DISK and capacity : 
> 20063645696
> 2020-03-14 02:31:46,786 [main] ERROR (MutableVolumeSet.java:202) - Failed 
> to parse the storage location: file:///tmp/hadoop-hdfs/dfs/data
> java.io.IOException: Volume is in an INCONSISTENT state. Skipped loading 
> volume: /tmp/hadoop-hdfs/dfs/data/hdds
> at 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume.initialize(HddsVolume.java:226)
> at 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume.<init>(HddsVolume.java:180)
> at 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume.<init>(HddsVolume.java:71)
> at 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume$Builder.build(HddsVolume.java:158)
> at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.createVolume(MutableVolumeSet.java:336)
> at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.initializeVolumeSet(MutableVolumeSet.java:183)
> at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.<init>(MutableVolumeSet.java:139)
> at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.<init>(MutableVolumeSet.java:111)
> at 
> org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.<init>(OzoneContainer.java:97)
> at 
> org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.<init>(DatanodeStateMachine.java:128)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:235)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:179)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:154)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:78)
> at picocli.CommandLine.execute(CommandLine.java:1173)
> at picocli.CommandLine.access$800(CommandLine.java:141)
> at picocli.CommandLine$RunLast.handle(CommandLine.java:1367)
> at picocli.CommandLine$RunLast.handle(CommandLine.java:1335)
> at 
> picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:1243)
> 

[jira] [Commented] (HDDS-3180) Datanode fails to start due to confused inconsistent volume state

2020-03-14 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059327#comment-17059327
 ] 

Yiqun Lin commented on HDDS-3180:
-

We need to additionally log details for the inconsistent state, because this 
state causes the Datanode to fail to start.

> Datanode fails to start due to confused inconsistent volume state
> -
>
> Key: HDDS-3180
> URL: https://issues.apache.org/jira/browse/HDDS-3180
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Affects Versions: 0.4.1
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I hit an error in my test Ozone cluster when I restarted a datanode. The log 
> reports an inconsistent volume state but gives no other helpful detail:
> {noformat}
> 2020-03-14 02:31:46,204 [main] INFO  (LogAdapter.java:51) - registered 
> UNIX signal handlers for [TERM, HUP, INT]
> 2020-03-14 02:31:46,736 [main] INFO  (HddsDatanodeService.java:204) - 
> HddsDatanodeService host:lyq-xx.xx.xx.xx ip:xx.xx.xx.xx
> 2020-03-14 02:31:46,784 [main] INFO  (HddsVolume.java:177) - Creating 
> Volume: /tmp/hadoop-hdfs/dfs/data/hdds of storage type : DISK and capacity : 
> 20063645696
> 2020-03-14 02:31:46,786 [main] ERROR (MutableVolumeSet.java:202) - Failed 
> to parse the storage location: file:///tmp/hadoop-hdfs/dfs/data
> java.io.IOException: Volume is in an INCONSISTENT state. Skipped loading 
> volume: /tmp/hadoop-hdfs/dfs/data/hdds
> at 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume.initialize(HddsVolume.java:226)
> at 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume.<init>(HddsVolume.java:226)
> at 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume.<init>(HddsVolume.java:71)
> at 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume$Builder.build(HddsVolume.java:158)
> at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.createVolume(MutableVolumeSet.java:336)
> at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.initializeVolumeSet(MutableVolumeSet.java:183)
> at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.<init>(MutableVolumeSet.java:139)
> at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.<init>(MutableVolumeSet.java:111)
> at 
> org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.<init>(OzoneContainer.java:97)
> at 
> org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.<init>(DatanodeStateMachine.java:128)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:235)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:179)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:154)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:78)
> at picocli.CommandLine.execute(CommandLine.java:1173)
> at picocli.CommandLine.access$800(CommandLine.java:141)
> at picocli.CommandLine$RunLast.handle(CommandLine.java:1367)
> at picocli.CommandLine$RunLast.handle(CommandLine.java:1335)
> at 
> picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:1243)
> at picocli.CommandLine.parseWithHandlers(CommandLine.java:1526)
> at picocli.CommandLine.parseWithHandler(CommandLine.java:1465)
> at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:65)
> at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:56)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.main(HddsDatanodeService.java:137)
> 2020-03-14 02:31:46,795 [shutdown-hook-0] INFO  (LogAdapter.java:51) - 
> SHUTDOWN_MSG:
> {noformat}
> Looking into the code, the root cause is that the version file was lost on 
> that node.
> We should also log a key message to help users quickly identify the root 
> cause.
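The fix described amounts to detecting the missing version file and naming it in the error. A minimal sketch of that idea, with hypothetical names rather than the actual HddsVolume code:

```java
// Hypothetical sketch: when the VERSION file under the hdds directory is
// missing, report exactly which file is missing instead of a bare
// "INCONSISTENT state" error.
import java.io.File;

public class VolumeStateCheck {

    // Returns a descriptive error message for an inconsistent volume,
    // or null when the volume layout looks healthy.
    static String checkVolume(File hddsRoot) {
        File versionFile = new File(hddsRoot, "VERSION");
        if (hddsRoot.isDirectory() && !versionFile.exists()) {
            return "Volume " + hddsRoot + " is in an INCONSISTENT state: "
                + "version file " + versionFile + " is missing.";
        }
        return null;
    }

    public static void main(String[] args) {
        // Simulate the failure: an hdds directory without its VERSION file.
        File root = new File(System.getProperty("java.io.tmpdir"), "hdds-demo");
        root.mkdirs();
        String err = checkVolume(root);
        System.out.println(err == null ? "volume OK" : err);
    }
}
```

An error message of this shape would have pointed straight at the lost version file instead of requiring a code dive.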



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-3180) Datanode fails to start due to confused inconsistent volume state

2020-03-14 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-3180:

Summary: Datanode fails to start due to confused inconsistent volume state  
(was: Datanode fails to start due to inconsistent volume state without helpful 
error message)

> Datanode fails to start due to confused inconsistent volume state
> -
>
> Key: HDDS-3180
> URL: https://issues.apache.org/jira/browse/HDDS-3180
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Affects Versions: 0.4.1
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Major
>
> I hit an error in my test Ozone cluster when I restarted a datanode. The log 
> reports an inconsistent volume state but gives no other helpful detail:
> {noformat}
> 2020-03-14 02:31:46,204 [main] INFO  (LogAdapter.java:51) - registered 
> UNIX signal handlers for [TERM, HUP, INT]
> 2020-03-14 02:31:46,736 [main] INFO  (HddsDatanodeService.java:204) - 
> HddsDatanodeService host:lyq-xx.xx.xx.xx ip:xx.xx.xx.xx
> 2020-03-14 02:31:46,784 [main] INFO  (HddsVolume.java:177) - Creating 
> Volume: /tmp/hadoop-hdfs/dfs/data/hdds of storage type : DISK and capacity : 
> 20063645696
> 2020-03-14 02:31:46,786 [main] ERROR (MutableVolumeSet.java:202) - Failed 
> to parse the storage location: file:///tmp/hadoop-hdfs/dfs/data
> java.io.IOException: Volume is in an INCONSISTENT state. Skipped loading 
> volume: /tmp/hadoop-hdfs/dfs/data/hdds
> at 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume.initialize(HddsVolume.java:226)
> at 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume.<init>(HddsVolume.java:180)
> at 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume.<init>(HddsVolume.java:71)
> at 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume$Builder.build(HddsVolume.java:158)
> at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.createVolume(MutableVolumeSet.java:336)
> at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.initializeVolumeSet(MutableVolumeSet.java:183)
> at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.<init>(MutableVolumeSet.java:139)
> at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.<init>(MutableVolumeSet.java:111)
> at 
> org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.<init>(OzoneContainer.java:97)
> at 
> org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.<init>(DatanodeStateMachine.java:128)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:235)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:179)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:154)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:78)
> at picocli.CommandLine.execute(CommandLine.java:1173)
> at picocli.CommandLine.access$800(CommandLine.java:141)
> at picocli.CommandLine$RunLast.handle(CommandLine.java:1367)
> at picocli.CommandLine$RunLast.handle(CommandLine.java:1335)
> at 
> picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:1243)
> at picocli.CommandLine.parseWithHandlers(CommandLine.java:1526)
> at picocli.CommandLine.parseWithHandler(CommandLine.java:1465)
> at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:65)
> at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:56)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.main(HddsDatanodeService.java:137)
> 2020-03-14 02:31:46,795 [shutdown-hook-0] INFO  (LogAdapter.java:51) - 
> SHUTDOWN_MSG:
> {noformat}
> Looking into the code, the root cause is that the version file was lost on 
> that node.
> We should also log a key message to help users quickly identify the root 
> cause.






[jira] [Updated] (HDDS-3180) Datanode fails to start due to inconsistent volume state without helpful error message

2020-03-14 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-3180:

Summary: Datanode fails to start due to inconsistent volume state without 
helpful error message  (was: Datanode shutdown due to inconsistent volume state 
without helpful error message)

> Datanode fails to start due to inconsistent volume state without helpful 
> error message
> --
>
> Key: HDDS-3180
> URL: https://issues.apache.org/jira/browse/HDDS-3180
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Affects Versions: 0.4.1
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Major
>
> I hit an error in my test Ozone cluster when I restarted a datanode. The log 
> reports an inconsistent volume state but gives no other helpful detail:
> {noformat}
> 2020-03-14 02:31:46,204 [main] INFO  (LogAdapter.java:51) - registered 
> UNIX signal handlers for [TERM, HUP, INT]
> 2020-03-14 02:31:46,736 [main] INFO  (HddsDatanodeService.java:204) - 
> HddsDatanodeService host:lyq-xx.xx.xx.xx ip:xx.xx.xx.xx
> 2020-03-14 02:31:46,784 [main] INFO  (HddsVolume.java:177) - Creating 
> Volume: /tmp/hadoop-hdfs/dfs/data/hdds of storage type : DISK and capacity : 
> 20063645696
> 2020-03-14 02:31:46,786 [main] ERROR (MutableVolumeSet.java:202) - Failed 
> to parse the storage location: file:///tmp/hadoop-hdfs/dfs/data
> java.io.IOException: Volume is in an INCONSISTENT state. Skipped loading 
> volume: /tmp/hadoop-hdfs/dfs/data/hdds
> at 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume.initialize(HddsVolume.java:226)
> at 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume.<init>(HddsVolume.java:180)
> at 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume.<init>(HddsVolume.java:71)
> at 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume$Builder.build(HddsVolume.java:158)
> at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.createVolume(MutableVolumeSet.java:336)
> at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.initializeVolumeSet(MutableVolumeSet.java:183)
> at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.<init>(MutableVolumeSet.java:139)
> at 
> org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.<init>(MutableVolumeSet.java:111)
> at 
> org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.<init>(OzoneContainer.java:97)
> at 
> org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.<init>(DatanodeStateMachine.java:128)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:235)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:179)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:154)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:78)
> at picocli.CommandLine.execute(CommandLine.java:1173)
> at picocli.CommandLine.access$800(CommandLine.java:141)
> at picocli.CommandLine$RunLast.handle(CommandLine.java:1367)
> at picocli.CommandLine$RunLast.handle(CommandLine.java:1335)
> at 
> picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:1243)
> at picocli.CommandLine.parseWithHandlers(CommandLine.java:1526)
> at picocli.CommandLine.parseWithHandler(CommandLine.java:1465)
> at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:65)
> at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:56)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.main(HddsDatanodeService.java:137)
> 2020-03-14 02:31:46,795 [shutdown-hook-0] INFO  (LogAdapter.java:51) - 
> SHUTDOWN_MSG:
> {noformat}
> Looking into the code, the root cause is that the version file was lost on 
> that node.
> We should also log a key message to help users quickly identify the root 
> cause.






[jira] [Created] (HDDS-3180) Datanode shutdown due to inconsistent volume state without helpful error message

2020-03-14 Thread Yiqun Lin (Jira)
Yiqun Lin created HDDS-3180:
---

 Summary: Datanode shutdown due to inconsistent volume state 
without helpful error message
 Key: HDDS-3180
 URL: https://issues.apache.org/jira/browse/HDDS-3180
 Project: Hadoop Distributed Data Store
  Issue Type: Improvement
Affects Versions: 0.4.1
Reporter: Yiqun Lin
Assignee: Yiqun Lin


I hit an error in my test Ozone cluster when I restarted a datanode. The log 
reports an inconsistent volume state but gives no other helpful detail:
{noformat}
2020-03-14 02:31:46,204 [main] INFO  (LogAdapter.java:51) - registered UNIX 
signal handlers for [TERM, HUP, INT]
2020-03-14 02:31:46,736 [main] INFO  (HddsDatanodeService.java:204) - 
HddsDatanodeService host:lyq-xx.xx.xx.xx ip:xx.xx.xx.xx
2020-03-14 02:31:46,784 [main] INFO  (HddsVolume.java:177) - Creating 
Volume: /tmp/hadoop-hdfs/dfs/data/hdds of storage type : DISK and capacity : 
20063645696
2020-03-14 02:31:46,786 [main] ERROR (MutableVolumeSet.java:202) - Failed 
to parse the storage location: file:///tmp/hadoop-hdfs/dfs/data
java.io.IOException: Volume is in an INCONSISTENT state. Skipped loading 
volume: /tmp/hadoop-hdfs/dfs/data/hdds
at 
org.apache.hadoop.ozone.container.common.volume.HddsVolume.initialize(HddsVolume.java:226)
at 
org.apache.hadoop.ozone.container.common.volume.HddsVolume.<init>(HddsVolume.java:180)
at 
org.apache.hadoop.ozone.container.common.volume.HddsVolume.<init>(HddsVolume.java:71)
at 
org.apache.hadoop.ozone.container.common.volume.HddsVolume$Builder.build(HddsVolume.java:158)
at 
org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.createVolume(MutableVolumeSet.java:336)
at 
org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.initializeVolumeSet(MutableVolumeSet.java:183)
at 
org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.<init>(MutableVolumeSet.java:139)
at 
org.apache.hadoop.ozone.container.common.volume.MutableVolumeSet.<init>(MutableVolumeSet.java:111)
at 
org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.<init>(OzoneContainer.java:97)
at 
org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.<init>(DatanodeStateMachine.java:128)
at 
org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:235)
at 
org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:179)
at 
org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:154)
at 
org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:78)
at picocli.CommandLine.execute(CommandLine.java:1173)
at picocli.CommandLine.access$800(CommandLine.java:141)
at picocli.CommandLine$RunLast.handle(CommandLine.java:1367)
at picocli.CommandLine$RunLast.handle(CommandLine.java:1335)
at 
picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:1243)
at picocli.CommandLine.parseWithHandlers(CommandLine.java:1526)
at picocli.CommandLine.parseWithHandler(CommandLine.java:1465)
at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:65)
at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:56)
at 
org.apache.hadoop.ozone.HddsDatanodeService.main(HddsDatanodeService.java:137)
2020-03-14 02:31:46,795 [shutdown-hook-0] INFO  (LogAdapter.java:51) - 
SHUTDOWN_MSG:
{noformat}

Looking into the code, the root cause is that the version file was lost on 
that node.
We should also log a key message to help users quickly identify the root 
cause.






[jira] [Updated] (HDDS-3111) Add unit test for container replication behavior under different container placement policy

2020-03-01 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-3111:

Status: Patch Available  (was: Open)

> Add unit test for container replication behavior under different container 
> placement policy
> ---
>
> Key: HDDS-3111
> URL: https://issues.apache.org/jira/browse/HDDS-3111
> Project: Hadoop Distributed Data Store
>  Issue Type: Test
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, the ReplicationManager unit tests only cover container state 
> changes, and the container placement policy tests only focus on the policy 
> algorithm.
> We lack an integration test for container replication behavior under 
> different container placement policies, including corner cases such as not 
> enough candidate nodes and fallback cases in the rack awareness policy.
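The fallback corner case mentioned above can be sketched in plain Java. This is a toy model under illustrative assumptions (the Node class and two-pass selection are made up for this sketch), not the actual Ozone placement policy implementation:

```java
// Illustrative sketch of the rack-awareness fallback corner case: pick nodes
// on distinct racks when possible, and fall back to reusing a rack when there
// are not enough candidate racks.
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class RackAwareFallbackDemo {

    static class Node {
        final String name;
        final String rack;
        Node(String name, String rack) { this.name = name; this.rack = rack; }
        @Override public String toString() { return name + "(" + rack + ")"; }
    }

    // Choose 'required' nodes, one per rack when possible, else fall back.
    static List<Node> choose(List<Node> candidates, int required) {
        List<Node> picked = new ArrayList<>();
        Set<String> usedRacks = new LinkedHashSet<>();
        for (Node n : candidates) {                 // pass 1: distinct racks only
            if (picked.size() == required) break;
            if (usedRacks.add(n.rack)) picked.add(n);
        }
        for (Node n : candidates) {                 // pass 2: fallback, racks may repeat
            if (picked.size() == required) break;
            if (!picked.contains(n)) picked.add(n);
        }
        return picked;
    }

    public static void main(String[] args) {
        List<Node> dns = List.of(new Node("dn1", "/rack1"),
            new Node("dn2", "/rack1"), new Node("dn3", "/rack2"));
        // Only two racks exist, so a 3-replica placement must fall back to /rack1.
        System.out.println(choose(dns, 3));
    }
}
```

A test that exercises both the happy path and the fallback path is exactly the kind of integration coverage the issue asks for.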






[jira] [Created] (HDDS-3111) Add unit test for container replication behavior under different container placement policy

2020-03-01 Thread Yiqun Lin (Jira)
Yiqun Lin created HDDS-3111:
---

 Summary: Add unit test for container replication behavior under 
different container placement policy
 Key: HDDS-3111
 URL: https://issues.apache.org/jira/browse/HDDS-3111
 Project: Hadoop Distributed Data Store
  Issue Type: Test
Reporter: Yiqun Lin
Assignee: Yiqun Lin


Currently, the ReplicationManager unit tests only cover container state 
changes, and the container placement policy tests only focus on the policy 
algorithm.
We lack an integration test for container replication behavior under 
different container placement policies, including corner cases such as not 
enough candidate nodes and fallback cases in the rack awareness policy.






[jira] [Comment Edited] (HDDS-3058) OzoneFileSystem should override unsupported set type FileSystem API

2020-02-26 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045595#comment-17045595
 ] 

Yiqun Lin edited comment on HDDS-3058 at 2/26/20 2:49 PM:
--

I verified the change locally; it breaks the current ozone fs put command, 
since the methods above are triggered during a put. I'd like to close this 
JIRA as invalid since I cannot find a better way to solve this temporarily, 
:D.

{quote}
Yiqun Lin, thanks for reporting this. We plan to improve FS API under a 
umbrella JIRA HDDS-3048. Feel free to join us if you have interest.
{quote}
I will take a look for that, thanks for the reference, sammi!


was (Author: linyiqun):
I verified the change locally; it breaks the current ozone fs put command, 
since the methods above are triggered during a put. I'd like to close this 
JIRA as invalid since I cannot find a better way to solve this, :D.

{quote}
Yiqun Lin, thanks for reporting this. We plan to improve FS API under a 
umbrella JIRA HDDS-3048. Feel free to join us if you have interest.
{quote}
I will take a look for that, thanks for the reference, sammi!

> OzoneFileSystem should override unsupported set type FileSystem API
> ---
>
> Key: HDDS-3058
> URL: https://issues.apache.org/jira/browse/HDDS-3058
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Filesystem
>Affects Versions: 0.4.1
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, OzoneFileSystem only implements some commonly used FileSystem 
> APIs; most other APIs are not supported and are simply inherited from the 
> parent FileSystem class by default. However, FileSystem does nothing in some 
> set-type methods, such as setReplication and setOwner.
> {code:java}
>  public void setVerifyChecksum(boolean verifyChecksum) {
> //doesn't do anything
>   }
>   public void setWriteChecksum(boolean writeChecksum) {
> //doesn't do anything
>   }
>   public boolean setReplication(Path src, short replication)
> throws IOException {
> return true;
>   }
>   public void setPermission(Path p, FsPermission permission
>   ) throws IOException {
>   }
>   public void setOwner(Path p, String username, String groupname
>   ) throws IOException {
>   }
>   public void setTimes(Path p, long mtime, long atime
>   ) throws IOException {
>   }
> {code}
> These set-type functions depend on the sub-filesystem implementation. We 
> need to throw an unsupported-operation exception if the sub-filesystem 
> cannot support them. Otherwise, users are confused when they use the hadoop 
> fs -setrep command or call the setReplication API: they see no exception, 
> and the command/API appears to succeed. This happened when I tested 
> OzoneFileSystem via the hadoop fs command.






[jira] [Updated] (HDDS-3058) OzoneFileSystem should override unsupported set type FileSystem API

2020-02-26 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-3058:

Resolution: Invalid
Status: Resolved  (was: Patch Available)

I verified the change locally; it breaks the current ozone fs put command, 
since the methods above are triggered during a put. I'd like to close this 
JIRA as invalid since I cannot find a better way to solve this, :D.

{quote}
Yiqun Lin, thanks for reporting this. We plan to improve FS API under a 
umbrella JIRA HDDS-3048. Feel free to join us if you have interest.
{quote}
I will take a look for that, thanks for the reference, sammi!

> OzoneFileSystem should override unsupported set type FileSystem API
> ---
>
> Key: HDDS-3058
> URL: https://issues.apache.org/jira/browse/HDDS-3058
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Filesystem
>Affects Versions: 0.4.1
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, OzoneFileSystem only implements some commonly used FileSystem 
> APIs; most other APIs are not supported and are simply inherited from the 
> parent FileSystem class by default. However, FileSystem does nothing in some 
> set-type methods, such as setReplication and setOwner.
> {code:java}
>  public void setVerifyChecksum(boolean verifyChecksum) {
> //doesn't do anything
>   }
>   public void setWriteChecksum(boolean writeChecksum) {
> //doesn't do anything
>   }
>   public boolean setReplication(Path src, short replication)
> throws IOException {
> return true;
>   }
>   public void setPermission(Path p, FsPermission permission
>   ) throws IOException {
>   }
>   public void setOwner(Path p, String username, String groupname
>   ) throws IOException {
>   }
>   public void setTimes(Path p, long mtime, long atime
>   ) throws IOException {
>   }
> {code}
> These set-type functions depend on the sub-filesystem implementation. We 
> need to throw an unsupported-operation exception if the sub-filesystem 
> cannot support them. Otherwise, users are confused when they use the hadoop 
> fs -setrep command or call the setReplication API: they see no exception, 
> and the command/API appears to succeed. This happened when I tested 
> OzoneFileSystem via the hadoop fs command.






[jira] [Updated] (HDDS-3070) NPE when stop recon server while recon server was not really started before

2020-02-25 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-3070:

Status: Patch Available  (was: Open)

> NPE when stop recon server while recon server was not really started before
> ---
>
> Key: HDDS-3070
> URL: https://issues.apache.org/jira/browse/HDDS-3070
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Recon
>Affects Versions: 0.4.1
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I met an NPE while testing Ozone. It seems the root cause is that the recon 
> server was never really started, but we still try to stop it.
> {noformat}
> 2020-02-25 20:22:44,296 [Thread-0] ERROR ozone.MiniOzoneClusterImpl 
> (MiniOzoneClusterImpl.java:build(525)) - Exception while shutting down the 
> Recon.
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.ozone.recon.tasks.ReconTaskControllerImpl.stop(ReconTaskControllerImpl.java:237)
>   at 
> org.apache.hadoop.ozone.recon.spi.impl.OzoneManagerServiceProviderImpl.stop(OzoneManagerServiceProviderImpl.java:229)
>   at org.apache.hadoop.ozone.recon.ReconServer.stop(ReconServer.java:132)
>   at 
> org.apache.hadoop.ozone.MiniOzoneClusterImpl.stopRecon(MiniOzoneClusterImpl.java:470)
>   at 
> org.apache.hadoop.ozone.MiniOzoneClusterImpl.access$200(MiniOzoneClusterImpl.java:87)
>   at 
> org.apache.hadoop.ozone.MiniOzoneClusterImpl$Builder.build(MiniOzoneClusterImpl.java:523)
>   at 
> org.apache.hadoop.fs.ozone.TestOzoneFileSystem.testFileSystem(TestOzoneFileSystem.java:72)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
> {noformat}
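A likely shape of the fix is a null guard in stop(), so that only components that were actually started get torn down. The class and field names below are hypothetical stand-ins for ReconServer internals, not the real code:

```java
// Minimal sketch of a null-safe stop(): the task controller field stays null
// unless start() ran, and stop() skips components that were never created.
public class NullSafeStopDemo {

    static class TaskController {
        void stop() { System.out.println("tasks stopped"); }
    }

    static class ReconLikeServer {
        private TaskController taskController;   // null until start() runs

        void start() {
            taskController = new TaskController();
        }

        void stop() {
            if (taskController != null) {        // guard: server may never have started
                taskController.stop();
            }
        }
    }

    public static void main(String[] args) {
        // Stopping a server that was never started must not throw an NPE.
        new ReconLikeServer().stop();
        System.out.println("stop on un-started server is safe");
    }
}
```

The same guard applied to each nullable component in the real stop path would avoid the NullPointerException in the trace above.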






[jira] [Updated] (HDDS-3070) NPE when stop recon server while recon server was not really started before

2020-02-25 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-3070:

Summary: NPE when stop recon server while recon server was not really 
started before  (was: NPE when stop recon server while recon server was not 
really started)

> NPE when stop recon server while recon server was not really started before
> ---
>
> Key: HDDS-3070
> URL: https://issues.apache.org/jira/browse/HDDS-3070
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Recon
>Affects Versions: 0.4.1
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Minor
>
> I met an NPE while testing Ozone. It seems the root cause is that the recon 
> server was never really started, but we still try to stop it.
> {noformat}
> 2020-02-25 20:22:44,296 [Thread-0] ERROR ozone.MiniOzoneClusterImpl 
> (MiniOzoneClusterImpl.java:build(525)) - Exception while shutting down the 
> Recon.
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.ozone.recon.tasks.ReconTaskControllerImpl.stop(ReconTaskControllerImpl.java:237)
>   at 
> org.apache.hadoop.ozone.recon.spi.impl.OzoneManagerServiceProviderImpl.stop(OzoneManagerServiceProviderImpl.java:229)
>   at org.apache.hadoop.ozone.recon.ReconServer.stop(ReconServer.java:132)
>   at 
> org.apache.hadoop.ozone.MiniOzoneClusterImpl.stopRecon(MiniOzoneClusterImpl.java:470)
>   at 
> org.apache.hadoop.ozone.MiniOzoneClusterImpl.access$200(MiniOzoneClusterImpl.java:87)
>   at 
> org.apache.hadoop.ozone.MiniOzoneClusterImpl$Builder.build(MiniOzoneClusterImpl.java:523)
>   at 
> org.apache.hadoop.fs.ozone.TestOzoneFileSystem.testFileSystem(TestOzoneFileSystem.java:72)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
> {noformat}






[jira] [Created] (HDDS-3070) NPE when stop recon server while recon server was not really started

2020-02-25 Thread Yiqun Lin (Jira)
Yiqun Lin created HDDS-3070:
---

 Summary: NPE when stop recon server while recon server was not 
really started
 Key: HDDS-3070
 URL: https://issues.apache.org/jira/browse/HDDS-3070
 Project: Hadoop Distributed Data Store
  Issue Type: Bug
  Components: Ozone Recon
Affects Versions: 0.4.1
Reporter: Yiqun Lin
Assignee: Yiqun Lin


I met an NPE while testing Ozone. It seems the root cause is that the recon 
server was never really started, but we still try to stop it.

{noformat}
2020-02-25 20:22:44,296 [Thread-0] ERROR ozone.MiniOzoneClusterImpl 
(MiniOzoneClusterImpl.java:build(525)) - Exception while shutting down the 
Recon.
java.lang.NullPointerException
at 
org.apache.hadoop.ozone.recon.tasks.ReconTaskControllerImpl.stop(ReconTaskControllerImpl.java:237)
at 
org.apache.hadoop.ozone.recon.spi.impl.OzoneManagerServiceProviderImpl.stop(OzoneManagerServiceProviderImpl.java:229)
at org.apache.hadoop.ozone.recon.ReconServer.stop(ReconServer.java:132)
at 
org.apache.hadoop.ozone.MiniOzoneClusterImpl.stopRecon(MiniOzoneClusterImpl.java:470)
at 
org.apache.hadoop.ozone.MiniOzoneClusterImpl.access$200(MiniOzoneClusterImpl.java:87)
at 
org.apache.hadoop.ozone.MiniOzoneClusterImpl$Builder.build(MiniOzoneClusterImpl.java:523)
at 
org.apache.hadoop.fs.ozone.TestOzoneFileSystem.testFileSystem(TestOzoneFileSystem.java:72)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
{noformat}
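A minimal defensive sketch of the idea (hypothetical stand-in classes, not the actual Recon code): make stop() a no-op when start() never completed, so shutdown paths like MiniOzoneClusterImpl cannot hit an NPE.

```java
// Hypothetical sketch: guard stop() against components that were never
// initialized, instead of dereferencing a null field.
class ReconLifecycleSketch {
    private Object taskController;          // null until start() succeeds
    private volatile boolean started = false;

    public void start() {
        taskController = new Object();      // stands in for real initialization
        started = true;
    }

    // stop() silently returns when the server was never really started
    public void stop() {
        if (!started || taskController == null) {
            return;
        }
        taskController = null;
        started = false;
    }

    public boolean isStarted() {
        return started;
    }
}
```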





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-3058) OzoneFileSystem should override unsupported set type FileSystem API

2020-02-23 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-3058:

Status: Patch Available  (was: Open)

> OzoneFileSystem should override unsupported set type FileSystem API
> ---
>
> Key: HDDS-3058
> URL: https://issues.apache.org/jira/browse/HDDS-3058
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Filesystem
>Affects Versions: 0.4.1
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, OzoneFileSystem only implements some commonly useful FileSystem 
> APIs; most other APIs are not overridden and are inherited from the parent 
> class FileSystem by default. However, FileSystem does nothing in some 
> set-type methods, such as setReplication and setOwner.
> {code:java}
>  public void setVerifyChecksum(boolean verifyChecksum) {
> //doesn't do anything
>   }
>   public void setWriteChecksum(boolean writeChecksum) {
> //doesn't do anything
>   }
>   public boolean setReplication(Path src, short replication)
> throws IOException {
> return true;
>   }
>   public void setPermission(Path p, FsPermission permission
>   ) throws IOException {
>   }
>   public void setOwner(Path p, String username, String groupname
>   ) throws IOException {
>   }
>   public void setTimes(Path p, long mtime, long atime
>   ) throws IOException {
>   }
> {code}
> These set-type functions depend on the sub-filesystem implementation. We need 
> to throw an UnsupportedOperationException if the sub-filesystem cannot 
> support them. Otherwise, users will be confused when they run the hadoop fs 
> -setrep command or call the setReplication API: they see no exception and the 
> command/API appears to execute fine. This happened when I tested 
> OzoneFileSystem via the hadoop fs command.
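A hedged sketch of the proposed fix (method shapes mirror FileSystem's API, but this is not the actual OzoneFileSystem patch): override the silent no-op setters so callers get an explicit error instead of a silent success.

```java
import java.io.IOException;

// Illustrative only: surface unsupported set-type operations explicitly
// rather than inheriting FileSystem's silent no-op defaults.
class UnsupportedSetOpsSketch {
    public boolean setReplication(String src, short replication)
            throws IOException {
        // FileSystem's default silently returns true; fail loudly instead
        throw new UnsupportedOperationException(
            "setReplication is not supported by this filesystem");
    }

    public void setOwner(String p, String username, String groupname)
            throws IOException {
        throw new UnsupportedOperationException(
            "setOwner is not supported by this filesystem");
    }
}
```

With this, `hadoop fs -setrep` would report an error instead of appearing to succeed.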



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-3058) OzoneFileSystem should override unsupported set type FileSystem API

2020-02-23 Thread Yiqun Lin (Jira)
Yiqun Lin created HDDS-3058:
---

 Summary: OzoneFileSystem should override unsupported set type 
FileSystem API
 Key: HDDS-3058
 URL: https://issues.apache.org/jira/browse/HDDS-3058
 Project: Hadoop Distributed Data Store
  Issue Type: Improvement
  Components: Ozone Filesystem
Affects Versions: 0.4.1
Reporter: Yiqun Lin
Assignee: Yiqun Lin


Currently, OzoneFileSystem only implements some commonly useful FileSystem 
APIs; most other APIs are not overridden and are inherited from the parent 
class FileSystem by default. However, FileSystem does nothing in some set-type 
methods, such as setReplication and setOwner. 

{code}
 public void setVerifyChecksum(boolean verifyChecksum) {
//doesn't do anything
  }

  public void setWriteChecksum(boolean writeChecksum) {
//doesn't do anything
  }

  public boolean setReplication(Path src, short replication)
throws IOException {
return true;
  }

  public void setPermission(Path p, FsPermission permission
  ) throws IOException {
  }

  public void setOwner(Path p, String username, String groupname
  ) throws IOException {
  }

  public void setTimes(Path p, long mtime, long atime
  ) throws IOException {
  }
{code}

These set-type functions depend on the sub-filesystem implementation. We need 
to throw an UnsupportedOperationException if the sub-filesystem cannot support 
them. Otherwise, users will be confused when they run the hadoop fs -setrep 
command or call the setReplication API: they see no exception and the command 
appears to execute fine. This happened when I tested OzoneFileSystem via the 
hadoop fs command.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-3058) OzoneFileSystem should override unsupported set type FileSystem API

2020-02-23 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-3058:

Description: 
Currently, OzoneFileSystem only implements some commonly useful FileSystem 
APIs; most other APIs are not overridden and are inherited from the parent 
class FileSystem by default. However, FileSystem does nothing in some set-type 
methods, such as setReplication and setOwner.
{code:java}
 public void setVerifyChecksum(boolean verifyChecksum) {
//doesn't do anything
  }

  public void setWriteChecksum(boolean writeChecksum) {
//doesn't do anything
  }

  public boolean setReplication(Path src, short replication)
throws IOException {
return true;
  }

  public void setPermission(Path p, FsPermission permission
  ) throws IOException {
  }

  public void setOwner(Path p, String username, String groupname
  ) throws IOException {
  }

  public void setTimes(Path p, long mtime, long atime
  ) throws IOException {
  }
{code}
These set-type functions depend on the sub-filesystem implementation. We need 
to throw an UnsupportedOperationException if the sub-filesystem cannot support 
them. Otherwise, users will be confused when they run the hadoop fs -setrep 
command or call the setReplication API: they see no exception and the 
command/API appears to execute fine. This happened when I tested 
OzoneFileSystem via the hadoop fs command.

  was:
Currently, OzoneFileSystem only implements some common useful FileSystem APIs 
and most of other API are not supported and inherited from parent class 
FileSystem by default. However, FileSystem do nothing in some set type method, 
like setReplication, setOwner. 

{code}
 public void setVerifyChecksum(boolean verifyChecksum) {
//doesn't do anything
  }

  public void setWriteChecksum(boolean writeChecksum) {
//doesn't do anything
  }

  public boolean setReplication(Path src, short replication)
throws IOException {
return true;
  }

  public void setPermission(Path p, FsPermission permission
  ) throws IOException {
  }

  public void setOwner(Path p, String username, String groupname
  ) throws IOException {
  }

  public void setTimes(Path p, long mtime, long atime
  ) throws IOException {
  }
{code}

This set type functions depend on the sub-filesystem implementation. We need to 
to throw unsupported exception if sub-filesystem cannot support this. 
Otherwise, it will make users confused to use hadoop fs -setrep command or call 
setReplication api. Users will not see any exception but the command can 
execute fine. This is happened when I tested for the OzoneFileSystem via hadoop 
fs command way.



> OzoneFileSystem should override unsupported set type FileSystem API
> ---
>
> Key: HDDS-3058
> URL: https://issues.apache.org/jira/browse/HDDS-3058
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Filesystem
>Affects Versions: 0.4.1
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Major
>
> Currently, OzoneFileSystem only implements some commonly useful FileSystem 
> APIs; most other APIs are not overridden and are inherited from the parent 
> class FileSystem by default. However, FileSystem does nothing in some 
> set-type methods, such as setReplication and setOwner.
> {code:java}
>  public void setVerifyChecksum(boolean verifyChecksum) {
> //doesn't do anything
>   }
>   public void setWriteChecksum(boolean writeChecksum) {
> //doesn't do anything
>   }
>   public boolean setReplication(Path src, short replication)
> throws IOException {
> return true;
>   }
>   public void setPermission(Path p, FsPermission permission
>   ) throws IOException {
>   }
>   public void setOwner(Path p, String username, String groupname
>   ) throws IOException {
>   }
>   public void setTimes(Path p, long mtime, long atime
>   ) throws IOException {
>   }
> {code}
> These set-type functions depend on the sub-filesystem implementation. We need 
> to throw an UnsupportedOperationException if the sub-filesystem cannot 
> support them. Otherwise, users will be confused when they run the hadoop fs 
> -setrep command or call the setReplication API: they see no exception and the 
> command/API appears to execute fine. This happened when I tested 
> OzoneFileSystem via the hadoop fs command.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-3058) OzoneFileSystem should override unsupported set type FileSystem API

2020-02-23 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-3058:

Issue Type: Bug  (was: Improvement)

> OzoneFileSystem should override unsupported set type FileSystem API
> ---
>
> Key: HDDS-3058
> URL: https://issues.apache.org/jira/browse/HDDS-3058
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Filesystem
>Affects Versions: 0.4.1
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Major
>
> Currently, OzoneFileSystem only implements some commonly useful FileSystem 
> APIs; most other APIs are not overridden and are inherited from the parent 
> class FileSystem by default. However, FileSystem does nothing in some 
> set-type methods, such as setReplication and setOwner. 
> {code}
>  public void setVerifyChecksum(boolean verifyChecksum) {
> //doesn't do anything
>   }
>   public void setWriteChecksum(boolean writeChecksum) {
> //doesn't do anything
>   }
>   public boolean setReplication(Path src, short replication)
> throws IOException {
> return true;
>   }
>   public void setPermission(Path p, FsPermission permission
>   ) throws IOException {
>   }
>   public void setOwner(Path p, String username, String groupname
>   ) throws IOException {
>   }
>   public void setTimes(Path p, long mtime, long atime
>   ) throws IOException {
>   }
> {code}
> These set-type functions depend on the sub-filesystem implementation. We need 
> to throw an UnsupportedOperationException if the sub-filesystem cannot 
> support them. Otherwise, users will be confused when they run the hadoop fs 
> -setrep command or call the setReplication API: they see no exception and the 
> command appears to execute fine. This happened when I tested OzoneFileSystem 
> via the hadoop fs command.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2972) Any container replication error can terminate SCM service

2020-02-01 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-2972:

Description: 
I found that any container replication error thrown in the ReplicationManager 
can terminate the SCM service. Terminating the SCM just because of one 
container replication error is a very expensive behavior.

It's not worth shutting down the SCM. We can handle this more gracefully: catch 
the exception and log a warning with the thrown exception.

The shutdown info:
{noformat}
2020-01-30 08:16:04,705 ERROR 
org.apache.hadoop.hdds.scm.container.ReplicationManager: Exception in 
Replication Monitor Thread.
java.lang.IllegalArgumentException: Affinity node 
/dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology
at 
org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.checkAffinityNode(NetworkTopologyImpl.java:789)
at 
org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.chooseRandom(NetworkTopologyImpl.java:399)
at 
org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseNode(SCMContainerPlacementRackAware.java:249)
at 
org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodes(SCMContainerPlacementRackAware.java:173)
at 
org.apache.hadoop.hdds.scm.container.ReplicationManager.handleUnderReplicatedContainer(ReplicationManager.java:515)
at 
org.apache.hadoop.hdds.scm.container.ReplicationManager.processContainer(ReplicationManager.java:311)
at 
java.util.concurrent.ConcurrentHashMap$KeySetView.forEach(ConcurrentHashMap.java:4649)
at 
java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080)
at 
org.apache.hadoop.hdds.scm.container.ReplicationManager.run(ReplicationManager.java:223)
at java.lang.Thread.run(Thread.java:745)
2020-01-30 08:16:04,730 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
status 1: java.lang.IllegalArgumentException: Affinity node 
/dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology
2020-01-30 08:16:04,734 INFO 
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SHUTDOWN_MSG:
{noformat}

  was:
I found there any container replication error running in ReplicationManager can 
terminates SCM service. It's a very expensive behavior to terminate the SCM 
service just because of one container replication error.

It's not worth to shutdown the SCM. We can be friendly to deal with this, catch 
the exception and print the warn message with thrown exception.

The shutdown info:
{noformat}
2020-01-30 08:16:04,705 ERROR 
org.apache.hadoop.hdds.scm.container.ReplicationManager: Exception in 
Replication Monitor Thread.
java.lang.IllegalArgumentException: Affinity node 
/dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology
at 
org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.checkAffinityNode(NetworkTopologyImpl.java:789)
at 
org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.chooseRandom(NetworkTopologyImpl.java:399)
at 
org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseNode(SCMContainerPlacementRackAware.java:249)
at 
org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodes(SCMContainerPlacementRackAware.java:173)
at 
org.apache.hadoop.hdds.scm.container.ReplicationManager.handleUnderReplicatedContainer(ReplicationManager.java:515)
at 
org.apache.hadoop.hdds.scm.container.ReplicationManager.processContainer(ReplicationManager.java:311)
at 
java.util.concurrent.ConcurrentHashMap$KeySetView.forEach(ConcurrentHashMap.java:4649)
at 
java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080)
at 
org.apache.hadoop.hdds.scm.container.ReplicationManager.run(ReplicationManager.java:223)
at java.lang.Thread.run(Thread.java:745)
2020-01-30 08:16:04,730 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
status 1: java.lang.IllegalArgumentException: Affinity node 
/dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology
2020-01-30 08:16:04,734 INFO 
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SHUTDOWN_MSG:
{noformat}


> Any container replication error can terminate SCM service
> -
>
> Key: HDDS-2972
> URL: https://issues.apache.org/jira/browse/HDDS-2972
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Affects Versions: 0.4.1
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I found there any container replication error 

[jira] [Updated] (HDDS-2972) Any container replication error can terminate SCM service

2020-02-01 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-2972:

Status: Patch Available  (was: Open)

> Any container replication error can terminate SCM service
> -
>
> Key: HDDS-2972
> URL: https://issues.apache.org/jira/browse/HDDS-2972
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Affects Versions: 0.4.1
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I found that any container replication error occurring in the 
> ReplicationManager can terminate the SCM service. Terminating the SCM just 
> because of one container replication error is a very expensive behavior.
> It's not worth shutting down the SCM. We can handle this more gracefully: 
> catch the exception and log a warning with the thrown exception.
> The shutdown info:
> {noformat}
> 2020-01-30 08:16:04,705 ERROR 
> org.apache.hadoop.hdds.scm.container.ReplicationManager: Exception in 
> Replication Monitor Thread.
> java.lang.IllegalArgumentException: Affinity node 
> /dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology
> at 
> org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.checkAffinityNode(NetworkTopologyImpl.java:789)
> at 
> org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.chooseRandom(NetworkTopologyImpl.java:399)
> at 
> org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseNode(SCMContainerPlacementRackAware.java:249)
> at 
> org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodes(SCMContainerPlacementRackAware.java:173)
> at 
> org.apache.hadoop.hdds.scm.container.ReplicationManager.handleUnderReplicatedContainer(ReplicationManager.java:515)
> at 
> org.apache.hadoop.hdds.scm.container.ReplicationManager.processContainer(ReplicationManager.java:311)
> at 
> java.util.concurrent.ConcurrentHashMap$KeySetView.forEach(ConcurrentHashMap.java:4649)
> at 
> java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080)
> at 
> org.apache.hadoop.hdds.scm.container.ReplicationManager.run(ReplicationManager.java:223)
> at java.lang.Thread.run(Thread.java:745)
> 2020-01-30 08:16:04,730 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: java.lang.IllegalArgumentException: Affinity node 
> /dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology
> 2020-01-30 08:16:04,734 INFO 
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: 
> SHUTDOWN_MSG:
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2972) Any container replication error can terminate SCM service

2020-02-01 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-2972:

Summary: Any container replication error can terminate SCM service  (was: 
Any container replication error can terminates SCM service)

> Any container replication error can terminate SCM service
> -
>
> Key: HDDS-2972
> URL: https://issues.apache.org/jira/browse/HDDS-2972
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Affects Versions: 0.4.1
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Major
>
> I found that any container replication error occurring in the 
> ReplicationManager can terminate the SCM service. Terminating the SCM just 
> because of one container replication error is a very expensive behavior.
> It's not worth shutting down the SCM. We can handle this more gracefully: 
> catch the exception and log a warning with the thrown exception.
> The shutdown info:
> {noformat}
> 2020-01-30 08:16:04,705 ERROR 
> org.apache.hadoop.hdds.scm.container.ReplicationManager: Exception in 
> Replication Monitor Thread.
> java.lang.IllegalArgumentException: Affinity node 
> /dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology
> at 
> org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.checkAffinityNode(NetworkTopologyImpl.java:789)
> at 
> org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.chooseRandom(NetworkTopologyImpl.java:399)
> at 
> org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseNode(SCMContainerPlacementRackAware.java:249)
> at 
> org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodes(SCMContainerPlacementRackAware.java:173)
> at 
> org.apache.hadoop.hdds.scm.container.ReplicationManager.handleUnderReplicatedContainer(ReplicationManager.java:515)
> at 
> org.apache.hadoop.hdds.scm.container.ReplicationManager.processContainer(ReplicationManager.java:311)
> at 
> java.util.concurrent.ConcurrentHashMap$KeySetView.forEach(ConcurrentHashMap.java:4649)
> at 
> java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080)
> at 
> org.apache.hadoop.hdds.scm.container.ReplicationManager.run(ReplicationManager.java:223)
> at java.lang.Thread.run(Thread.java:745)
> 2020-01-30 08:16:04,730 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: java.lang.IllegalArgumentException: Affinity node 
> /dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology
> 2020-01-30 08:16:04,734 INFO 
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: 
> SHUTDOWN_MSG:
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-2972) Any container replication error can terminates SCM service

2020-02-01 Thread Yiqun Lin (Jira)
Yiqun Lin created HDDS-2972:
---

 Summary: Any container replication error can terminates SCM service
 Key: HDDS-2972
 URL: https://issues.apache.org/jira/browse/HDDS-2972
 Project: Hadoop Distributed Data Store
  Issue Type: Improvement
  Components: SCM
Affects Versions: 0.4.1
Reporter: Yiqun Lin
Assignee: Yiqun Lin


I found that any container replication error occurring in the 
ReplicationManager can terminate the SCM service. Terminating the SCM just 
because of one container replication error is a very expensive behavior.

It's not worth shutting down the SCM. We can handle this more gracefully: catch 
the exception and log a warning with the thrown exception.

The shutdown info:
{noformat}
2020-01-30 08:16:04,705 ERROR 
org.apache.hadoop.hdds.scm.container.ReplicationManager: Exception in 
Replication Monitor Thread.
java.lang.IllegalArgumentException: Affinity node 
/dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology
at 
org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.checkAffinityNode(NetworkTopologyImpl.java:789)
at 
org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.chooseRandom(NetworkTopologyImpl.java:399)
at 
org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseNode(SCMContainerPlacementRackAware.java:249)
at 
org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodes(SCMContainerPlacementRackAware.java:173)
at 
org.apache.hadoop.hdds.scm.container.ReplicationManager.handleUnderReplicatedContainer(ReplicationManager.java:515)
at 
org.apache.hadoop.hdds.scm.container.ReplicationManager.processContainer(ReplicationManager.java:311)
at 
java.util.concurrent.ConcurrentHashMap$KeySetView.forEach(ConcurrentHashMap.java:4649)
at 
java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080)
at 
org.apache.hadoop.hdds.scm.container.ReplicationManager.run(ReplicationManager.java:223)
at java.lang.Thread.run(Thread.java:745)
2020-01-30 08:16:04,730 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
status 1: java.lang.IllegalArgumentException: Affinity node 
/dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology
2020-01-30 08:16:04,734 INFO 
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SHUTDOWN_MSG:
{noformat}
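A hedged sketch of the proposed handling (illustrative stand-in code, not the actual ReplicationManager): catch per-container failures inside the monitor loop so one bad container logs a warning and is skipped, instead of the exception propagating up and terminating the service.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: a monitor loop that survives individual failures.
class ReplicationMonitorSketch {
    int processed = 0;
    int failed = 0;

    void processContainer(String id) {
        // Simulates a placement failure like the affinity-node error above
        if (id.startsWith("bad")) {
            throw new IllegalArgumentException(
                "Affinity node is not a member of topology");
        }
        processed++;
    }

    void runOnce(List<String> containers) {
        for (String id : containers) {
            try {
                processContainer(id);
            } catch (RuntimeException e) {
                // log-and-continue instead of letting the exception
                // reach ExitUtil and shut down the SCM
                failed++;
            }
        }
    }
}
```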



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDDS-2939) Ozone FS namespace

2020-01-29 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025842#comment-17025842
 ] 

Yiqun Lin edited comment on HDDS-2939 at 1/29/20 12:46 PM:
---

Hi [~sdeka], I am reading this design doc; some comments from me:

For the Filesystem Namespace Operations, the ls (list files/folders) operation 
is also a common operation. But under the current implementation, to list a 
directory we have to traverse the whole directory/file table to look up the 
child files/sub-folders. This is inefficient. I know this lookup scheme can 
greatly reduce the memory used, but it is not friendly to the ls operation.

Do we have any improvement for this? Could we additionally store the child IDs 
for each record in the directory table? That would help us quickly find the 
child files and sub-folders.

 
{quote}Associating a lock with each parent prefix being accessed by an 
operation in the OM, is sufficient to control concurrent operations on the same 
prefix. When the OM starts to process create “/a/b/c/1.txt”, a prefix lock is 
taken for “/a/b/c”...
{quote}
For concurrency control, we create a lock for each parent prefix. There will be 
a large number of lock instances to maintain in OM memory once there are 
millions of directories. The current approach is very fine-grained locking; 
have we considered partitioning the namespace? Divide the whole namespace into 
logical sub-namespaces by prefix key, and give each sub-namespace its own lock. 
This is a compromise between a single global exclusive lock and an 
uncontrollable number of locks that depends on the number of parent prefixes.

Is there a future plan for a way (API or command tool) to convert object keys 
into the Ozone FS namespace? Object store is currently the major use case for 
users, and they may want filesystem-style access to their data without moving 
it.


was (Author: linyiqun):
Hi [~sdeka], I am reading for this design doc, some comments from me:

For the Filesystem Namespace Operations, the ls(list files/folders) operation 
will also a common operation. But under current implementation, for example, 
list a directory, we have to traverse the whole directory/file table to lookup 
the child file/sub-folders.This is an ineffective way. Do we have any other 
improvement for this? Can we additionally store the child ID for each record in 
directory table? That can help us quickly find the child file or child folder.

{quote}
Associating a lock with each parent prefix being accessed by an operation in 
the OM, is sufficient to control concurrent operations on the same prefix. When 
the OM starts to process create “/a/b/c/1.txt”, a prefix lock is taken for 
“/a/b/c”...
{quote}

For the concurrency control, we create the lock for each parent prefix level. 
There will be large number of lock instances to be maintained in OM memory once 
there are millions of directory folders. Current way is so fine-grained lock 
way, have we considered about the partition namespace way? Divided the whole 
namespace into logic sub-namespaces by the prefix key. Then each sub-namespace 
will have its lock. This is a compromise approach than just having a global 
exclusive lock or having uncontrollable number of locks that depended on parent 
prefix's number.

Is there a future plan to have a way(API or command Tool) to convert object key 
to Ozone FS namespace? Because object store is now  the major use case for the 
users. Maybe users want to use a filesystem way to access the data without 
moving their data.



> Ozone FS namespace
> --
>
> Key: HDDS-2939
> URL: https://issues.apache.org/jira/browse/HDDS-2939
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: Ozone Manager
>Reporter: Supratim Deka
>Assignee: Supratim Deka
>Priority: Major
> Attachments: Ozone FS Namespace Proposal v1.0.docx
>
>
> Create the structures and metadata layout required to support efficient FS 
> namespace operations in Ozone - operations involving folders/directories 
> required to support the Hadoop compatible Filesystem interface.
> The details are described in the attached document. The work is divided up 
> into sub-tasks as per the task list in the document.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2939) Ozone FS namespace

2020-01-29 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025842#comment-17025842
 ] 

Yiqun Lin commented on HDDS-2939:
-

Hi [~sdeka], I am reading this design doc; some comments from me:

For the Filesystem Namespace Operations, the ls (list files/folders) operation 
is also a common operation. But under the current implementation, to list a 
directory we have to traverse the whole directory/file table to look up the 
child files/sub-folders. This is inefficient. Do we have any improvement for 
this? Could we additionally store the child IDs for each record in the 
directory table? That would help us quickly find the child files and 
sub-folders.
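A tiny sketch of the "store child IDs per directory" idea (purely illustrative, not OM code): keeping a parent-to-children index lets ls read one entry instead of scanning the whole directory/file table.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: an in-memory parent -> child-ID index for fast listing.
class DirChildIndexSketch {
    private final Map<Long, List<Long>> childrenByParent = new HashMap<>();

    void addChild(long parentId, long childId) {
        childrenByParent.computeIfAbsent(parentId, k -> new ArrayList<>())
                        .add(childId);
    }

    // ls becomes a single map lookup rather than a full-table scan
    List<Long> list(long parentId) {
        return childrenByParent.getOrDefault(parentId,
            Collections.emptyList());
    }
}
```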

{quote}
Associating a lock with each parent prefix being accessed by an operation in 
the OM, is sufficient to control concurrent operations on the same prefix. When 
the OM starts to process create “/a/b/c/1.txt”, a prefix lock is taken for 
“/a/b/c”...
{quote}

For concurrency control, we create a lock for each parent prefix. There will be 
a large number of lock instances to maintain in OM memory once there are 
millions of directories. The current approach is very fine-grained locking; 
have we considered partitioning the namespace? Divide the whole namespace into 
logical sub-namespaces by prefix key, and give each sub-namespace its own lock. 
This is a compromise between a single global exclusive lock and an 
uncontrollable number of locks that depends on the number of parent prefixes.
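One common way to realize this partitioning is lock striping; the sketch below is hypothetical (class and method names are mine, not from the design doc) and maps each parent prefix to one of a fixed number of lock stripes, bounding the lock count regardless of how many directories exist.

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative lock-striping sketch: a bounded pool of locks keyed by prefix.
class PrefixLockStripes {
    private final ReentrantLock[] stripes;

    PrefixLockStripes(int stripeCount) {
        stripes = new ReentrantLock[stripeCount];
        for (int i = 0; i < stripeCount; i++) {
            stripes[i] = new ReentrantLock();
        }
    }

    // The same prefix always maps to the same stripe, so concurrent
    // operations on one prefix serialize while unrelated prefixes
    // usually proceed in parallel.
    ReentrantLock lockFor(String parentPrefix) {
        int idx = Math.floorMod(parentPrefix.hashCode(), stripes.length);
        return stripes[idx];
    }
}
```

Different prefixes can still collide on a stripe; the stripe count trades memory for contention.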

Is there a future plan for a way (API or command tool) to convert object keys 
into the Ozone FS namespace? Object store is currently the major use case for 
users, and they may want filesystem-style access to their data without moving 
it.



> Ozone FS namespace
> --
>
> Key: HDDS-2939
> URL: https://issues.apache.org/jira/browse/HDDS-2939
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: Ozone Manager
>Reporter: Supratim Deka
>Assignee: Supratim Deka
>Priority: Major
> Attachments: Ozone FS Namespace Proposal v1.0.docx
>
>
> Create the structures and metadata layout required to support efficient FS 
> namespace operations in Ozone - operations involving folders/directories 
> required to support the Hadoop compatible Filesystem interface.
> The details are described in the attached document. The work is divided up 
> into sub-tasks as per the task list in the document.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2927) Cache EndPoint tasks instead of creating them all the time in RunningDatanodeState

2020-01-22 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-2927:

Summary: Cache EndPoint tasks instead of creating them all the time in 
RunningDatanodeState  (was: Cache EndPoint tasks instead of creating them all 
the time)

> Cache EndPoint tasks instead of creating them all the time in 
> RunningDatanodeState
> --
>
> Key: HDDS-2927
> URL: https://issues.apache.org/jira/browse/HDDS-2927
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, we create EndPoint tasks all the time. This is inefficient; we 
> could cache these tasks, as the TODO comment suggests.
> {code}
>   //TODO : Cache some of these tasks instead of creating them
>   //all the time.
>   private Callable<EndpointStateMachine.EndPointStates>
>   getEndPointTask(EndpointStateMachine endpoint) {
> switch (endpoint.getState()) {
> case GETVERSION:
>   return new VersionEndpointTask(endpoint, conf, context.getParent()
>   .getContainer());
> case REGISTER:
>   return  RegisterEndpointTask.newBuilder()
>   .setConfig(conf)
>   .setEndpointStateMachine(endpoint)
>   .setContext(context)
>   .setDatanodeDetails(context.getParent().getDatanodeDetails())
>   .setOzoneContainer(context.getParent().getContainer())
>   .build();
> case HEARTBEAT:
>   return HeartbeatEndpointTask.newBuilder()
>   .setConfig(conf)
>   .setEndpointStateMachine(endpoint)
>   .setDatanodeDetails(context.getParent().getDatanodeDetails())
>   .setContext(context)
>   .build();
> case SHUTDOWN:
>   break;
> default:
>   throw new IllegalArgumentException("Illegal Argument.");
>  }
> return null;
>}
> {code}






[jira] [Updated] (HDDS-2927) Cache EndPoint tasks instead of creating them all the time

2020-01-22 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-2927:

Status: Patch Available  (was: Open)

> Cache EndPoint tasks instead of creating them all the time
> --
>
> Key: HDDS-2927
> URL: https://issues.apache.org/jira/browse/HDDS-2927
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, we create EndPoint tasks all the time. This is inefficient; we 
> could cache these tasks, as the TODO comment suggests.
> {code}
>   //TODO : Cache some of these tasks instead of creating them
>   //all the time.
>   private Callable<EndpointStateMachine.EndPointStates>
>   getEndPointTask(EndpointStateMachine endpoint) {
> switch (endpoint.getState()) {
> case GETVERSION:
>   return new VersionEndpointTask(endpoint, conf, context.getParent()
>   .getContainer());
> case REGISTER:
>   return  RegisterEndpointTask.newBuilder()
>   .setConfig(conf)
>   .setEndpointStateMachine(endpoint)
>   .setContext(context)
>   .setDatanodeDetails(context.getParent().getDatanodeDetails())
>   .setOzoneContainer(context.getParent().getContainer())
>   .build();
> case HEARTBEAT:
>   return HeartbeatEndpointTask.newBuilder()
>   .setConfig(conf)
>   .setEndpointStateMachine(endpoint)
>   .setDatanodeDetails(context.getParent().getDatanodeDetails())
>   .setContext(context)
>   .build();
> case SHUTDOWN:
>   break;
> default:
>   throw new IllegalArgumentException("Illegal Argument.");
>  }
> return null;
>}
> {code}






[jira] [Created] (HDDS-2927) Cache EndPoint tasks instead of creating them all the time

2020-01-22 Thread Yiqun Lin (Jira)
Yiqun Lin created HDDS-2927:
---

 Summary: Cache EndPoint tasks instead of creating them all the time
 Key: HDDS-2927
 URL: https://issues.apache.org/jira/browse/HDDS-2927
 Project: Hadoop Distributed Data Store
  Issue Type: Improvement
Reporter: Yiqun Lin
Assignee: Yiqun Lin


Currently, we create EndPoint tasks all the time. This is inefficient; we 
could cache these tasks, as the TODO comment suggests.

{code}
  //TODO : Cache some of these tasks instead of creating them
  //all the time.
  private Callable<EndpointStateMachine.EndPointStates>
  getEndPointTask(EndpointStateMachine endpoint) {
switch (endpoint.getState()) {
case GETVERSION:
  return new VersionEndpointTask(endpoint, conf, context.getParent()
  .getContainer());
case REGISTER:
  return  RegisterEndpointTask.newBuilder()
  .setConfig(conf)
  .setEndpointStateMachine(endpoint)
  .setContext(context)
  .setDatanodeDetails(context.getParent().getDatanodeDetails())
  .setOzoneContainer(context.getParent().getContainer())
  .build();
case HEARTBEAT:
  return HeartbeatEndpointTask.newBuilder()
  .setConfig(conf)
  .setEndpointStateMachine(endpoint)
  .setDatanodeDetails(context.getParent().getDatanodeDetails())
  .setContext(context)
  .build();
case SHUTDOWN:
  break;
default:
  throw new IllegalArgumentException("Illegal Argument.");
 }
return null;
   }
{code}
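The caching the TODO asks for could be sketched roughly as follows. This is a hypothetical simplification ({{EndpointTaskCache}} and its {{State}} enum are invented names); the real tasks take the endpoint/config/context arguments shown above, so those would be captured once at build time:
{code:java}
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.function.Supplier;

// Hypothetical simplification of the caching the TODO asks for: build the
// task for each endpoint state once, keep it in an EnumMap, and reuse it on
// later cycles instead of allocating a new task object every time.
class EndpointTaskCache {
    enum State { GETVERSION, REGISTER, HEARTBEAT }

    private final Map<State, Callable<Void>> cache = new EnumMap<>(State.class);

    // computeIfAbsent invokes the factory only on the first lookup
    Callable<Void> taskFor(State state, Supplier<Callable<Void>> factory) {
        return cache.computeIfAbsent(state, s -> factory.get());
    }

    public static void main(String[] args) {
        EndpointTaskCache tasks = new EndpointTaskCache();
        Callable<Void> first =
            tasks.taskFor(State.HEARTBEAT, () -> () -> null);
        Callable<Void> second =
            tasks.taskFor(State.HEARTBEAT, () -> () -> null);
        System.out.println(first == second); // prints true
    }
}
{code}
Note that caching is only safe if the task objects are reusable, i.e. they do not capture per-invocation state.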






[jira] [Updated] (HDDS-2910) OzoneManager startup failure with throwing unhelpful exception message

2020-01-19 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-2910:

Status: Patch Available  (was: Open)

> OzoneManager startup failure with throwing unhelpful exception message
> --
>
> Key: HDDS-2910
> URL: https://issues.apache.org/jira/browse/HDDS-2910
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Affects Versions: 0.4.1
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Testing the OM HA feature, I updated the HA-specific configurations and 
> then started up the OM service. But the OM did not start up successfully, 
> and when I checked the log I found this error info and no other helpful 
> message.
> {noformat}
> ...
> 2020-01-17 08:57:55,210 [main] INFO   - registered UNIX signal handlers 
> for [TERM, HUP, INT]
> 2020-01-17 08:57:55,846 [main] INFO   - ozone.om.internal.service.id is 
> not defined, falling back to ozone.om.service.ids to find serviceID for 
> OzoneManager if it is HA enabled cluster
> 2020-01-17 08:57:55,872 [main] INFO   - Found matching OM address with 
> OMServiceId: om-service-test, OMNodeId: omNode-1, RPC Address: 
> lyq-m1-xx.xx.xx.xx:9862 and Ratis port: 9872
> 2020-01-17 08:57:55,872 [main] INFO   - Setting configuration key 
> ozone.om.http-address with value of key ozone.om.http-address.omNode-1: 
> lyq-m1-xx.xx.xx.xx:9874
> 2020-01-17 08:57:55,872 [main] INFO   - Setting configuration key 
> ozone.om.https-address with value of key ozone.om.https-address.omNode-1: 
> lyq-m1-xx.xx.xx.xx:9875
> 2020-01-17 08:57:55,872 [main] INFO   - Setting configuration key 
> ozone.om.address with value of key ozone.om.address.omNode-1: 
> lyq-m1-xx.xx.xx.xx:9862
> OM not initialized.
> 2020-01-17 08:57:55,887 [shutdown-hook-0] INFO   - SHUTDOWN_MSG:
> {noformat}
> "OM not initialized" doesn't give me enough info, so I had to check the 
> related logic code. Finally, I found my mistake: I forgot to run the om 
> --init command before starting up the OM.
> We could additionally add a suggestion here that would help users quickly 
> understand the error and how to resolve it.






[jira] [Updated] (HDDS-2910) OzoneManager startup failure with throwing unhelpful exception message

2020-01-19 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-2910:

Description: 
Testing the OM HA feature, I updated the HA-specific configurations and then 
started up the OM service. But the OM did not start up successfully, and when 
I checked the log I found this error info and no other helpful message.

{noformat}
...
2020-01-17 08:57:55,210 [main] INFO   - registered UNIX signal handlers for 
[TERM, HUP, INT]
2020-01-17 08:57:55,846 [main] INFO   - ozone.om.internal.service.id is not 
defined, falling back to ozone.om.service.ids to find serviceID for 
OzoneManager if it is HA enabled cluster
2020-01-17 08:57:55,872 [main] INFO   - Found matching OM address with 
OMServiceId: om-service-test, OMNodeId: omNode-1, RPC Address: 
lyq-m1-xx.xx.xx.xx:9862 and Ratis port: 9872
2020-01-17 08:57:55,872 [main] INFO   - Setting configuration key 
ozone.om.http-address with value of key ozone.om.http-address.omNode-1: 
lyq-m1-xx.xx.xx.xx:9874
2020-01-17 08:57:55,872 [main] INFO   - Setting configuration key 
ozone.om.https-address with value of key ozone.om.https-address.omNode-1: 
lyq-m1-xx.xx.xx.xx:9875
2020-01-17 08:57:55,872 [main] INFO   - Setting configuration key 
ozone.om.address with value of key ozone.om.address.omNode-1: 
lyq-m1-xx.xx.xx.xx:9862
OM not initialized.
2020-01-17 08:57:55,887 [shutdown-hook-0] INFO   - SHUTDOWN_MSG:
{noformat}

"OM not initialized" doesn't give me enough info, so I had to check the 
related logic code. Finally, I found my mistake: I forgot to run the om --init 
command before starting up the OM.

We could additionally add a suggestion here that would help users quickly 
understand the error and how to resolve it.

  was:
Testing the OM HA feature, I updated the HA-specific configurations and then 
started up the OM service. But the OM did not start up successfully, and when 
I checked the log I found this error info and no other helpful message.

{noformat}
...
2020-01-17 08:57:55,210 [main] INFO   - registered UNIX signal handlers for 
[TERM, HUP, INT]
2020-01-17 08:57:55,846 [main] INFO   - ozone.om.internal.service.id is not 
defined, falling back to ozone.om.service.ids to find serviceID for 
OzoneManager if it is HA enabled cluster
2020-01-17 08:57:55,872 [main] INFO   - Found matching OM address with 
OMServiceId: om-service-test, OMNodeId: omNode-1, RPC Address: 
lyq-m1-xx.xx.xx.xx:9862 and Ratis port: 9872
2020-01-17 08:57:55,872 [main] INFO   - Setting configuration key 
ozone.om.http-address with value of key ozone.om.http-address.omNode-1: 
lyq-m1-xx.xx.xx.xx:9874
2020-01-17 08:57:55,872 [main] INFO   - Setting configuration key 
ozone.om.https-address with value of key ozone.om.https-address.omNode-1: 
lyq-m1-xx.xx.xx.xx:9875
2020-01-17 08:57:55,872 [main] INFO   - Setting configuration key 
ozone.om.address with value of key ozone.om.address.omNode-1: 
lyq-m1-xx.xx.xx.xx:9862
OM not initialized.
2020-01-17 08:57:55,887 [shutdown-hook-0] INFO   - SHUTDOWN_MSG:
{noformat}

"OM not initialized" doesn't give me enough info, so I had to check the 
related logic code. Finally, I found my mistake: I forgot to run the om --init 
command before starting up the OM.

We can additionally add a suggestion here that will help users quickly know 
the error and how to fix that.


> OzoneManager startup failure with throwing unhelpful exception message
> --
>
> Key: HDDS-2910
> URL: https://issues.apache.org/jira/browse/HDDS-2910
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Affects Versions: 0.4.1
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Minor
>
> Testing the OM HA feature, I updated the HA-specific configurations and 
> then started up the OM service. But the OM did not start up successfully, 
> and when I checked the log I found this error info and no other helpful 
> message.
> {noformat}
> ...
> 2020-01-17 08:57:55,210 [main] INFO   - registered UNIX signal handlers 
> for [TERM, HUP, INT]
> 2020-01-17 08:57:55,846 [main] INFO   - ozone.om.internal.service.id is 
> not defined, falling back to ozone.om.service.ids to find serviceID for 
> OzoneManager if it is HA enabled cluster
> 2020-01-17 08:57:55,872 [main] INFO   - Found matching OM address with 
> OMServiceId: om-service-test, OMNodeId: omNode-1, RPC Address: 
> lyq-m1-xx.xx.xx.xx:9862 and Ratis port: 9872
> 2020-01-17 08:57:55,872 [main] INFO   - Setting configuration key 
> ozone.om.http-address with value of key ozone.om.http-address.omNode-1: 
> lyq-m1-xx.xx.xx.xx:9874
> 2020-01-17 08:57:55,872 [main] INFO   - Setting configuration key 
> ozone.om.https-address with value of key ozone.om.https-address.omNode-1: 
> 

[jira] [Commented] (HDDS-2031) Choose datanode for pipeline creation based on network topology

2019-12-16 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997406#comment-16997406
 ] 

Yiqun Lin commented on HDDS-2031:
-

I am learning about topology awareness in Ozone. What's the status of this 
JIRA? I think this is a useful change for the feature.

> Choose datanode for pipeline creation based on network topology
> ---
>
> Key: HDDS-2031
> URL: https://issues.apache.org/jira/browse/HDDS-2031
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Sammi Chen
>Assignee: Sammi Chen
>Priority: Major
>
> There are regular heartbeats between datanodes in a pipeline. Choose 
> datanodes based on network topology, to guarantee data reliability and reduce 
> heartbeat network traffic latency.






[jira] [Created] (HDDS-2690) Improve the command usage of audit parser tool

2019-12-07 Thread Yiqun Lin (Jira)
Yiqun Lin created HDDS-2690:
---

 Summary: Improve the command usage of audit parser tool
 Key: HDDS-2690
 URL: https://issues.apache.org/jira/browse/HDDS-2690
 Project: Hadoop Distributed Data Store
  Issue Type: Improvement
  Components: Tools
Affects Versions: 0.4.0
Reporter: Yiqun Lin


I tested ozone-0.4 and found the audit parser tool is not very user-friendly. 
I passed the -h option to get the help usage, and got only these few messages:
{noformat}
Usage: ozone auditparser [-hV] [--verbose] [-D=]... 
 [COMMAND]
Shell parser for Ozone Audit Logs
 Existing or new .db file
  --verboseMore verbose output. Show the stack trace of the errors.
  -D, --set=

  -h, --help   Show this help message and exit.
  -V, --versionPrint version information and exit.
Commands:
  load, l  Load ozone audit log files
  template, t  Execute template query
  query, q Execute custom query
{noformat}
Although it shows three subcommands we can use, I still don't know the 
complete commands to execute, e.g. how to load the audit log into the db, or 
which templates are available.

Then I have to get the detailed usage from the tool doc and come back to 
execute the command, which is not efficient. It would be better to add some of 
the necessary usage info (table structure, available templates, ...) to the 
command help.


