[ 
https://issues.apache.org/jira/browse/HDFS-7343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925928#comment-15925928
 ] 

Wei Zhou commented on HDFS-7343:
--------------------------------

Thanks [~andrew.wang] for these great questions. The answers below are based 
on the Phase 1 design.

{quote}
One possible staging:
{quote}
Agreed, those are indeed the core parts of this phase. I updated the staging in 
the new doc and added some other items.

{quote}
Could you describe how you will satisfy usecases 4 and 5 in more detail?
{quote}
These two cases will not be supported in Phase 1. They are based on the 
assumption that enough metrics have been collected from HDFS.

{quote}
the complete set of NameNode changes required
{quote}
The change mainly includes:
# Implement the metrics collecting logic. Only file access count info needs 
special treatment for Phase 1.
# Provide support/APIs for SSM to get these metrics.
# Integrate RocksDB, mainly used to store metrics data while SSM is crashed or 
rebooting; during these periods metrics data accumulates in the NN for a 
while. The data does not need to be checkpointed into HDFS, though the NN can 
store all of it there for other purposes.

It won't bring large changes to the existing NN logic.

{quote}
The lack of HA means this will be a non-starter
{quote}
Sorry for not making it clear. Hadoop HA is supported in Phase 1.

{quote}
Why are ChangeStoragePolicy and EnforceStoragePolicy separate actions? 
{quote}
I'm trying to make the actions as basic as possible, so that high-level 
operations can be realized by composing these simple actions.
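To illustrate the intended composition, here is a minimal sketch. The split mirrors HDFS's own separation between setting a storage policy (a metadata-only change) and actually moving blocks to match it, but the function names and the tuple encoding are illustrative assumptions, not SSM code:

```python
# Hypothetical sketch: high-level operations composed from basic actions.
# Action names mirror the discussion above; the composition API is assumed.

def change_storage_policy(path, policy):
    """Basic action: record the desired policy on the file (metadata only)."""
    return ("ChangeStoragePolicy", path, policy)

def enforce_storage_policy(path):
    """Basic action: actually move blocks to match the recorded policy."""
    return ("EnforceStoragePolicy", path)

def archive_file(path):
    """High-level operation built by composing the two basic actions."""
    return [change_storage_policy(path, "COLD"), enforce_storage_policy(path)]

print(archive_file("/foo/bar"))
# [('ChangeStoragePolicy', '/foo/bar', 'COLD'), ('EnforceStoragePolicy', '/foo/bar')]
```

Keeping the two steps separate lets a rule change policies in bulk while scheduling the expensive data movement independently.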

{quote}
What is the set of metrics do you plan to collect from HDFS?
{quote}
They are listed in the "Objects" section of the Phase 1 design document; their 
attributes are the metrics planned to be collected.

{quote}
centralized read statistics... Is there a plan to implement this?
{quote}
Yes, we implement one such statistic (file access count) for file-level caching.

{quote}
description of the trigger syntax
{quote}
It is an event-based mechanism: SSM does the rule condition check when the 
corresponding event happens. The event can be any event happening in HDFS, but 
in Phase 1 we are going to support timer events and file-level events like 
{{file create}} through INotify.

{quote}
•       How often does the SSM wake up to check rules?
{quote}
If a rule has some kind of trigger, then SSM will check the rule when the 
corresponding event happens. If the rule specifies no trigger, it will be 
checked periodically; by default, we have a configurable time interval for 
checking these rules. An optimization for time-related rules is to check the 
rule based on the time set. For example, if a rule checks whether a file's 
last modify time exceeds 30 days, then the rule only needs to be checked at 
day granularity.
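The day-level optimization can be sketched as follows; the helper function is an illustrative assumption, not SSM code:

```python
import datetime

def next_check_time(last_mtime, age_threshold, now):
    """For a rule like 'mtime older than 30 days', the condition can only
    become true at last_mtime + threshold, so schedule the next check for
    that moment instead of polling on every short interval."""
    due = last_mtime + age_threshold
    return max(due, now)

now = datetime.datetime(2017, 3, 15)
mtime = datetime.datetime(2017, 3, 1)
t = next_check_time(mtime, datetime.timedelta(days=30), now)
print(t)  # 2017-03-31 00:00:00; no need to re-check before then
```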

{quote}
•       Could you provide a complete list of conditions that are planned?
{quote}
The condition can be any boolean expression. You can use the provided 
metrics/internal variables to set up the expression you need.

{quote}
How do you plan to implement accessCount over a time range?
{quote}
Please refer to the "Rule/Execution flow" section of the Phase 1 design document.
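For readers without the doc at hand, one common way to support accessCount over a time range is bucketed sliding-window aggregation. This sketch is an assumption for illustration and may differ from the mechanism in the design doc:

```python
from collections import deque

class AccessCountWindow:
    """Sketch: per-file access counts over a sliding time window, aggregated
    from fixed-size buckets (e.g. one bucket per poll interval). The
    bucketing scheme here is an assumption, not the documented design."""

    def __init__(self, num_buckets):
        self.buckets = deque(maxlen=num_buckets)  # oldest bucket drops off

    def add_bucket(self, counts):
        """counts: {path: accesses observed during the latest poll interval}."""
        self.buckets.append(counts)

    def access_count(self, path):
        """Total accesses for `path` across the whole window."""
        return sum(b.get(path, 0) for b in self.buckets)

w = AccessCountWindow(num_buckets=3)   # e.g. 3 x 1-minute buckets = 3 minutes
w.add_bucket({"/foo/a": 4})
w.add_bucket({"/foo/a": 7, "/foo/b": 1})
print(w.access_count("/foo/a"))  # 11
```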

{quote}
Any other new metrics or information you plan to add to HDFS as part of this 
work?
{quote}
According to the current Phase 1 design, no metrics will be added to HDFS 
besides file access count.

{quote}
Prefer we use atime or ctime rather than "age", since they're more specific.
{quote}
Yes, we'll use these instead of 'age'. We thought 'age' would be friendlier 
for users who have no idea of the low-level implementation.

{quote}
•       Could you provide a complete definition of the object matching syntax?
{quote}
Listed in the "Objects" section of the Phase 1 design document.

{quote}
•       Do rules support basic boolean operators like AND, OR, NOT for objects 
and conditions?
{quote}
Yes, they are supported.

{quote}
•       Aren't many of these matches going to require listing the complete 
filesystem? Or are you planning to use HDFS inotify?
{quote}
First, the rule itself should be specific instead of overly general.
Second, in SSM we can do many optimizations to decrease the need to list the 
filesystem. For example, if a rule is 'cache files under directory /foo if 
the file access count in 3 mins is larger than 10', then there is no need to 
list all the files under /foo; only the files that have been accessed need to 
be taken into account.
We also use HDFS inotify to track filesystem changes and update them in the 
SSM local info library, which also reduces the need to list the filesystem 
for the latest state.
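The first optimization can be sketched like this; the event names and the API surface are illustrative assumptions, not the actual HDFS inotify event types:

```python
class AccessTracker:
    """Sketch of the optimization above: instead of listing every file under
    a directory, keep only the files that events report as accessed, and
    evaluate the rule over that set. Event kinds are illustrative."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.counts = {}  # path -> access count in the current interval

    def on_event(self, kind, path):
        if kind == "access" and path.startswith(self.prefix):
            self.counts[path] = self.counts.get(path, 0) + 1
        elif kind == "delete":
            self.counts.pop(path, None)  # keep the local info in sync

    def candidates(self, threshold):
        """Only these files can possibly satisfy 'access count > threshold',
        so no full filesystem listing is needed."""
        return [p for p, c in self.counts.items() if c > threshold]

tracker = AccessTracker("/foo/")
for _ in range(11):
    tracker.on_event("access", "/foo/hot")
tracker.on_event("access", "/foo/cold")
print(tracker.candidates(10))  # ['/foo/hot']
```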

{quote}
•       The "cache" action is underspecified, what cache pool is used?
{quote}
For simplicity of demonstration, we use just 'cache' (a default pool is used). 
Parameters have to be specified in order to customize the behavior.

{quote}
•       Can actions happen concurrently? Is there a way of limiting concurrency?
{quote}
Yes, actions can happen concurrently, but you can also specify a maximum 
number of concurrent instances for each kind of action, for example, limiting 
the "EnforceStoragePolicy" action for performance considerations.
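A per-action-type cap like this is commonly implemented with a counting semaphore; the sketch below is illustrative, with assumed limit values and action names:

```python
import threading

# Sketch: cap concurrent instances per action type with a semaphore.
# The limits and action names are illustrative assumptions.
LIMITS = {"EnforceStoragePolicy": 2, "cache": 8}
_sems = {name: threading.BoundedSemaphore(n) for name, n in LIMITS.items()}

def run_action(name, fn):
    """Block until a slot for this action type is free, then run it."""
    with _sems[name]:
        return fn()

results = []
threads = [threading.Thread(target=run_action,
                            args=("cache", lambda: results.append("done")))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # 4
```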

{quote}
•       Can you run multiple actions in a rule? Is there a syntax for defining 
"functions"?
{quote}
Yes, multiple actions are supported at the syntax level. Currently, no syntax 
is provided for defining a 'function', as I did not find any scenario for 
this in Phase 1.

{quote}
•       Are there substitutions that can be used to reference the filename
{quote}
Yes, there should be cases that need to reference it. I added the support to 
the doc, thanks!

{quote}
•       Same for DN objects, the diskbalancer needs the DN host:port.
{quote}
For a datanode that fulfills the rule conditions, SSM knows its 'host:port' 
and can pass it to the command. Of course, this automatic parameter feeding 
feature is only supported for predefined commands.
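The parameter feeding could look like simple template substitution. The `$host`/`$port` template syntax below is an assumption for illustration; the diskbalancer command itself is real, but how SSM actually builds it is not specified here:

```python
from string import Template

# Sketch of automatic parameter feeding: for a DN that matched the rule,
# SSM fills its host:port into a predefined command template.
def build_command(template, dn):
    return Template(template).substitute(host=dn["host"], port=dn["port"])

dn = {"host": "dn1.example.com", "port": 50010}
cmd = build_command("hdfs diskbalancer -plan $host:$port", dn)
print(cmd)  # hdfs diskbalancer -plan dn1.example.com:50010
```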

Operational questions:
{quote}
audit log for actions taken by the SSM?
{quote}
Yes, we keep logs for rules; users can query this information from SSM.

{quote}
•       Is there a way to see when each action started, stopped, and its status?
{quote}
Yes, all this information can be displayed in the web UI.

{quote}
•       How are errors and logs from actions exposed?
{quote}
Action errors and logs are stored inside SSM. For example, when the cache API 
is called to cache a file, the return value/exception is stored in SSM. Users 
can query this information.

{quote}
•       What metrics are exposed by the SSM?
{quote}
For Phase 1, mainly rule-related metrics are exported: statistical info, 
state info and logging data for each rule, plus the metrics needed for rule 
condition checks. The data SSM polls from the NN is stored for history 
tracking and other purposes.

{quote}
•       Why are there configuration options to enable individual actions? Isn't 
this behavior already defined by the rules file?
{quote}
It's removed from the latest doc. It aimed to provide a global method, above 
the rules, to control whether actions take effect; it was mainly for 
debugging and emulation purposes.

{quote}
•       Why does the SSM need a "dfs.ssm.enabled" config? Is there a usecase 
for having an SSM service started, but not enabled?
{quote}
For example, to provide the feature of dynamically enabling/disabling SSM 
through the web interface, we need the SSM web server to keep working.

{quote}
•       Is the rules file dynamically refreshable?
{quote}
I'm not clear about the meaning of 'dynamically refreshable'. Do you mean that 
after the rule file has been modified, SSM can automatically detect the change 
and reload the rule file? If so, in the current design we tend not to support 
this kind of direct modification; users have to resubmit the modified rule to 
SSM through the defined interfaces.

{quote}
•       What do we do if the rules file is malformed? What do we do if there 
are conflicting rules or multiple matches?
{quote}
Grammar errors will be found when a rule is submitted to SSM, and the 
submission fails if it has such errors.
After rules have been accepted, SSM can analyze the executed actions to find 
potentially conflicting rules or multiple matches and give feedback to users 
through the web UI. For example, if during one time interval a file is cached 
by one rule and moved to SSD by another rule, then there may be a potential 
rule conflict.
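That analysis can be sketched as grouping executed actions by target file within an interval; which action pairs count as conflicting is an illustrative assumption here:

```python
from collections import defaultdict

# Sketch of after-the-fact conflict analysis. The set of conflicting action
# pairs is an assumption for illustration, not part of the design doc.
CONFLICTING = {frozenset(["cache", "EnforceStoragePolicy"])}

def find_conflicts(executed):
    """executed: list of (rule_id, action, path) seen in one time interval.
    Returns files touched by action combinations considered conflicting."""
    by_path = defaultdict(set)
    for rule_id, action, path in executed:
        by_path[path].add((rule_id, action))
    conflicts = []
    for path, entries in by_path.items():
        actions = {a for _, a in entries}
        if any(pair <= actions for pair in CONFLICTING):
            conflicts.append((path, sorted(actions)))
    return conflicts

log = [("r1", "cache", "/foo/x"), ("r2", "EnforceStoragePolicy", "/foo/x")]
print(find_conflicts(log))  # [('/foo/x', ['EnforceStoragePolicy', 'cache'])]
```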

{quote}
•       dfs.ssm.msg.datanode.interval is described as the polling interval for 
the NN, typo?
{quote}
Sorry, it's a typo. I have changed it into "dfs.ssm.poll.namenode.interval".

{quote}
•       What happens if multiple SSMs are accidentally started?
{quote}
We will implement a mechanism to prevent this kind of thing from happening: 
during startup, SSM checks whether another instance is already running, and 
exits if so.
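A typical way to implement such a check is an exclusive, non-blocking lock on a well-known file. This sketch assumes a POSIX system and an illustrative lock-file path:

```python
import fcntl
import os

# Sketch of a single-instance startup check. The lock-file path is an
# illustrative assumption; the real mechanism is not specified in the doc.
def acquire_single_instance_lock(path="/tmp/ssm.lock"):
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None  # another instance holds the lock; caller should exit
    os.write(fd, str(os.getpid()).encode())
    return fd  # keep the fd open for the lifetime of the process

lock = acquire_single_instance_lock()
print("started" if lock else "another instance is running; exiting")
```

The lock is released automatically by the OS if the process crashes, which avoids the stale-pid-file problem of naive implementations.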

> HDFS smart storage management
> -----------------------------
>
>                 Key: HDFS-7343
>                 URL: https://issues.apache.org/jira/browse/HDFS-7343
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Kai Zheng
>            Assignee: Wei Zhou
>         Attachments: HDFSSmartStorageManagement-General-20170315.pdf, 
> HDFS-Smart-Storage-Management.pdf, 
> HDFSSmartStorageManagement-Phase1-20170315.pdf, 
> HDFS-Smart-Storage-Management-update.pdf, move.jpg
>
>
> As discussed in HDFS-7285, it would be better to have a comprehensive and 
> flexible storage policy engine considering file attributes, metadata, data 
> temperature, storage type, EC codec, available hardware capabilities, 
> user/application preference and etc.
> Modified the title for re-purpose.
> We'd extend this effort some bit and aim to work on a comprehensive solution 
> to provide smart storage management service in order for convenient, 
> intelligent and effective utilizing of erasure coding or replicas, HDFS cache 
> facility, HSM offering, and all kinds of tools (balancer, mover, disk 
> balancer and so on) in a large cluster.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
