[
https://issues.apache.org/jira/browse/HDFS-7343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925928#comment-15925928
]
Wei Zhou commented on HDFS-7343:
--------------------------------
Thanks [~andrew.wang] for these great questions. The answers below are based
on the Phase 1 design.
{quote}
One possible staging:
{quote}
Agreed, those are indeed the core parts of this phase. I updated the new doc
accordingly and added some other items.
{quote}
Could you describe how you will satisfy usecases 4 and 5 in more detail?
{quote}
These two cases will not be supported in Phase 1. They assume that enough
metrics have been collected from HDFS.
{quote}
the complete set of NameNode changes required
{quote}
The change mainly includes:
# Implement the metrics-collecting logic. Only file access count info needs
special treatment for Phase 1.
# Provide an API for SSM to fetch these metrics.
# Integrate a RocksDB store. It is mainly used to hold metrics data while SSM
is down after a crash or during a reboot; in those cases metrics data
accumulates in the NN for a while. This data does not need to be checkpointed
into HDFS, though the NN can store all of it there for other purposes.
Overall it won't require large changes to the existing NN logic.
{quote}
The lack of HA means this will be a non-starter
{quote}
Sorry for not making it clear. Hadoop HA is supported in Phase 1.
{quote}
Why are ChangeStoragePolicy and EnforceStoragePolicy separate actions?
{quote}
I'm trying to keep the actions as basic as possible, so that high-level
operations can be realized by composing these simple actions.
{quote}
What is the set of metrics do you plan to collect from HDFS?
{quote}
They are listed in the "Objects" section of the Phase 1 design document; the
attributes there are the metrics we plan to collect.
{quote}
centralized read statistics... Is there a plan to implement this?
{quote}
Yes, we implement a statistic (file access count) for the file-level cache.
{quote}
description of the trigger syntax
{quote}
It is an event-based mechanism: SSM performs the rule condition check when the
corresponding event happens. In principle the event can be any event in HDFS,
but in Phase 1 we are going to support timer events and file-level events like
{{file create}} through INotify.
{quote}
• How often does the SSM wake up to check rules?
{quote}
If a rule has a trigger of some kind, SSM checks the rule when the
corresponding event happens. If the rule specifies no trigger, it is checked
periodically; by default we use a configurable time interval for these checks.
An optimization for time-related rules is to derive the check schedule from
the time condition itself: for example, if a rule checks whether a file's
last-modified time exceeds 30 days, the rule only needs to be checked at day
granularity.
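The day-level optimization above could be sketched roughly as follows; the
default interval and all names are assumptions for illustration, not values
from the design doc:

```java
import java.time.Duration;

// Hypothetical sketch: derive a rule's check period from its time condition
// instead of always using the default polling interval.
public class CheckScheduler {
    static final Duration DEFAULT_INTERVAL = Duration.ofMinutes(5); // assumed default

    // For a condition like "mtime age > threshold", checking more often than
    // the coarsest unit of the threshold gains nothing: a 30-day age rule
    // only needs day-level checks.
    static Duration checkInterval(Duration ageThreshold) {
        if (ageThreshold == null) return DEFAULT_INTERVAL; // no time condition
        if (ageThreshold.toDays() >= 1) return Duration.ofDays(1);
        if (ageThreshold.toHours() >= 1) return Duration.ofHours(1);
        return DEFAULT_INTERVAL;
    }
}
```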
{quote}
• Could you provide a complete list of conditions that are planned?
{quote}
The condition can be any boolean expression. You can combine the provided
metrics and internal variables to build the expression you need.
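For illustration, boolean expressions over metrics (including AND, OR, NOT)
can be modeled as predicate composition; the metric names below are made up
and are not the SSM rule grammar:

```java
import java.util.Map;
import java.util.function.Predicate;

// Sketch only: rule conditions as composable predicates over a metrics snapshot.
public class Conditions {
    // "metric > value" as a reusable atomic condition
    static Predicate<Map<String, Long>> metricGt(String name, long value) {
        return metrics -> metrics.getOrDefault(name, 0L) > value;
    }

    // (accessCount > 10) AND (length > 1000000)
    static Predicate<Map<String, Long>> hotAndLarge() {
        return metricGt("accessCount", 10).and(metricGt("length", 1_000_000));
    }

    // NOT (accessCount > 10)
    static Predicate<Map<String, Long>> cold() {
        return metricGt("accessCount", 10).negate();
    }
}
```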
{quote}
How do you plan to implement accessCount over a time range?
{quote}
Please refer to the "Rule/Execution flow" section in the Phase 1 design document.
{quote}
Any other new metrics or information you plan to add to HDFS as part of this
work?
{quote}
According to the current Phase 1 design, no metrics besides file access count
will be added to HDFS.
{quote}
Prefer we use atime or ctime rather than "age", since they're more specific.
{quote}
Yes, we'll use these instead of 'age'. We had thought 'age' would be friendlier
for users who have no idea of the low-level implementation.
{quote}
• Could you provide a complete definition of the object matching syntax?
{quote}
They are listed in the "Objects" section of the Phase 1 design document.
{quote}
• Do rules support basic boolean operators like AND, OR, NOT for objects
and conditions?
{quote}
Yes, they are supported.
{quote}
• Aren't many of these matches going to require listing the complete
filesystem? Or are you planning to use HDFS inotify?
{quote}
First, the rule itself should be specific rather than overly general.
Second, SSM can apply many optimizations to reduce the need to list the
filesystem. For example, if a rule is 'cache files under directory /foo if the
file access count in 3 mins is larger than 10', there is no need to list all
the files under /foo; only the files that have actually been accessed need to
be considered.
We also use HDFS inotify to track filesystem changes and keep the SSM local
info store up to date, which further reduces the need to list the filesystem
for the latest state.
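The access-count optimization can be sketched as follows (illustrative only,
not the SSM implementation): keep sliding-window access timestamps only for
files that were actually touched, and evaluate the rule against that small set
instead of a full directory listing.

```java
import java.util.*;

// Hypothetical sketch: a sliding-window access counter. Untouched files have
// no entry here, so they never need to be listed or examined.
public class AccessCounter {
    private final long windowMillis;
    private final Map<String, Deque<Long>> accesses = new HashMap<>();

    AccessCounter(long windowMillis) { this.windowMillis = windowMillis; }

    void recordAccess(String path, long timeMillis) {
        accesses.computeIfAbsent(path, p -> new ArrayDeque<>()).addLast(timeMillis);
    }

    // Files under `dir` whose access count within the window exceeds `threshold`.
    List<String> hotFiles(String dir, long threshold, long nowMillis) {
        List<String> hot = new ArrayList<>();
        for (Map.Entry<String, Deque<Long>> e : accesses.entrySet()) {
            if (!e.getKey().startsWith(dir)) continue;
            Deque<Long> times = e.getValue();
            while (!times.isEmpty() && times.peekFirst() < nowMillis - windowMillis)
                times.pollFirst(); // drop accesses outside the window
            if (times.size() > threshold) hot.add(e.getKey());
        }
        return hot;
    }
}
```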
{quote}
• The "cache" action is underspecified, what cache pool is used?
{quote}
For simplicity of demonstration, we use just 'cache' (a default pool is used).
Parameters can be specified to customize the behavior.
{quote}
• Can actions happen concurrently? Is there a way of limiting concurrency?
{quote}
Yes, actions can run concurrently, and you can also specify a maximum number
of concurrent instances for each kind of action, for example limiting the
"EnforceStoragePolicy" action for performance reasons.
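A per-action-type limit of this kind could be enforced with counting
semaphores; the class name and limits below are hypothetical, not taken from
the SSM design:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Semaphore;

// Sketch of per-action-type concurrency limiting.
public class ActionLimiter {
    private final Map<String, Semaphore> permits = new HashMap<>();

    ActionLimiter(Map<String, Integer> limits) {
        limits.forEach((action, n) -> permits.put(action, new Semaphore(n)));
    }

    // Returns true if another instance of `action` may start now.
    boolean tryStart(String action) {
        Semaphore s = permits.get(action);
        return s == null || s.tryAcquire(); // unlimited if no limit configured
    }

    void finished(String action) {
        Semaphore s = permits.get(action);
        if (s != null) s.release();
    }
}
```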
{quote}
• Can you run multiple actions in a rule? Is there a syntax for defining
"functions"?
{quote}
Yes, multiple actions are supported at the syntax level. Currently no syntax
is provided for defining a 'function', as I have not found any scenario
requiring it in Phase 1.
{quote}
• Are there substitutions that can be used to reference the filename
{quote}
Yes, there will be cases that need to reference it. I added this support to
the doc, thanks!
{quote}
• Same for DN objects, the diskbalancer needs the DN host:port.
{quote}
For the datanode that fulfills the rule conditions, SSM knows its 'host:port'
and can pass it to the command. Of course, this automatic parameter feeding
feature is only supported for predefined commands.
Operational questions:
{quote}
audit log for actions taken by the SSM?
{quote}
Yes, we keep logs for rules; users can query this information from SSM.
{quote}
• Is there a way to see when each action started, stopped, and its status?
{quote}
Yes, all this information can be displayed in the web UI.
{quote}
• How are errors and logs from actions exposed?
{quote}
Action errors and logs are stored inside SSM. For example, when the cache API
is called to cache a file, the return value/exception is stored in SSM. Users
can query this information.
{quote}
• What metrics are exposed by the SSM?
{quote}
For Phase 1, mainly rule-related metrics are exported: statistical info, state
info, and logging data for each rule, plus the metrics needed for rule
condition checks. The data SSM polls from the NN is stored for history
tracking and other purposes.
{quote}
• Why are there configuration options to enable individual actions? Isn't
this behavior already defined by the rules file?
{quote}
It's removed from the latest doc. It aimed to provide a global switch, above
the rule level, to control whether actions take effect; it was mainly for
debugging and emulation purposes.
{quote}
• Why does the SSM need a "dfs.ssm.enabled" config? Is there a usecase
for having an SSM service started, but not enabled?
{quote}
For example, to support dynamically enabling/disabling SSM through the web
interface, we need the SSM web server to keep running.
{quote}
• Is the rules file dynamically refreshable?
{quote}
I'm not clear on the meaning of 'dynamically refreshable'. Do you mean that
after the rule file is modified, SSM automatically detects the change and
reloads the file? If so, in the current design we tend not to support this
kind of direct modification; users have to resubmit the modified rule to SSM
through the defined interfaces.
{quote}
• What do we do if the rules file is malformed? What do we do if there
are conflicting rules or multiple matches?
{quote}
Grammar errors are detected when a rule is submitted to SSM, and the
submission fails if the rule contains such errors.
After a rule is accepted, SSM can analyze the executed actions to find
potentially conflicting rules or multiple matches and give feedback to users
through the web UI. For example, if within a time interval a file is cached by
one rule and moved to SSD by another, that may indicate a rule conflict.
{quote}
• dfs.ssm.msg.datanode.interval is described as the polling interval for
the NN, typo?
{quote}
Sorry, it's a typo. I have changed it into "dfs.ssm.poll.namenode.interval".
{quote}
• What happens if multiple SSMs are accidentally started?
{quote}
We will implement a mechanism to prevent this from happening: during startup,
SSM checks whether another instance is already running and exits if so.
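One common way to implement such a startup check is an exclusive OS-level lock
on a well-known file; the sketch below illustrates the idea and is not the
actual SSM mechanism:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: hold an exclusive lock on a lock file for the lifetime
// of the process; a second instance fails to acquire it and should exit.
public class SingleInstanceGuard {
    // Returns the held lock, or null if another instance already holds it.
    static FileLock tryAcquire(Path lockFile) throws IOException {
        FileChannel ch = FileChannel.open(lockFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        try {
            FileLock lock = ch.tryLock();
            if (lock == null) ch.close(); // locked by another process
            return lock;
        } catch (OverlappingFileLockException e) {
            ch.close();                   // locked within this JVM
            return null;
        }
    }
}
```

The lock is released automatically by the OS if the process dies, so a crashed
instance does not block the next startup.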
> HDFS smart storage management
> -----------------------------
>
> Key: HDFS-7343
> URL: https://issues.apache.org/jira/browse/HDFS-7343
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Kai Zheng
> Assignee: Wei Zhou
> Attachments: HDFSSmartStorageManagement-General-20170315.pdf,
> HDFS-Smart-Storage-Management.pdf,
> HDFSSmartStorageManagement-Phase1-20170315.pdf,
> HDFS-Smart-Storage-Management-update.pdf, move.jpg
>
>
> As discussed in HDFS-7285, it would be better to have a comprehensive and
> flexible storage policy engine considering file attributes, metadata, data
> temperature, storage type, EC codec, available hardware capabilities,
> user/application preference and etc.
> Modified the title for re-purpose.
> We'd extend this effort some bit and aim to work on a comprehensive solution
> to provide smart storage management service in order for convenient,
> intelligent and effective utilizing of erasure coding or replicas, HDFS cache
> facility, HSM offering, and all kinds of tools (balancer, mover, disk
> balancer and so on) in a large cluster.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]