[ 
https://issues.apache.org/jira/browse/HDFS-7343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592342#comment-15592342
 ] 

Wei Zhou commented on HDFS-7343:
--------------------------------

Continuing on comments from [~andrew.wang]:
{quote}
Could you talk a little bit more about the rules solver? What happens when a 
rule cannot be satisfied?
{quote}
A rule is a declaration that defines actions to be applied to certain objects 
under certain conditions. It's a guideline for SSM to function. The rule solver 
parses a rule and takes the specified action once its predefined condition is 
fulfilled. But this does not mean the action will necessarily be executed 
physically; that depends on many factors. For example, the amount of available 
memory has to be checked before caching a file; if there is not enough memory, 
the action will be canceled. 
From the above, a rule is essentially a hint for SSM. When a rule cannot be 
satisfied, SSM can log the reason to a log file, the console, or the dashboard, 
and the admin can check that information for further processing.
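The flow described above can be sketched as follows. This is a minimal, hypothetical illustration (the `Rule` shape, field names, and log messages are invented, not SSM's actual interface): a rule whose condition is fulfilled is only a candidate for execution, and a resource check can still cancel it, with the reason logged.

```python
# Hypothetical sketch of the rule solver: a rule pairs a condition with an
# action, and the solver treats the triggered action as a hint -- it is
# canceled when a resource check (here, free cache memory) fails, and the
# outcome is logged either way.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]  # evaluated against cluster state
    action: str                        # e.g. "cache", "archive"
    required_mem: int                  # memory needed to run the action

def solve(rules: List[Rule], state: dict) -> List[str]:
    """Return log lines describing what happened to each triggered rule."""
    log = []
    for rule in rules:
        if not rule.condition(state):
            continue                   # condition not fulfilled yet
        if state["free_mem"] < rule.required_mem:
            log.append(f"{rule.name}: canceled, not enough memory")
        else:
            state["free_mem"] -= rule.required_mem
            log.append(f"{rule.name}: executed {rule.action}")
    return log

state = {"free_mem": 100, "access_count": 12}
rules = [
    Rule("cache-hot-file", lambda s: s["access_count"] > 10, "cache", 80),
    Rule("cache-another", lambda s: s["access_count"] > 10, "cache", 80),
]
print(solve(rules, state))
```

Both rules' conditions hold, but only the first fits in memory; the second is canceled and the reason is logged, mirroring the "rule as hint" behavior above.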
{quote}
improve average throughput, but not improve end-to-end execution time (the SLO 
metric).
{quote}
SSM pays more attention to the efficiency of the whole cluster than to any 
particular workload; it may not improve the end-to-end execution time of one 
workload, but it may improve another workload in the cluster. Another case is 
that it won't help a CPU-intensive workload, even though we do optimize I/O. 
To make SSM work better, we could expose some interfaces for workloads to 
provide hints to SSM. 
{quote}
Also on the rules solver, how do we quantify the cost of executing an action? 
It's important to avoid unnecessarily migrating data back and forth.
{quote}
It's very hard to quantify the cost generally in a dynamic environment. Moving 
hot data to faster storage may hurt performance now but boost it later. What we 
do now is try to minimize the cost based on access history, the current status 
of the cluster, rules, and other mechanisms such as hints from users. Strict 
conditions have to be fulfilled (rules, cluster state, history, hints, etc.) 
before an action is actually executed. Generally, the greater the cost, the 
stricter the conditions. For example, actions like archiving a file or 
balancing the cluster may depend more heavily on rules or user hints than an 
action like caching a file. Yes, it's very important to avoid unnecessarily 
migrating data back and forth, and SSM tries to minimize that from the very 
beginning.
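One common way to avoid back-and-forth migration (offered here as an illustration of the idea, not as SSM's actual policy) is a hysteresis band: promote only above a high watermark and demote only below a lower one, so an access rate hovering near a single threshold never triggers repeated moves.

```python
# Illustrative hysteresis sketch (assumed mechanism, not SSM's real policy):
# a file is promoted to fast storage only when its access rate exceeds a
# high watermark and demoted only when it falls below a lower one, so a
# rate fluctuating inside the band causes no migration at all.

HIGH, LOW = 100, 40  # accesses/hour; the gap between them is the safety margin

def next_tier(current_tier: str, access_rate: float) -> str:
    if current_tier == "HDD" and access_rate > HIGH:
        return "SSD"       # clearly hot: promote
    if current_tier == "SSD" and access_rate < LOW:
        return "HDD"       # clearly cold: demote
    return current_tier     # inside the band: stay put, no migration

# A rate oscillating between 60 and 80 never moves the file either way.
tier = "HDD"
for rate in [60, 80, 60, 80]:
    tier = next_tier(tier, rate)
print(tier)  # HDD
```

With a single threshold at, say, 70, the same oscillating rate would migrate the file on every sample; the band is what suppresses the ping-pong.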
{quote}
Could you talk some more about the value of Kafka in this architecture, 
compared to a naive implementation that just polls the NN and DN for 
information? 
HDFS's inotify mechanism might also be interesting here.
{quote}
Please also refer to the reply to question #3 from [~anu]. For SSM, Kafka 
serves two roles:
1. It's a message collector for SSM. It provides an efficient and reliable way 
for nodes to send messages out. If all the nodes sent messages to SSM directly, 
it would be very hard for SSM to handle issues such as message buffering, 
persisting messages to avoid loss, and unstable service times caused by too 
many nodes. Kafka decouples SSM from the cluster and lets it focus on the 
message-processing logic.
2. It's a message recorder for SSM. If SSM is stopped by the user or crashes 
while the HDFS cluster is still working, messages from the nodes can be stored 
in Kafka. These messages are good material for SSM to warm up quickly; without 
Kafka, this precious data would be lost. It makes SSM more robust.
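The second role can be illustrated with a toy append-only log standing in for Kafka (all names here are invented for illustration): producers keep appending while the consumer is down, and the consumer resumes from its last committed offset and replays the backlog when it comes back.

```python
# Toy stand-in for Kafka: an append-only log plus a consumer offset.
# DNs keep appending events even while SSM is down; when SSM restarts,
# it reads from its last committed offset and replays the backlog to
# warm up, so no message is lost.

class Log:
    def __init__(self):
        self.records = []            # the durable, append-only log

    def append(self, record):
        self.records.append(record)

    def read_from(self, offset):
        return self.records[offset:]

log = Log()
for i in range(3):                   # DNs publish while SSM is running
    log.append(f"access:file{i}")

ssm_offset = len(log.records)        # SSM commits its position, then crashes

for i in range(3, 5):                # the cluster keeps publishing meanwhile
    log.append(f"access:file{i}")

backlog = log.read_from(ssm_offset)  # SSM restarts and replays the backlog
print(backlog)  # ['access:file3', 'access:file4']
```

Without the durable log in the middle, the two events published during the outage would simply be gone, which is the robustness point made above.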
{quote}
Also wondering if with Kafka we still need a periodic snapshot of state, since 
Kafka is just a log.
{quote}
SSM snapshots the data digested from those raw logs, along with other managed 
information, but the raw logs themselves are not stored. The snapshotted data 
is essential for SSM to function well. 
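The digest-then-snapshot idea might look like the following sketch (the event format and the choice of per-path counters are assumptions for illustration): raw events are folded into a compact digest, only the digest is persisted, and the raw events are discarded.

```python
# Sketch of digest-then-snapshot (assumed shapes, for illustration only):
# raw access events are folded into compact per-path counters; only the
# counters are persisted as the snapshot, and the raw logs are not kept.

import json
from collections import Counter

raw_events = ["open /a", "open /b", "open /a", "open /a"]

digest = Counter(e.split()[1] for e in raw_events)   # per-path access counts
snapshot = json.dumps(dict(digest), sort_keys=True)  # what would be persisted
raw_events.clear()                                   # raw logs are discarded

print(snapshot)  # {"/a": 3, "/b": 1}
```

The snapshot stays small and bounded by the number of files, not by the number of events, which is why snapshotting the digest rather than the raw log scales.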
{quote}
The doc talks a lot about improving performance, but I think the more important 
usecase is actually saving cost by migrating data to archival or EC storage. 
This is because of the above difficulties surrounding actually understanding 
application-level performance with just FS-level information.
{quote}
Agreed that it's an important use case, and, as you mentioned, it's impossible 
for SSM itself to improve performance in all cases. But the trend is that DNs 
will have larger memory and faster storage, and making good use of this 
hardware to improve performance is also an important problem to solve. For 
example, [[email protected]] and I did a [study on HSM | 
http://blog.cloudera.com/blog/2016/06/new-study-evaluating-apache-hbase-performance-on-modern-storage-media/]
 last year; we found that the throughput of a cluster with 4 SSDs + 4 HDDs on 
each DN was 1.36x that of a cluster with 8 HDDs on each DN, almost as good as 
a cluster with 8 SSDs on each DN. The same held for latency. So SSM should be 
able to improve performance by using fast storage efficiently. I think we need 
more investigation and effort to enhance SSM's capability in this respect. 
Maybe we could also provide user-friendly APIs for users to make more active 
use of the cache and fast storage.
{quote}
So, some simple rules with time-based triggers or looking at file atimes might 
get us 80% of what users want.
{quote}
Thanks for the helpful information! It's a good suggestion to use time-based 
triggers on the existing facilities in HDFS. Maybe SSM can also work very well 
with these simple rules.
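A simple atime-based rule of the kind suggested above might be sketched as follows (the threshold and function names are invented for illustration; this is not SSM's actual rule syntax):

```python
# Illustrative atime-based rule (threshold invented): files not accessed
# for N days become candidates for migration to archival storage.

import time

ARCHIVE_AFTER_DAYS = 90

def candidates_for_archive(files, now=None):
    """files: dict mapping path -> last-access time (epoch seconds)."""
    now = time.time() if now is None else now
    cutoff = now - ARCHIVE_AFTER_DAYS * 86400
    return sorted(path for path, atime in files.items() if atime < cutoff)

now = 1_000_000_000
files = {
    "/logs/old.log": now - 120 * 86400,    # untouched for 120 days -> archive
    "/data/hot.parquet": now - 2 * 86400,  # accessed recently -> keep
}
print(candidates_for_archive(files, now))  # ['/logs/old.log']
```

Run periodically (the time-based trigger), a rule this simple already captures the "cold data moves to cheap storage" case that covers most of what users want.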


> HDFS smart storage management
> -----------------------------
>
>                 Key: HDFS-7343
>                 URL: https://issues.apache.org/jira/browse/HDFS-7343
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Kai Zheng
>            Assignee: Wei Zhou
>         Attachments: HDFS-Smart-Storage-Management.pdf
>
>
> As discussed in HDFS-7285, it would be better to have a comprehensive and 
> flexible storage policy engine considering file attributes, metadata, data 
> temperature, storage type, EC codec, available hardware capabilities, 
> user/application preference and etc.
> Modified the title for re-purpose.
> We'd extend this effort some bit and aim to work on a comprehensive solution 
> to provide smart storage management service in order for convenient, 
> intelligent and effective utilizing of erasure coding or replicas, HDFS cache 
> facility, HSM offering, and all kinds of tools (balancer, mover, disk 
> balancer and so on) in a large cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
