An ideal IDS dataset would be fully diverse (in terms of types of attacks) and completely free of artifacts (incurred during creation and pre-processing). However, ideal scenarios rarely hold in real life -- and if they did, they would not be real...
I agree that it is very hard to obtain datasets with payloads due to privacy constraints. Good anonymization procedures mostly retain the relative statistics of the data; for example, you may consult the following work by people at ICSI: http://www.icir.org/enterprise-tracing/devil-ccr-jan06.pdf

An overwhelming majority of network-based IDSs use only the spatial information present in packet headers, and the datasets that I mentioned in my earlier post can be used to evaluate such IDSs. Moreover, you can find details of the endpoint worm propagation dataset in the following papers: http://www.nexginrc.org/papers/tr15-zubair.pdf http://www.nexginrc.org/papers/gecco08-zubair.pdf

In my view, there are two directions to take dataset labeling further:

1. Improving injection procedures to ensure minimization of artifacts. This is more feasible if you know all parameters and environmental conditions during trace collection -- Know Thy Data.

2. Using "semi-automated" ~ "semi-manual" procedures. @Stefano: You have probably missed this point. Semi-automated procedures still require manual intervention, but they reduce its magnitude significantly. So, we are not exactly developing a typical anomaly detection system.

Let me know what you think.
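To make the second direction concrete, here is a minimal sketch of what a "semi-automated" labeling pass might look like: flows that a heuristic scores as confidently benign or confidently malicious are labeled automatically, and only the ambiguous middle band is queued for manual review. The scoring heuristic, the feature names, and the thresholds below are all illustrative assumptions, not part of any of the cited work.

```python
# Hypothetical semi-automated labeling sketch. The heuristic, features,
# and thresholds are illustrative assumptions for discussion only.

def score_flow(flow):
    """Toy anomaly score in [0, 1] using header-level features only."""
    score = 0.0
    if flow["dst_port"] in {4444, 31337}:  # ports historically abused by malware
        score += 0.5
    if flow["pkt_rate"] > 1000:            # unusually high packet rate
        score += 0.3
    if flow["syn_only"]:                   # SYNs without completed handshakes
        score += 0.3
    return min(score, 1.0)

def semi_automated_label(flows, low=0.2, high=0.7):
    """Auto-label confident cases; defer ambiguous ones to a human analyst."""
    auto_labeled, manual_queue = [], []
    for flow in flows:
        s = score_flow(flow)
        if s <= low:
            auto_labeled.append((flow, "benign"))
        elif s >= high:
            auto_labeled.append((flow, "attack"))
        else:
            manual_queue.append(flow)      # still needs manual intervention
    return auto_labeled, manual_queue

flows = [
    {"dst_port": 80,    "pkt_rate": 10,   "syn_only": False},  # clearly benign
    {"dst_port": 31337, "pkt_rate": 2000, "syn_only": True},   # clearly suspicious
    {"dst_port": 443,   "pkt_rate": 1500, "syn_only": False},  # ambiguous
]
auto_labeled, manual_queue = semi_automated_label(flows)
print(len(auto_labeled), len(manual_queue))  # 2 1
```

The point of the sketch is the split itself: the heuristic handles the bulk of the traffic, while the analyst's effort is concentrated on the small ambiguous band, which is exactly where manual intervention is reduced in magnitude rather than eliminated.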
