Github user cestella commented on the issue: https://github.com/apache/incubator-metron/pull/467

The performance penalties are minimal. The number of files equals the number of reducers, which does not scale with the data and is user-specifiable. Also, we are sorting only the file handles here, not their contents, so OOM errors are very unlikely. The contents are already sorted by virtue of MapReduce, and the files are named in an ordered way by our custom partitioner; this change just ensures that the files are processed in that order.

I'm not treating this as just a test problem. This is a problem of our assumptions not being correct. It could affect the real pcap system, not just the test, if people are using a non-HDFS FileSystem implementation. For HDFS it's probably not an issue (though honestly I'm not even sure of that in all cases, and there is no guarantee the behavior won't change, since it's not mandated), but I'd rather own our assumptions than depend on FileSystem operations that do not necessarily conform to them.
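A minimal sketch of the idea being described, under assumptions: reducer output files follow the usual `part-r-NNNNN` naming convention, so sorting the listed file names lexicographically (cheap: just the handles, never the contents) yields the processing order. The class and method names here are hypothetical, for illustration only:

```java
import java.util.Arrays;
import java.util.List;

public class SortedFileProcessing {

  // Sort only the file handles (names), not the contents. The contents of each
  // file are already ordered by the MapReduce shuffle sort, and the partitioner
  // names files so that lexicographic name order matches the intended order.
  public static List<String> sortHandles(List<String> fileNames) {
    String[] sorted = fileNames.toArray(new String[0]);
    Arrays.sort(sorted); // O(n log n) on a small, user-bounded number of reducers
    return Arrays.asList(sorted);
  }

  public static void main(String[] args) {
    // A FileSystem.listStatus() call gives no ordering guarantee, so the
    // listing may come back shuffled:
    List<String> listed = Arrays.asList("part-r-00002", "part-r-00000", "part-r-00001");
    System.out.println(sortHandles(listed));
    // [part-r-00000, part-r-00001, part-r-00002]
  }
}
```

Because the number of reducer output files is bounded by the (user-specified) reducer count, the sort touches only a handful of strings regardless of data volume, which is why the performance cost and memory footprint are negligible.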