It has been almost a year since eBay's tsv utilities were open-sourced. Since then there have been a number of additions and updates, and the tools now form a more complete package. The tools help with manipulating the tabular data files common in machine learning and data mining environments. They work alongside traditional Unix command line tools like 'cut' and 'sort', and they fit well with data mining and stats packages like R and Pandas.

The tools cover filtering, slicing, joins and other manipulation, sampling, and statistical calculations. If you find yourself working with large data files from a Unix shell, you may like these tools.
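
For readers unfamiliar with this kind of work, here is a rough Python sketch of what "filtering" and "slicing" a TSV file mean in practice: keep the rows where a numeric field exceeds a threshold, then keep only a few named columns. It is an illustration only, not how the tools are implemented or invoked. The function name, column names, and sample data are all made up for the example.

```python
import csv
import io


def filter_and_select(tsv_text, filter_col, min_value, keep_cols):
    """Keep rows where filter_col > min_value, then keep only keep_cols.

    Hypothetical helper for illustration; not part of the tsv utilities.
    """
    # TSV has no quoting rules, so treat quote characters as ordinary text.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    header = next(reader)
    filter_idx = header.index(filter_col)
    keep_idx = [header.index(name) for name in keep_cols]

    # The first output row is the sliced header.
    rows = [[header[i] for i in keep_idx]]
    for row in reader:
        if float(row[filter_idx]) > min_value:      # filtering
            rows.append([row[i] for i in keep_idx])  # slicing
    return rows


# Made-up sample data: three records with a numeric 'weight' field.
sample = "id\tcolor\tweight\n1\tred\t12.5\n2\tblue\t3.0\n3\tgreen\t20.1\n"
result = filter_and_select(sample, "weight", 10, ["id", "color"])
for row in result:
    print("\t".join(row))
```

A script like this is fine for small files. The point of dedicated command line tools is to do the same kind of operation quickly on very large files and to compose with other tools in a shell pipeline.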

Speed matters when processing large data files, and these tools are fast. I've published new benchmarks comparing the tools to similar tools written in several natively compiled programming languages. The tools are the fastest on five of the six benchmarks run, generally by significant margins. It's a good result for the D programming language. The benchmarks may be of interest even if the tools themselves are not.

Repository: https://github.com/eBay/tsv-utils-dlang
Performance benchmarks: https://github.com/eBay/tsv-utils-dlang/blob/master/docs/Performance.md

--Jon
