I have implemented --demux. I need help explaining what this is and how to use it.
--demultiplex or --demux splits a stream of records and gives each record to a process based on the hash value of a column.

The idea is that you have a file with columns (e.g. a CSV file or an Apache access.log). You want this file processed by your program PRG. PRG works great but is slow. PRG also requires that all lines for a given customer are given to the same process, but it is fine for one process to get multiple customers, and it does not matter which process gets a given customer. The only restriction is that all lines for a given customer must be given to one process and only that process.

It is a bit similar to SQL's GROUP BY combined with --round-robin.

So you select the column with the customer ID, and GNU Parallel will make sure that all lines with a given ID are always passed to the same process.

It could for example be used for splitting on domain name: All lines for a given domain will be given to the same process; multiple domains may be given to this process, but the same domain name will never be given to another process.

Example:

    ID, other, stuff
    A, foo, bar
    B, baz, quux
    C, xyzzy, qux
    A, quz, corge
    C, foobar, corge
    A, plug, grault

I want all A's to go to one job slot, all B's to go to one job slot, and all C's to go to one job slot. If I start 2 job slots, then two values will go to the same job slot (maybe B and C), and that is fine.

    cat my.csv | parallel --pipe --demux 1 --colsep , -j2 PRG

--demux 1 tells GNU Parallel that column 1 is to be used as the ID.

Help me explain this more clearly.

/Ole
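To make the grouping rule concrete, here is a minimal Python sketch of the idea: hash the chosen column and take the result modulo the number of job slots. It is only an illustration. The function name demux, the use of CRC32 as the hash, and the in-memory buckets are assumptions made for this example, not how GNU Parallel implements --demux. The data rows are the ones from the example above, without the header line.

```python
import csv
import io
import zlib


def demux(lines, column, slots, colsep=","):
    """Group records into `slots` buckets by hashing one column.

    Hypothetical helper for illustration only: CRC32 is an assumed
    stand-in hash, not the hash GNU Parallel uses.
    `column` is 1-based, like the argument to --demux.
    """
    buckets = [[] for _ in range(slots)]
    reader = csv.reader(lines, delimiter=colsep, skipinitialspace=True)
    for row in reader:
        if not row:
            continue  # skip blank lines
        key = row[column - 1].strip()
        # Same key -> same hash -> same slot, every time.
        slot = zlib.crc32(key.encode("utf-8")) % slots
        buckets[slot].append(row)
    return buckets


DATA = """\
A, foo, bar
B, baz, quux
C, xyzzy, qux
A, quz, corge
C, foobar, corge
A, plug, grault
"""

buckets = demux(io.StringIO(DATA), column=1, slots=2)
for slot, rows in enumerate(buckets):
    ids = sorted({row[0] for row in rows})
    print(f"job slot {slot}: IDs {ids}, {len(rows)} records")
```

With 3 IDs and 2 slots, at least two IDs must share a slot, but no ID is ever split across slots. Which IDs end up together depends only on the hash, so a slot may also stay empty.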
