Help me explain --demux

Ole Tange Thu, 14 Feb 2019 16:29:06 -0800

I have implemented --demux.

I need help explaining what this is and how to use it.


--demultiplex or --demux splits a stream of records and gives each
record to a process chosen by the hash value of a column.

The idea is that you have a file with columns (e.g. a CSV file or an
Apache access.log) and you want it processed by your program PRG.
PRG works great but is slow. PRG also requires that all lines for a
given customer go to the same process. One process may handle several
customers, and it does not matter which process gets a given customer.
The only restriction is that all lines for a given customer must go to
one process and only that process.

It is a bit similar to SQL's GROUP BY combined with --round-robin. You
select the column with the customer ID, and GNU Parallel makes sure
that all lines with a given ID value are always passed to the same
process.
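
To make the idea concrete, here is a minimal Python sketch of the
mapping described above. It is not GNU Parallel's code: the function
names slot_for and demux are made up for illustration, and CRC32 is
only a stand-in for whatever hash function --demux really uses. The
point is just that hashing the chosen column and taking the result
modulo the number of job slots sends equal IDs to the same slot.

```python
import zlib


def slot_for(key: str, n_slots: int) -> int:
    """Map a key to a slot number; equal keys always give the same slot.

    CRC32 is an assumption chosen because it is deterministic across
    runs; the hash used by GNU Parallel is not stated in this post.
    """
    return zlib.crc32(key.encode("utf-8")) % n_slots


def demux(lines, column, colsep, n_slots):
    """Distribute lines over n_slots buckets by hashing one column.

    column is 1-based, mirroring "--demux 1".
    """
    buckets = [[] for _ in range(n_slots)]
    for line in lines:
        key = line.split(colsep)[column - 1].strip()
        buckets[slot_for(key, n_slots)].append(line)
    return buckets
```

Which slot a given ID ends up in is arbitrary. With fewer slots than
distinct IDs, several IDs share a slot, and some slots may even stay
empty.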

It could for example be used for splitting on domain name: All lines
for a given domain will be given to the same process; multiple domains
may be given to this process, but the same domain name will never be
given to another process.

Example:

ID, other, stuff
A, foo, bar
B, baz, quux
C, xyzzy, qux
A, quz, corge
C, foobar, corge
A, plug, grault

I want all A's to go to one job slot, all B's to go to one job slot,
and all C's to go to one job slot.

If I start 2 job slots, then two values will go to the same job slot
(maybe B and C) - and that is fine.

  cat my.csv | parallel --pipe --demux 1 --colsep , -j2 PRG

--demux 1 tells GNU Parallel that column 1 is to be used as the ID.
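
For readers who want to see the intended effect without GNU Parallel,
the following Python sketch imitates the command above under some
loud assumptions: run_demux is a hypothetical helper, CRC32 stands in
for the real hash, the header line is left out because its handling
is not covered here, and a tiny Python one-liner that echoes stdin
plays the role of PRG. It buffers all input before starting the
workers, so it illustrates the routing only, not streaming.

```python
import subprocess
import sys
import zlib
from concurrent.futures import ThreadPoolExecutor


def run_demux(lines, column, colsep, n_slots, command):
    """Run n_slots copies of command; route each line by hashing a column.

    Returns the stdout of each copy, one string per job slot.
    """
    # Decide the job slot for every line (column is 1-based).
    buckets = [[] for _ in range(n_slots)]
    for line in lines:
        key = line.split(colsep)[column - 1].strip()
        buckets[zlib.crc32(key.encode("utf-8")) % n_slots].append(line)

    def run_one(bucket):
        # Each job slot gets its own process and only its own lines.
        done = subprocess.run(command, input="".join(bucket),
                              capture_output=True, text=True, check=True)
        return done.stdout

    # One thread per slot, so the copies of the command run concurrently.
    with ThreadPoolExecutor(max_workers=n_slots) as pool:
        return list(pool.map(run_one, buckets))


if __name__ == "__main__":
    rows = ["A, foo, bar\n", "B, baz, quux\n", "C, xyzzy, qux\n",
            "A, quz, corge\n", "C, foobar, corge\n", "A, plug, grault\n"]
    # Stand-in for PRG: copy stdin to stdout unchanged.
    prg = [sys.executable, "-c",
           "import sys; sys.stdout.write(sys.stdin.read())"]
    for slot, out in enumerate(run_demux(rows, 1, ",", 2, prg)):
        print(f"job slot {slot}:")
        print(out, end="")
```

Every A line shows up under a single job slot, and the same holds for
B and C. Which IDs end up sharing a slot depends on the hash.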

Help me explain this more clearly.


/Ole
