Ole Tange writes:
>> > parallel --colsep ';' -j 40 --cat --block 10K --block-breaks '3
>> > $_=substr($_,-2,2)'
>>
>> I'm not sure what that "3" is doing there - some character transliteration
>> problem in our email?.
>
> 3 is column 3. So $_ will contain the value in column 3. If no number
> given, then $_ is the full line.
>
> This will make it slightly harder distinguishing between a named
> column or some perl code. But I think it is OK to assume:
>
> * --block-breaks value contains only [a-z0-9_] and --header : is set
>   => Named column
> * perl code otherwise
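
For what it's worth, the rule quoted above could be sketched roughly like
this. It is only an illustration in Python, not parallel's actual (Perl)
code; the function name classify_block_breaks and the header_set flag are
made up for the example.

```python
import re

# Assumption: this mirrors the two-case rule quoted above and nothing more.
# It is NOT taken from GNU parallel's source.
NAME_RE = re.compile(r"[a-z0-9_]+")


def classify_block_breaks(value: str, header_set: bool) -> str:
    """Classify a --block-breaks value as 'named_column' or 'perl_code'."""
    # Named column: only [a-z0-9_] characters AND --header is in effect.
    if header_set and NAME_RE.fullmatch(value):
        return "named_column"
    # Everything else is treated as perl code.
    return "perl_code"
```

Note that a bare number such as 3 also matches [a-z0-9_], so with
--header set this sketch would treat it as a column name.
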
I think it should be an interesting extension of parallel indeed.  If I
gather the OP's requirements right, the column he wants to do the block
break on produces a contiguous section of rows.  I'm not familiar with
the data formats of genomics, but I believe that some of them might even
have fixed line lengths.  That would allow for a binary search to figure
out the break point before going into the blocking algorithm, which would
be a net win if the number of blocks to read for the preprocessing is
only a small fraction of the total blocks.  If so, it really would be a
preprocessing step to run before entering parallel, and the extension to
parallel would be to enable handing off a list of blocks (that parallel
may further split) to it.

> Yeah, I really do not like the name --block-breaks. I like --group-by
> a little better, but not 100% happy with that either.

Or --scatter / --split(-*)?


Regards,
Achim.
-- 
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

Factory and User Sound Singles for Waldorf rackAttack:
http://Synth.Stromeko.net/Downloads.html#WaldorfSounds
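
P.S. A minimal sketch of the binary search idea, in Python. It assumes
fixed-length records and that each key value occupies exactly one
contiguous run of rows. All names (find_run_end, break_points, key_of)
and the sample data are hypothetical.

```python
import io


def read_key(f, index, record_len, key_of):
    """Seek to record `index` and return its key."""
    f.seek(index * record_len)
    return key_of(f.read(record_len))


def find_run_end(f, start, n_records, record_len, key_of):
    """Return the index of the first record after `start` with a different key.

    Assumes every record in the run starting at `start` is contiguous.
    """
    key = read_key(f, start, record_len, key_of)
    lo, hi = start + 1, n_records
    while lo < hi:
        mid = (lo + hi) // 2
        if read_key(f, mid, record_len, key_of) == key:
            lo = mid + 1  # still inside the run
        else:
            hi = mid      # already past the run
    return lo


def break_points(f, n_records, record_len, key_of):
    """Return the record index at which each run ends (exclusive)."""
    points, i = [], 0
    while i < n_records:
        i = find_run_end(f, i, n_records, record_len, key_of)
        points.append(i)
    return points


# Six 4-byte records; the key is the first ';'-separated column.
demo = io.BytesIO(b"a;1\na;2\nb;1\nb;2\nb;3\nc;1\n")
demo_points = break_points(demo, 6, 4, lambda rec: rec.split(b";")[0])
```

Each run costs on the order of log2(N) seeks, which is why this only
pays off when there are few runs compared to the number of blocks.
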
