Re: suggestion for new option: --block-break

Achim Gratz Fri, 03 May 2019 23:26:00 -0700

Ole Tange writes:
>> > parallel --colsep ';' -j 40 --cat --block 10K --block-breaks '3 
>> > $_=substr($_,-2,2)'
>>
>> I'm not sure what that "3" is doing there - some character transliteration 
>> problem in our email?
>
> 3 is column 3. So $_ will contain the value in column 3. If no number
> given, then $_ is the full line.
>
> This will make it slightly harder to distinguish between a named
> column and some perl code. But I think it is OK to assume:
>
> * --block-breaks value contains only [a-z0-9_] and --header : is set
> => Named column
> * perl code otherwise
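
The rule quoted above could be sketched roughly as follows.  This is
only an illustration in Python, not parallel's actual code:
--block-breaks is a proposed option, and the function name and return
shape are made up for the example.

```python
import re

def classify_block_break(value, header_set):
    """Hypothetical helper: decide how a --block-breaks value is read.

    Returns (kind, column, perl_expr), following the rule in the thread:
    only [a-z0-9_] plus '--header :' set means a named column,
    anything else is perl code, optionally prefixed by a column number.
    """
    if header_set and re.fullmatch(r"[a-z0-9_]+", value):
        return ("named_column", value, None)
    # An optional leading number selects the column that $_ is set to.
    m = re.match(r"(\d+)\s+(.+)", value, re.DOTALL)
    if m:
        return ("perl", int(m.group(1)), m.group(2))
    # No number given: $_ is the full line.
    return ("perl", None, value)
```

With this sketch, '3 $_=substr($_,-2,2)' is read as perl code applied
to column 3, while 'sample_id' together with --header : is read as a
column name.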


I think it would indeed be an interesting extension of parallel.  If I
understand the OP's requirements correctly, each value of the column he
wants to break blocks on occupies one continuous section of rows.  I'm
not familiar with the data formats of genomics, but I believe that some
of them might even have fixed line lengths.  That would allow a binary
search to find the break points before going into the blocking
algorithm, which would be a net win if the preprocessing only has to
read a small fraction of the total blocks.
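
A minimal sketch of that binary search, in Python and under loud
assumptions: every record has exactly the same length in bytes, each
key value forms exactly one continuous run of rows, and the key is the
whole of column 3 with ';' as separator.  The function names are made
up for the example.

```python
def read_key(f, index, record_len, colsep=b";", column=3):
    """Seek straight to record `index` and return its key column."""
    f.seek(index * record_len)
    line = f.read(record_len).rstrip(b"\n")
    return line.split(colsep)[column - 1]

def run_end(f, start, n_records, record_len):
    """Index of the first record after `start` whose key differs.

    Because each key forms one continuous run, 'same key as start' is
    true up to the break and false after it, so bisection applies.
    """
    key = read_key(f, start, record_len)
    lo, hi = start + 1, n_records
    while lo < hi:
        mid = (lo + hi) // 2
        if read_key(f, mid, record_len) == key:
            lo = mid + 1      # break point lies to the right of mid
        else:
            hi = mid          # break point is at mid or to the left
    return lo

def break_points(f, n_records, record_len):
    """Return the end index of every run, in file order."""
    breaks = []
    start = 0
    while start < n_records:
        start = run_end(f, start, n_records, record_len)
        breaks.append(start)
    return breaks
```

Each run costs about log2(n) seeks and short reads instead of a full
scan, so this only pays off when there are few runs compared to the
number of records.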

If so, this would really be a preprocessing step that runs before
parallel, and the extension to parallel itself would be the ability to
accept a list of blocks (which parallel may split further).

> Yeah, I really do not like the name --block-breaks. I like --group-by
> a little better, but not 100% happy with that either.

Or --scatter / --split(-*)?


Regards,
Achim.
-- 
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

Factory and User Sound Singles for Waldorf rackAttack:
http://Synth.Stromeko.net/Downloads.html#WaldorfSounds
