RE: Parallelising grep

Cook, Malcolm Fri, 09 Aug 2013 15:00:45 -0700

Assuming your shell is bash....

With this exported function


function slice {
# PURPOSE: After an optional <-h> lines of header (which are echoed
# unless suppressed with <-sh>), echo every <-n>th line (default:
# every line) starting with the <-m>th (counting from 1, starting
# with the first line after the header; default: the <n>th line).
# AUTHOR: [email protected]
# EXAMPLE: slice -h=1 -sh -n=5 foo.tab > foo_every_fifth_line_after_the_one_line_header.tab
# set -e ;
perl -snwe 'BEGIN{our $n||=1; our $m=($n) unless defined($m); $m-=1;
our $h||=0; die "required: m < n" unless $m < $n; our $sh}
print $_ if (($. > $h) ? (($. -1 - $h) % $n == $m) : ! $sh)' -- "$@"
}
export -f slice


...you can create parallel jobs in which each job greps a slice of in.bam.

Pass parallel's {#} (the job sequence number) as the value for -m, and pass the
same value you give to parallel's -j as the value for -n.

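Each job has to run under bash so that the exported slice function is visible;
if parallel starts some other shell, wrap the pipeline in bash -c (parallel's
-q may help with the quoting).

The following is untested. The ::: {1..10} only serves to start ten jobs; the
slice number itself comes from {#}.

parallel -j 10 'samtools view in.bam | slice -n=10 -m={#} | fgrep -w -f read.ids' ::: {1..10} > alignments.txt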

The slices will appear in the output in the order in which the jobs finish, so
the alignments will not be in their original order.




From: [email protected] [mailto:[email protected]] On Behalf Of Nathan S. Watson-Haigh
Sent: Friday, August 09, 2013 12:54 AM
To: [email protected]
Subject: Parallelising grep

I have a SAM/BAM file and I'd like to grep for the alignments of certain read
IDs. I have the read ID strings in another file. I'm currently doing this with:
$ samtools view in.bam | fgrep -w -f read.ids > alignments.txt

Is it possible to parallelise the grep by having each grep process a different
subset of read IDs from the read.ids file? Or is there an alternative way to
parallelise this that I have overlooked?

Cheers,
Nathan


--
Nathan S. Watson-Haigh, PhD
Research Fellow in Bioinformatics

Australian Centre for Plant Functional Genomics (ACPFG)
School of Agriculture, Food and Wine
University of Adelaide Waite Campus
Plant Genomics Centre
Hartley Grove, Urrbrae
SA 5064

Phone:        +61 8 8313 2046
Mobile:       +61 438 711 615
Skype:        nathanhaigh
Email:        [email protected]
Web:          http://www.acpfg.com.au/bioinformatics
LinkedIn:     http://www.linkedin.com/profile/view?id=114191748
GitHub:       https://github.com/nathanhaigh/
              https://gist.github.com/nathanhaigh/
Twitter:      @watsonhaigh <https://twitter.com/watsonhaigh>
              @BIG_SA1 <https://twitter.com/BIG_SA1>
RID:          B-9833-2008 <http://www.researcherid.com/rid/B-9833-2008>
ResearchGate: Nathan_Watson-Haigh <https://www.researchgate.net/profile/Nathan_Watson-Haigh/>

