Re: [Genome] filtering

Brooke Rhead Fri, 25 May 2012 16:24:34 -0700

Hi Animesh,

We recommend using the pslCDnaFilter tool.  It is available in the 
source tree:
http://hgdownload.cse.ucsc.edu/downloads.html#source_downloads


or as an executable:
http://hgdownload.cse.ucsc.edu/admin/exe/

Run the command with no arguments to see this usage statement:

$ pslCDnaFilter
wrong # of args:  pslCDnaFilter [options] inPsl outPsl

Filter cDNA alignments in psl format.  Filtering criteria are
comparative, selecting near best in genome alignments for each
given cDNA and non-comparative, based only on the quality of an
individual alignment.

WARNING: comparative filters requires that the input is sorted by
query name.  The command: 'sort -k 10,10' will do the trick.

Each alignment is assigned a score that is based on identity and
weighted towards longer alignments and those with introns.  This
can do either global or local best-in-genome selection.  Local
near best in genome keeps fragments of an mRNA that align in
discontinuous locations from other fragments.  It is useful for
unfinished genomes.  Global near best in genome keeps alignments
based on overall score.

Options:
    -algoHelp - print message describing the filtering algorithm.

    -localNearBest=-1.0 - local near best in genome filtering,
     keeping aligments within this fraction of the top score for
     each aligned portion of the mRNA. A value of zero keeps only
     the best for each fragment. A value of -1.0 disables
     (default).

    -globalNearBest=-1.0 - global near best in genome filtering,
     keeping aligments withing this fraction of the top score.  A
     value of zero keeps only the best alignment.  A value of -1.0
     disables (default).

    -ignoreNs - don't include Ns (repeat masked) while calculating the
     score and coverage. That is treat them as unaligned rather than
     mismatches.  Ns are still counts as mismatches when calculating
     the identity.

    -ignoreIntrons - don't favor apparent introns when scoring.

    -minId=0.0 - only keep alignments with at least this fraction
     identity.

    -minCover=0.0 - minimum fraction of query that must be
     aligned.  If -polyASizes is specified and the query is in
     the file, the ploy-A is not included in coverage
     calculation.

    -decayMinCover  -  the minimum coverage is calculated
     per alignment from the query size using the formula:
        minCoverage = 1.0 - qSize / 250.0
     and minCoverage is bounded between 0.25 and 0.9.

    -minSpan=0.0 - keep only alignments whose target length are
     at least this fraction of the longest alignment passing the
     other filters.  This can be useful for removing possible
     retroposed genes.

    -minQSize=0 - drop queries shorter than this size

    -minAlnSize=0 - minimum number of aligned bases.  This includes
     repeats, but excludes poly-A/poly-T bases if available.

    -minNonRepSize=0 - Minimum number of matching bases that are not 
repeats.
     This does not include mismatches.
     Must use -repeats on BLAT if doing unmasked alignments.

    -maxRepMatch=1.0 - Maximum fraction of matching bases
     that are repeats.  Must use -repeats on BLAT if doing
     unmasked alignments.

    -maxAligns=-1 - maximum number of alignments for a given query. If
     exceeded, then alignments are sorted by score and only this number
     will be saved.  A value of -1 disables (default)

    -polyASizes=file - tab separate file with information about
     poly-A tails and poly-T heads.  Format is outputted by
     faPolyASizes:

         id seqSize tailPolyASize headPolyTSize

    -usePolyTHead - if a poly-T head was detected and is longer
     than the poly-A tail, it is used when calculating coverage
     instead of the poly-A head.

    -bestOverlap - filter overlapping alignments, keeping the best of
     alignments that are similar.  This is designed to be used with
     overlapping, windowed alignments, where one alignment might be 
truncated.
     Does not discarding ones with weird overlap unless 
-filterWeirdOverlapped
     is specified.

    -hapRegions=psl - PSL format alignments of each haplotype 
pseudo-chromosome
     to the corresponding reference chromosome region.  This is used to map
     alignments between regions.

    -dropped=psl - save psls that were dropped to this file.

    -weirdOverlapped=psl - output weirdly overlapping PSLs to
     this file.

    -filterWeirdOverlapped - Filter weirdly overlapped alignments, keeping
     the single highest scoring one or an arbitrary one if multiple with
     the same high score.

    -alignStats=file - output the per-alignment statistics to this file

    -uniqueMapped - keep only cDNAs that are uniquely aligned after all
     other filters have been applied.

    -noValidate - don't run pslCheck validation.

    -verbose=1 - 0: quite
                 1: output stats
                 2: list problem alignment (weird or invalid)
                 3: list dropped alignments and reason for dropping
                 4: list kept psl and info
                 5: info about all PSLs

    -hapRefMapped=psl - output PSLs of haplotype to reference chromosome
     cDNA alignments mappings (for debugging purposes).

    -hapRefCDnaAlns=psl - output PSLs of haplotype cDNA to reference cDNA
     alignments (for debugging purposes).

    -alnIdQNameMode - add internal assigned alignment numbers to cDNA names
     on output.  Useful for debugging, as they are include in the verbose
     tracing as [#1], etc.  Will make a mess of normal production usage.

    -blackList=file.txt - adds a list of accession ranges to a black list.
     Any accession on this list is dropped. Black list file is two columns
     where the first column is the beginning of the range, and the second
     column is the end of the range, inclusive.


The default options don't do any filtering. If no filtering
criteria are specified, all PSLs will be passed though, except
those that are internally inconsistent.

THE INPUT MUST BE BE SORTED BY QUERY for the comparative filters.


If you have further questions, please contact us again at 
[email protected].

--
Brooke Rhead
UCSC Genome Bioinformatics Group


On 5/25/12 12:07 AM, Animesh Anand Mishra wrote:
> Dear Sir/Madam,
> I did a BLAT alignment for cDNA of Homo Sapiens which I got from UCSC
> against the refseq library of UCSC. As expected I have got more than
> one alignment for each cDNA.
> Can you suggest me some criteria on which I could further narrow down
> my hits keeping in mind that I want to search for cDNA which are
> harboring the intronic regions of refseq genes.
>
> Best,
> Animesh Anand Mishra
> Department of Biological Sciences
> IISER-Bhopal
>
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome

Re: [Genome] filtering

Reply via email to