Hi Pouya,

Sorry that command didn't work as advertised -- the misinfo was my bad.  
Looking more closely at our build documentation, there were actually two 
ldHgGene commands, invoked on separate .gtf input files -- and then the .gtf 
input files were concatenated together to make the wgEncodeGencodeAutoV4.gtf.gz 
that is available for download.  

The reason for two input files and two commands: one input file had lines with 
the expected "exon" keyword in the 3rd column, while the other input file had 
"tRNAscan" in the 3rd column for all items.  The earlier command did not tell 
ldHgGene to treat "tRNAscan" as an exon keyword, so those lines were silently 
ignored.  To make ldHgGene see only the tRNAscan items, use this pipe (note the 
-nobin option, which suppresses the extra bin column):

zcat wgEncodeGencodeAutoV4.gtf.gz | grep -Fw tRNAscan \
  | ldHgGene -nobin -exon=tRNAscan -genePredExt hg19 wgEncodeGencodeAutoV4 stdin \
    -out=wgEncodeGencodeAutoV4.tRNAscan.genePred
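
If you want to check which keywords actually occur in the 3rd column before 
picking an -exon= value, a tally along these lines may help.  This is only a 
sketch: it builds a tiny made-up file (toy.gtf.gz, with invented IDs and 
coordinates) so that it runs anywhere; on the real data you would point the two 
pipelines at wgEncodeGencodeAutoV4.gtf.gz instead.  The second pipeline lists 
transcript_ids that never appear on an "exon" line, which is presumably what 
the "no exons defined for ..." messages from gtfToGenePred were pointing at.

```shell
# Build a tiny stand-in GTF (invented records, for illustration only).
tmp=$(mktemp -d)
printf 'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n' >  "$tmp/toy.gtf"
printf 'chr1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n' >> "$tmp/toy.gtf"
printf 'chr2\tsrc\ttRNAscan\t500\t572\t.\t-\t.\tgene_id "g2"; transcript_id "t2";\n' >> "$tmp/toy.gtf"
gzip "$tmp/toy.gtf"

# Tally the feature keywords found in column 3, printed as keyword=count.
feature_counts=$(gunzip -c "$tmp/toy.gtf.gz" | cut -f3 | sort | uniq -c \
  | awk '{print $2 "=" $1}')
echo "$feature_counts"

# List transcript_ids that never appear on a line whose 3rd column is "exon".
no_exon=$(gunzip -c "$tmp/toy.gtf.gz" | awk -F'\t' '
  { if (match($9, /transcript_id "[^"]+"/)) {
      # skip the 15 characters of: transcript_id "   and drop the closing quote
      id = substr($9, RSTART + 15, RLENGTH - 16)
      seen[id] = 1
      if ($3 == "exon") has_exon[id] = 1 } }
  END { for (id in seen) if (!(id in has_exon)) print id }' | sort)
echo "$no_exon"
```

On the toy file this reports two exon lines and one tRNAscan line, and flags 
t2 as the only transcript without an exon line.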

Then append wgEncodeGencodeAutoV4.tRNAscan.genePred to 
wgEncodeGencodeAutoV4.genePred to get the complete set.  Or better yet, sort 
to maintain ordering:

sort -k1,1 -k2n,2n wgEncodeGencodeAutoV4.genePred \
  wgEncodeGencodeAutoV4.tRNAscan.genePred > wgEncodeGencodeAutoV4.complete.genePred
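
A quick sanity check after combining is to confirm that no lines were lost.  
The sketch below uses two made-up genePred files (main.genePred and 
tRNAscan.genePred, with invented names and coordinates, and assuming the 
-nobin layout with no bin column); with the real files you would substitute the 
three file names from the sort command.

```shell
tmp=$(mktemp -d)
# Invented stand-ins for the two genePred files (no bin column).
printf 'txB\tchr1\t+\t100\t400\t100\t400\t2\t100,300,\t200,400,\n' >  "$tmp/main.genePred"
printf 'txC\tchr2\t-\t50\t90\t50\t90\t1\t50,\t90,\n'               >> "$tmp/main.genePred"
printf 'txA\tchr2\t-\t500\t572\t572\t572\t1\t500,\t572,\n'         >  "$tmp/tRNAscan.genePred"

# Combine with the same sort keys as the sort command in the message.
sort -k1,1 -k2n,2n "$tmp/main.genePred" "$tmp/tRNAscan.genePred" \
  > "$tmp/complete.genePred"

# The combined file should have as many lines as both inputs together.
expected=$(cat "$tmp/main.genePred" "$tmp/tRNAscan.genePred" | wc -l | tr -d ' ')
actual=$(wc -l < "$tmp/complete.genePred" | tr -d ' ')
if [ "$expected" -eq "$actual" ]; then status=ok; else status=mismatch; fi
echo "lines: expected=$expected actual=$actual ($status)"

# First name in the combined file, to eyeball the ordering.
first_name=$(head -n 1 "$tmp/complete.genePred" | cut -f1)
echo "first record: $first_name"
```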

BTW, you can run gtfToGenePred, ldHgGene or almost any kent/src util directly 
on a .gz file; it will recognize the suffix and decompress.  In the above 
example, I had to use zcat (gunzip -c) in order to pipe to grep, but in the 
earlier command I should have just called ldHgGene directly on the .gtf.gz 
file (and used -nobin).

Hope that helps and sorry again that the previous command wasn't sufficient,

Angie


----- "Pouya Kheradpour" <[email protected]> wrote:

> From: "Pouya Kheradpour" <[email protected]>
> To: "Katrina Learned" <[email protected]>, [email protected]
> Sent: Monday, August 30, 2010 8:45:47 AM GMT -08:00 US/Canada Pacific
> Subject: Re: [Genome] Downloading Gencode annotations in GenePred format
>
> Hi Katrina,
> 
> Thanks for your response. I think I was just confused because you had a 
> typo in your message that said "dumping the files in gtf format".
> 
> I tested the command you gave me, which actually produces the same 
> output as the command I used previously with gtfToGenePred (except with 
> an index column at the beginning). Consequently, several transcripts are 
> still missing compared to what is available on the genome browser for
> wgEncodeGencodeAutoV4.gtf.gz (described in my original email).
> 
> Thanks,
> Pouya
> 
> On 08/30/2010 11:13 AM, Katrina Learned wrote:
> > Hi Pouya,
> >
> > I am sorry if my answer wasn't clear. At this time, we do not have these
> > files in genePred format available for download. It is, however,
> > something our management is considering. The command provided in my
> > previous email should convert the files into genePred format.
> >
> > In the future, please direct your questions to the genome mailing list
> > at [email protected] -- our moderated forum for user questions and
> > discussion. You will likely get a quicker response to your question.
> >
> > Katrina Learned
> > UCSC Genome Bioinformatics Group
> >
> > Pouya Kheradpour wrote, On 08/28/10 10:36:
> >> Hi Katrina,
> >>
> >> I will look at that command, thanks. I agree about downloading and
> >> working with files locally, but where can I download the raw file in
> >> gp format (that is what I want... not gtf).
> >>
> >> Thanks!
> >> Pouya
> >>
> >> On 08/27/2010 07:03 PM, Katrina Learned wrote:
> >>> Hi Pouya,
> >>>
> >>> Please note that this Gencode Genes V4 has not been through our QA
> >>> process yet. So please keep that in mind as you are using the data.
> >>> One of our engineers has suggested that this command should work for
> >>> you:
> >>>
> >>> gunzip -c wgEncodeGencodeAutoV4.gtf.gz | ldHgGene -gtf -genePredExt hg19
> >>> wgEncodeGencodeAutoV4 stdin -out=wgEncodeGencodeAutoV4.genePred
> >>>
> >>> Here's the information on -out from ldHgGene usage statement:
> >>> -out=gpfile write output, in genePred format, instead of loading table.
> >>> Database is ignored.
> >>>
> >>> Regarding your question about the table browser being fixed, for large
> >>> files like this we feel that it is actually better to download them and
> >>> work with them locally. However, I have asked our management to consider
> >>> dumping the files in gtf format on our download server as we do with
> >>> many of our other tables.
> >>>
> >>> Please don't hesitate to contact the mail list again if you have any
> >>> further questions.
> >>>
> >>> Katrina Learned
> >>> UCSC Genome Bioinformatics Group
> >>>
> >>> Pouya Kheradpour wrote, On 08/24/10 14:08:
> >>>> My goal is to have both wgEncodeGencodeManualV4 and
> >>>> wgEncodeGencodeAutoV4 in GenePred format.
> >>>>
> >>>> I tried to download the wgEncodeGencodeManualV4 table from the test
> >>>> browser. For some reason when downloading it gets stuck after
> >>>> downloading and the file is cutoff (chromosomes 17-22 are completely
> >>>> missing, chromosome 16 is there partially; after exactly 409600b =
> >>>> 400kb). This happens reproducibly across multiple
> >>>> computers/networks/operating systems. I also want the
> >>>> wgEncodeGencodeAutoV4, which appears to download ok.
> >>>>
> >>>> I have had this sort of problem before (where downloads from the table
> >>>> browser would get stuck). I am not sure what causes them. Is there a
> >>>> url from which the data can always be downloaded in flat files?
> >>>>
> >>>> I can also download the gtf version of these files from:
> >>>>
> >>>> http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGencode/wgEncodeGencode{Auto,Manual}V4.gtf.gz
> >>>>
> >>>> But when I try to convert the AutoV4 file I get several errors:
> >>>>
> >>>> gunzip -c wgEncodeGencodeAutoV4.gtf.gz | gtfToGenePred -allErrors
> >>>> -genePredExt /dev/stdin /dev/stdout
> >>>>
> >>>> ... [snip]
> >>>> no exons defined for 93876
> >>>> no exons defined for 93875
> >>>> no exons defined for 93874
> >>>> no exons defined for 115098
> >>>> no exons defined for 27940
> >>>> no exons defined for 29602
> >>>> no exons defined for 29603
> >>>> no exons defined for 10879
> >>>> 622 errors
> >>>>
> >>>> and these genes are missing from the final output (although they are
> >>>> present in the wgEncodeGencodeAutoV4 I download from the test browser).
> >>>>
> >>>> I was wondering what the command used to convert the gtf files above
> >>>> to GenePred actually was. Also, can the table browser be repaired?
> >>>>
> >>>> Right now I am using wgEncodeGencodeAutoV4 from the table browser and
> >>>> wgEncodeGencodeManualV4 converted with gtfToGenePred, but it would be
> >>>> nice to have a more consistent way to set it all up.
> >>>>
> >>>> Thanks,
> >>>> Pouya
> >>>>
> >>>> _______________________________________________
> >>>> Genome maillist - [email protected]
> >>>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
> >>>
> >>
> >
> 
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
