Re: [Bioc-devel] BSgenome changes

2020-08-14 Thread Hervé Pagès

Hi Felix,

On 8/13/20 21:43, Felix Ernst wrote:

Hi Leonard, Hi Herve,

I followed your conversation, since I have noticed the same problem. Thanks, 
Herve, for the explanation of the recent changes on hg19.

The GRCh37.p13 report states in its last line:

MT  assembled-molecule  MT  Mitochondrion  J01415.2  =  NC_012920.1  non-nuclear  16569  chrM

Since the last column is called "UCSC-style-name", wouldn't that mean that 
chrM, and not chrMT, has to be renamed to MT?


This is a mistake in the sequence report for GRCh37.p13. GRCh37.p13:MT 
is the same as hg19:chrMT, not hg19:chrM.


hg19:chrM and hg19:chrMT are **not** the same sequences. The former is 
NC_001807 and has length 16571 and the latter is NC_012920.1 and has 
length 16569.
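
A minimal shell sketch of this cross-check, using only the record and the two lengths quoted in this thread. The column positions are read off the quoted record, and the helper names summarize_record and hg19_name_for_length are hypothetical, made up for illustration:

```shell
#!/bin/sh
# Record copied from the GRCh37.p13 report line quoted in this thread
# (whitespace-separated here; an assumption about the layout of the real file).
record='MT assembled-molecule MT Mitochondrion J01415.2 = NC_012920.1 non-nuclear 16569 chrM'

# Print sequence name, RefSeq accession, length and UCSC-style-name
# (columns 1, 7, 9 and 10 of the record).
summarize_record() {
    printf '%s\n' "$1" | awk '{ print $1, $7, $9, $10 }'
}

# Map a sequence length to the hg19 sequence that has this length,
# using the two lengths stated in this thread.
hg19_name_for_length() {
    case "$1" in
        16571) echo chrM ;;
        16569) echo chrMT ;;
        *)     echo unknown ;;
    esac
}

summarize_record "$record"
# prints: MT NC_012920.1 16569 chrM

len=$(printf '%s\n' "$record" | awk '{ print $9 }')
echo "length $len matches hg19:$(hg19_name_for_length "$len")"
# prints: length 16569 matches hg19:chrMT
```

The record's own length and accession point at chrMT, which is why the chrM in its last column looks like an error in the report.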


Yes, seqlevelsStyle() is sorting out all this mess for you ;-)

Cheers,
H.



Thanks again for the explanation.

Cheers,
Felix

-----Original Message-----
From: Bioc-devel On Behalf Of Hervé Pagès
Sent: Friday, 14 August 2020 01:08
To: Leonard Goldstein; bioc-devel@r-project.org
Cc: charlotte.sone...@fmi.ch
Subject: Re: [Bioc-devel] BSgenome changes

Hi Leonard,

On 8/12/20 15:22, Leonard Goldstein via Bioc-devel wrote:

Dear Bioc team,

I'm following up on this recent GitHub issue. Please see the issue for more
details and code examples.

It looks like changes in Bioc devel result in two copies of the
mitochondrial chromosome for BSgenome.Hsapiens.UCSC.hg19 -- one named
chrM like in previous package versions (length 16571) and one named
chrMT (length 16569).

When using seqlevelsStyle() to change chromosome names from UCSC to
NCBI format, this results in new behavior -- in the past chrM was
simply renamed MT, now the different sequence chrMT is used. Is this intended?


Absolutely intended.

There is a long story behind the unfortunate fate of the mitochondrial 
chromosome in hg19. I'll try to keep it short.

When the UCSC folks released the hg19 browser more than 10 years ago, they 
based it on assembly GRCh37:


https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13

See sequence report for GRCh37:

  
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.13_GRCh37/GCF_000001405.13_GRCh37_assembly_report.txt


For some mysterious reason GRCh37 didn't include the mitochondrial chromosome 
so the UCSC folks decided to use mitochondrial sequence
NC_001807 and called it chrM.

However, UCSC has recently decided to base hg19 on GRCh37.p13 instead of 
GRCh37. A rather surprising move after many years of hg19 being based on the 
latter.


https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.25/

See sequence report for GRCh37.p13:

  
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_assembly_report.txt


Note that GRCh37.p13 does include the mitochondrial chromosome. It's called MT 
in the official sequence report above and chrMT in hg19.

At the same time the UCSC folks decided to keep chrM, so now hg19 contains 2 
mitochondrial sequences: chrM and chrMT. Previously it had only one: chrM.

So what you see in BioC devel in BSgenome.Hsapiens.UCSC.hg19 and with
seqlevelsStyle(genome) is only reflecting this. In particular
seqlevelsStyle(genome) <- "NCBI" now does the following:

- Rename chrMT -> MT.

- chrM does NOT get renamed. There is no point in renaming this sequence 
because it has no equivalent in GRCh37.p13.
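
As a toy model of the two rules above (not how seqlevelsStyle() is implemented), the mapping for the two mitochondrial sequences can be written down in a few lines of shell. The function name ucsc_to_ncbi is hypothetical and only the two names discussed in this thread are handled:

```shell
#!/bin/sh
# Toy model of the UCSC -> NCBI renaming described in this message.
# ucsc_to_ncbi is a hypothetical helper, not part of any Bioconductor package.
ucsc_to_ncbi() {
    case "$1" in
        chrMT) echo MT ;;     # has an equivalent in GRCh37.p13
        chrM)  echo chrM ;;   # no equivalent in GRCh37.p13: left unchanged
        *)     echo "$1" ;;   # other sequences are out of scope for this sketch
    esac
}

for name in chrM chrMT; do
    printf '%s -> %s\n' "$name" "$(ucsc_to_ncbi "$name")"
done
# prints:
# chrM -> chrM
# chrMT -> MT
```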

Hope this helps,

H.



Leonard


___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



--

Re: [Bioc-devel] Bioconductor Package Submission - Removing git pack objects

2020-08-14 Thread Nitesh Turaga
Hi,

Please make sure the bare repository is updated with the “removed” artifacts of 
the .pack files. From what I understand, the .pack files should also be removed 
by the BFG cleaner. 

Try following this:
https://stackoverflow.com/questions/11050265/remove-large-pack-file-created-by-git

These commands might help:

git for-each-ref --format='delete %(refname)' refs/original | git update-ref --stdin
git reflog expire --expire=now --all
git gc --aggressive --prune=now

You seem to be doing everything that we would already recommend to a novice 
user with the same issue. 

Try this and if it doesn’t work, I’ll investigate more.
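
To see whether any large blobs are still reachable after the cleanup, something along these lines may help. It is a sketch, not an official Bioconductor recipe: largest_blobs is a hypothetical helper name, and it only uses standard git plumbing (git rev-list --objects and git cat-file --batch-check):

```shell
#!/bin/sh
# List the n largest blobs reachable from any ref, one "size path" per line,
# largest first. largest_blobs is a hypothetical helper for illustration.
largest_blobs() {
    n=${1:-10}
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        sed -n 's/^blob //p' |
        sort -rn |
        head -n "$n"
}

# Only run when the current directory is inside a git repository.
if git rev-parse --git-dir >/dev/null 2>&1; then
    largest_blobs 5
    git count-objects -vH   # the size-pack line is the total size of all packs
fi
```

If nothing above the 5M limit shows up here, the remaining .pack file is just the repository's normal object store rather than a leftover large file.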

Best,

Nitesh 


> On Aug 14, 2020, at 5:51 AM, Joseph Lee Jing Xian wrote:
> 
> To whom it may concern:
> 
> I am Joseph, writing on behalf of the developers of proActiv, a package used 
> to infer promoter activity from RNA-seq data.
> We are in the process of preparing the package for Bioconductor submission. So 
> far, the package has cleared R CMD check with no errors or warnings, and 
> cleared R CMD BiocCheck with no errors. However, we're still getting one 
> warning from R CMD BiocCheck regarding individual file size. In particular, 
> we have a couple of offending files (.bed, .rda), one of them being a git 
> pack object (.pack).
> We have followed the suggested pipeline to remove large files with BFG 
> Repo-cleaner:
> 
>> git clone --mirror https://github.com/GoekeLab/proActiv.git
> 
> 
>> java -jar bfg-1.13.0.jar --strip-blobs-bigger-than 5M --no-blob-protection proActiv.git
> 
>> cd proActiv.git
> 
>> git reflog expire --expire=now --all && git gc --prune=now --aggressive
> 
> This removes the individual files (e.g. .bed, .rda) in commit history that 
> were bigger than the stipulated 5M limit, as expected.
> However, cloning the package locally from the bare repository and running R 
> CMD BiocCheck on it still throws the same warning, but with the git pack 
> object as the only offending file.
> How should one go about dealing with hidden git pack objects so that the 
> Bioconductor checks can be passed successfully?
> 
> Thanks,
> Joseph
> 


Re: [Bioc-devel] Bioconductor Package Submission - Removing git pack objects

2020-08-14 Thread Manders-2, F.M.
Hi Joseph,

When you run 'R CMD build' you should get a release tarball, which shouldn't 
contain this .git folder. If you perform the 'R CMD check' on the tarball, you 
shouldn't get this warning. If you still get the warning, or have other files 
that generate this warning, then you could try to add them to a '.Rbuildignore' 
file.
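
A small sketch of how one might confirm that the built tarball carries no .git content before checking it. The function name tarball_has_git is hypothetical, and the tarball name in the commented workflow is an assumption, since it depends on the Version field in DESCRIPTION:

```shell
#!/bin/sh
# Succeed if the given .tar.gz lists any path inside a .git directory.
# tarball_has_git is a hypothetical helper for illustration.
tarball_has_git() {
    tar -tzf "$1" | grep -Eq '(^|/)\.git/'
}

# Possible workflow (needs R and BiocCheck installed, so it is left as comments):
#   R CMD build proActiv
#   tarball_has_git proActiv_*.tar.gz || echo "no .git content in tarball"
#   R CMD BiocCheck proActiv_*.tar.gz
```

R CMD build leaves version-control directories such as .git out of the tarball by default. For other large files, each line of '.Rbuildignore' is a regular expression matched against paths relative to the package root, for example a hypothetical entry ^inst/extdata/large\.bed$.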

I hope this helps,
Freek Manders

On 14/08/2020, 13:26, "Bioc-devel on behalf of Joseph Lee Jing Xian" wrote:

To whom it may concern:

I am Joseph, writing on behalf of the developers of proActiv, a package used 
to infer promoter activity from RNA-seq data.
We are in the process of preparing the package for Bioconductor submission. 
So far, the package has cleared R CMD check with no errors or warnings, and 
cleared R CMD BiocCheck with no errors. However, we're still getting one 
warning from R CMD BiocCheck regarding individual file size. In particular, we 
have a couple of offending files (.bed, .rda), one of them being a git pack 
object (.pack).
We have followed the suggested pipeline to remove large files with BFG 
Repo-cleaner:

> git clone --mirror https://github.com/GoekeLab/proActiv.git


> java -jar bfg-1.13.0.jar --strip-blobs-bigger-than 5M --no-blob-protection proActiv.git

> cd proActiv.git

> git reflog expire --expire=now --all && git gc --prune=now --aggressive

This removes the individual files (e.g. .bed, .rda) in commit history that 
were bigger than the stipulated 5M limit, as expected.
However, cloning the package locally from the bare repository and running R 
CMD BiocCheck on it still throws the same warning, but with the git pack object 
as the only offending file.
How should one go about dealing with hidden git pack objects so that the 
Bioconductor checks can be passed successfully?

Thanks,
Joseph



[Bioc-devel] Bioconductor Package Submission - Removing git pack objects

2020-08-14 Thread Joseph Lee Jing Xian
To whom it may concern:

I am Joseph, writing on behalf of the developers of proActiv, a package used 
to infer promoter activity from RNA-seq data.
We are in the process of preparing the package for Bioconductor submission. So 
far, the package has cleared R CMD check with no errors or warnings, and 
cleared R CMD BiocCheck with no errors. However, we're still getting one 
warning from R CMD BiocCheck regarding individual file size. In particular, we 
have a couple of offending files (.bed, .rda), one of them being a git pack 
object (.pack).
We have followed the suggested pipeline to remove large files with BFG 
Repo-cleaner:

> git clone --mirror https://github.com/GoekeLab/proActiv.git


> java -jar bfg-1.13.0.jar --strip-blobs-bigger-than 5M --no-blob-protection proActiv.git

> cd proActiv.git

> git reflog expire --expire=now --all && git gc --prune=now --aggressive

This removes the individual files (e.g. .bed, .rda) in commit history that were 
bigger than the stipulated 5M limit, as expected.
However, cloning the package locally from the bare repository and running R CMD 
BiocCheck on it still throws the same warning, but with the git pack object as 
the only offending file.
How should one go about dealing with hidden git pack objects so that the 
Bioconductor checks can be passed successfully?

Thanks,
Joseph
