[galaxy-dev] Tool Development DELLY

2015-01-09 Thread Marco Albuquerque
Hello Galaxy,

I'm currently working on adding some tools and am having an issue with
DELLY.

So, I am under the impression that BAM indexing happens automatically when a
BAM is uploaded. However, there is no associated dataset_i.dat.bai file in
the dataset's file location in my local instance of Galaxy.

There is, however, index metadata which seems to be created, but the two are
not being linked together. What I mean to say is that DELLY errors out with
"cannot find BAM index".

I was curious whether there is a specific way developers are supposed to
work around this? Basically, we want to avoid having to provide both the BAM
and the BAM index and symbolically linking a new dataset, because we know
you have already created a better implementation; we just want to use it and
don't know how.

Any help is greatly appreciated,

Marco


___
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/

[galaxy-dev] Fasta Datatype Adjustments

2015-01-19 Thread Marco Albuquerque
Hello Galaxy,

I have a small dilemma regarding the fasta datatype.

Currently, to my knowledge, the fasta datatype does not specify any
metadata. I was curious how I should go about changing the fasta datatype if
I wanted to include the .fai and .dict files as metadata? Basically, I want
to avoid unnecessarily recreating the same files, which MuTect does
automatically. It seems very inefficient to have to produce these files
every time a user calls MuTect in Galaxy (i.e. MuTect can't find them, so it
makes them itself).

I guess my question is: what are the consequences of adjusting the current
implementation of the fasta datatype? Will users be able to pull this tool
easily? (I.e. when someone uploads a tool and that tool uses a different
fasta definition, how does Galaxy handle this?) Could this new fasta
declaration potentially be adopted by Galaxy? Or should I just define a new
mtfasta datatype specifically for MuTect?
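For context on what would be stored: a .fai is just one tab-separated record per sequence holding name, sequence length, byte offset of the first base, bases per line, and bytes per line. A stand-alone sketch of deriving those records (not Galaxy code; a set_meta-style hook could compute something like this once at upload time):

```python
def build_fai(fasta_text):
    """Compute samtools-faidx-style records (name, length, offset,
    linebases, linewidth) from FASTA text. Assumes each record uses a
    uniform line width, as samtools faidx does."""
    records = []
    offset = 0          # running byte offset into the file
    name = None
    seq_len = seq_offset = linebases = linewidth = 0
    for line in fasta_text.splitlines(keepends=True):
        if line.startswith(">"):
            if name is not None:
                records.append((name, seq_len, seq_offset, linebases, linewidth))
            name = line[1:].split()[0]      # sequence name; description dropped
            offset += len(line)
            seq_offset, seq_len, linebases, linewidth = offset, 0, 0, 0
        else:
            bases = len(line.rstrip("\n"))
            if linebases == 0 and bases:
                linebases, linewidth = bases, len(line)
            seq_len += bases
            offset += len(line)
    if name is not None:
        records.append((name, seq_len, seq_offset, linebases, linewidth))
    return records
```

Persisting those five numbers (and similarly the .dict fields) as metadata would let a wrapper hand MuTect pre-built index files instead of regenerating them per run.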

Let me know if you have any advice,

Marco



[galaxy-dev] Parallelism using metadata

2015-03-02 Thread Marco Albuquerque
Hello Galaxy Dev,

I have a question regarding parallelism on a BAM file.

I have currently implemented 3 split options for the BAM datatype:

1) by_rname -> splits the BAM into files by chromosome
2) by_interval -> splits the BAM into files by a defined bp length, across
the entire genome present in the BAM file
3) by_read -> splits the BAM into files by the number of reads encountered
(if multiple files are produced, each subsequent file uses the same read
count as the first)

Now, as you can imagine, reading and writing large BAM files is a pain, and
I personally think this is not the best solution for Galaxy.
What I was hoping to implement (but don't know how) is a new metadata
option on bam (bam.metadata.bam_interval) which would specify the interval
without creating a new file. Essentially, I would create a symbolic link to
the old large file and then update metadata.bam_interval to hold a string of
the form chrom:start-end, which could then be passed to any of the many
tools that accept an interval as an option (for example, samtools view).
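As a stand-alone sketch of that idea (illustrative names, not Galaxy code), the region strings for the symlinked copies could be generated like this and passed straight to samtools view:

```python
def interval_chunks(chrom_lengths, chunk_bp):
    """Yield 1-based, inclusive 'chrom:start-end' region strings covering
    each chromosome in windows of chunk_bp -- the string form that
    samtools view accepts as a region argument."""
    for chrom, length in chrom_lengths.items():
        start = 1
        while start <= length:
            end = min(start + chunk_bp - 1, length)
            yield "%s:%d-%d" % (chrom, start, end)
            start = end + 1
```

Each yielded string is what bam.metadata.bam_interval would hold for one split, e.g. for `samtools view -b in.bam chr1:1-4`.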

This would be far more efficient than my first implementation, but the thing
I don't know how to do is specify metadata at the split level.
I was hoping maybe you could direct me to an example that does this?

I have added the following to my metadata.py file:

class IntervalParameter( MetadataParameter ):

    def __init__( self, spec ):
        MetadataParameter.__init__( self, spec )
        self.rname = self.spec.get( "rname" )
        self.start = self.spec.get( "start" )
        self.end = self.spec.get( "end" )

    def to_string( self ):
        if self.rname == 'all':
            # 'all' is the default: no region restriction
            return ''
        else:
            return ''.join( [ self.rname, ':', self.start, '-', self.end ] )

And the following to my binary.py file:

### UNDER THE BAM CLASS

MetadataElement( name="bam_interval", desc="BAM Interval",
param=metadata.IntervalParameter, rname="all", start="", end="",
visible=False, optional=True)
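As a quick stand-alone check of the intended to_string behaviour (a plain dict stands in for Galaxy's metadata spec object here; this is not Galaxy code):

```python
class IntervalParameter:
    """Stand-alone mirror of the to_string logic sketched above."""
    def __init__(self, spec):
        self.rname = spec.get("rname")
        self.start = spec.get("start")
        self.end = spec.get("end")

    def to_string(self):
        if self.rname == "all":
            return ""  # default: no region restriction, i.e. the whole BAM
        return "".join([self.rname, ":", self.start, "-", self.end])

region = IntervalParameter({"rname": "chr1", "start": "1", "end": "500"}).to_string()
# region is suitable as the region argument of `samtools view`
```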


I somehow want rname="all" to be the default, but upon parallelism, I want
to be able to adjust this parameter in the split functions.

So, performing a split would actually just change the metadata of each file,
and not create sub-bams.


PLEASE HELP!!!

Marco



Re: [galaxy-dev] Parallelism using metadata

2015-03-03 Thread Marco Albuquerque
Hi John,

Thanks for your reply.

I think for the time being, I will simply create a tool that creates an
interval file, and then parallelize on this interval file.
Though I agree this would be a useful feature to include, I don't
think I am anywhere near ready to start dabbling in Galaxy's
core filesystem, as I have only been developing for a short while. I'm
hoping that I can learn more about how Galaxy works at the GCC,
and maybe then I will know how to efficiently adjust Galaxy code.

I am actually hoping to present my work at GCC15; I am part of a
project that is adding a variety of cancer genomics tools to Galaxy.

And thanks so much for the resources, I will surely look into all of these.

Much appreciated,

Marco




On 2015-03-02 9:22 AM, "John Chilton" wrote:

>Hey Marco,
>
>Thanks for the e-mail. This is an awesome idea, but I am worried it is
>very hard to do this well in Galaxy. If you create symbolic links to
>the original file - then Galaxy might delete the original file and the
>derived files would all break without warning. Galaxy does have this
>separate concept of datasets and history dataset associations so that
>a dataset can exist in more than one place simultaneously - and one
>could imagine sticking this metadata there and just sort of
>dynamically splitting up the BAM file whenever it is used in a tool or
>served out over the API - but this would be a large effort and would
>require all sorts of modifications to various parts of Galaxy.
>
>Something worth looking at is this work by Kyle Ellrott:
>
>https://bitbucket.org/galaxy/galaxy-central/pull-request/175/parameter-based-bam-file-parallelization
>
>This was a very localized effort to work some of these ideas just
>into the task-splitting framework in Galaxy. This has the advantage of
>not needing to mess with metadata and datasets, etc., all the way up
>the chain.
>
>Kyle has since abandoned that approach, but I think it is a promising
>start to something like this - and it would be much less disruptive
>than doing this with metadata and datasets (though admittedly more
>limited as well).
>
>If this will be primarily used for workflows - there are a couple of
>recent developments that might make splitting more feasible. Dannon
>introduced the ability to delete intermediate outputs from workflows a
>few releases ago - and the upcoming release (15.03) will introduce the
>ability to write tools that split up a single input into a collection.
>The existing workflow and dataset collection framework can then apply
>normal tools over every element of the collection and you can write a
>tool to merge the results. More information can be found here -
>https://bitbucket.org/galaxy/galaxy-central/pull-request/634/allow-tools-to-explicitly-produce-dataset.
>
>These common pipelines where you split up a BAM file, run a bunch of
>steps, and then merge the results will be executable in the near
>future (though 15.03 won't have workflow editor support for it - I
>will try to get to this by the following release - and you can
>manually build up workflows to do this -
>https://bitbucket.org/galaxy/galaxy-central/src/0468d285f89c799559926c94f300c42d05e8c47a/test/api/test_workflows.py?at=default#cl-544).
>
>Thanks again,
>-John
>
>
>On Fri, Feb 27, 2015 at 10:04 PM, Marco Albuquerque wrote:

Re: [galaxy-dev] Workflow 'Input Dataset' Resolution Issue

2015-10-11 Thread Marco Albuquerque
Well, my Cloudman Galaxy instance is installing 15.07 revision 990501d2d9.

Is there maybe a different revision I should be using?

Marco

On Sun, Oct 11, 2015 at 11:03 AM, John Chilton wrote:

> This sounds like it corresponds to this issue
> https://github.com/galaxyproject/galaxy/issues/776. Is it possible
> that upgrading Galaxy to the latest 15.07 fixes the issue?
>
> On Thu, Oct 8, 2015 at 7:40 PM, Marco Albuquerque wrote:
> > Hello Galaxy Dev,
> >
> > Consider the workflow found in the 'Strelka_Workflow_Sequential' image.
> >
> > Here it is in a different view:
> > 'Strelka_Sequential_Workflow_Different_View'.
> >
> > Notice that the first input dataset links to the fetch_interval tool and
> > the first preprocess tool, and the second input dataset links only to the
> > second preprocess tool.
> >
> > When you run the workflow, what actually happens is the first input
> > dataset goes to the fetch_interval tool and the second input dataset goes
> > to both preprocessing tools (look at the final two images). They have
> > essentially executed the same tool twice, where it should be one
> > execution with a normal file and one execution with a tumour file.
> >
> > Why is this happening? What can I do to fix this?
> >
> > Thanks,
> >
> > Marco Albuquerque
> >
>

Re: [galaxy-dev] Workflow 'Input Dataset' Resolution Issue

2015-10-11 Thread Marco Albuquerque
So, in other words, there is no release that I can update my Cloudman
instance to?

Marco

On Sun, Oct 11, 2015 at 3:49 PM, John Chilton wrote:

> Yeah - release_15.07 is a moving target that bug fixes get added to.
> 990501d2d9 doesn't seem to have this fix yet.
>
> On Sun, Oct 11, 2015 at 5:09 PM, Marco Albuquerque wrote: