Re: which database for gene alignment data ?

Frank Austin Nothaft Wed, 10 Jun 2015 07:22:42 -0700

Hi Roni,

Yes, these are exposed as public APIs. If you want, you can run them inside the 
adam-shell (which is just a wrapper for the Spark shell, with the ADAM 
libraries on the classpath).


> Also , I need to save all my intermediate data.  Seems like ADAM stores data 
> in Parquet on HDFS.
> I want to save something in an external database, so that  we can re-use the 
> saved data in multiple ways by multiple people. 


The Parquet data can be accessed via Hive, Spark SQL, Impala, etc. 
Additionally, from ADAM, you can export most data out to legacy genomics 
formats. I'm not sure, though, whether we support that for feature data right 
now; the feature APIs are fairly new.
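
Once the features are visible as a table in one of those SQL engines, the 
overlap query (query 3 in your earlier email) comes down to a range predicate 
on BED's 0-based, half-open coordinates. Here is a minimal sketch of that 
predicate. It uses Python's built-in sqlite3 purely as a stand-in so that it 
runs anywhere; I have not run this exact query against Hive, Spark SQL, or 
Impala. The table and column names are hypothetical, and the peak.3 and 
peak.4 rows are made up for illustration.

```python
import sqlite3

# Stand-in engine: an in-memory SQLite database.
# Table names (peaks, genes) and column names are hypothetical.
conn = sqlite3.connect(":memory:")
schema = "(chrom TEXT, chrom_start INTEGER, chrom_end INTEGER, name TEXT)"
conn.execute("CREATE TABLE peaks " + schema)
conn.execute("CREATE TABLE genes " + schema)

conn.executemany("INSERT INTO peaks VALUES (?, ?, ?, ?)", [
    ("chr1", 713776, 714375, "peak.1"),       # from the example .bed rows
    ("chr1", 752401, 753000, "peak.2"),       # from the example .bed rows
    ("chr10", 48514000, 48514785, "peak.3"),  # made up: ends where MAPK8 starts
    ("chr10", 49000000, 49000600, "peak.4"),  # made up: lies inside MAPK8
])
conn.executemany("INSERT INTO genes VALUES (?, ?, ?, ?)", [
    ("chr10", 48514785, 49604641, "MAPK8"),   # from the reference example row
])

# Half-open intervals [s1, e1) and [s2, e2) on the same chromosome overlap
# exactly when s1 < e2 and s2 < e1. Intervals that merely touch do not overlap.
OVERLAP_SQL = """
SELECT p.name, g.name
FROM peaks AS p
JOIN genes AS g
  ON p.chrom = g.chrom
 AND p.chrom_start < g.chrom_end
 AND g.chrom_start < p.chrom_end
ORDER BY p.name
"""

overlaps = conn.execute(OVERLAP_SQL).fetchall()
print(overlaps)
```

Note the strict inequalities: peak.3 ends at 48514785, which is exactly where 
MAPK8 starts, so under half-open coordinates the two do not share a base and 
the pair is not returned.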

Regards,

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466

On Jun 9, 2015, at 9:21 PM, roni <roni.epi...@gmail.com> wrote:

> Hi Frank,
> Thanks for the reply. I downloaded ADAM and built it, but it does not seem to 
> list this function among the command-line options.
> Are these exposed as a public API that I can call from code?
> 
> Also , I need to save all my intermediate data.  Seems like ADAM stores data 
> in Parquet on HDFS.
> I want to save something in an external database, so that  we can re-use the 
> saved data in multiple ways by multiple people. 
> Any suggestions on the DB selection or keeping data centralized for use by 
> multiple distinct groups?
> Thanks
> -Roni
> 
> 
> 
> On Mon, Jun 8, 2015 at 12:47 PM, Frank Austin Nothaft <fnoth...@berkeley.edu> 
> wrote:
> Hi Roni,
> 
> We have a full suite of genomic feature parsers that can read BED, 
> narrowPeak, GATK interval lists, and GTF/GFF into Spark RDDs in ADAM. 
> Additionally, we have support for efficient overlap joins (query 3 in your 
> email below). You can load the genomic features with 
> ADAMContext.loadFeatures. We have two tools for the overlap computation: you 
> can use a BroadcastRegionJoin if one of the datasets you want to overlap is 
> small or a ShuffleRegionJoin if both datasets are large.
> 
> Regards,
> 
> Frank Austin Nothaft
> fnoth...@berkeley.edu
> fnoth...@eecs.berkeley.edu
> 202-340-0466
> 
> On Jun 8, 2015, at 9:39 PM, roni <roni.epi...@gmail.com> wrote:
> 
>> Sorry for the delay.
>> The files (called .bed files) have a format like this:
>> Chromosome  start   end     feature  score  strand
>> chr1        713776  714375  peak.1   599    +
>> chr1        752401  753000  peak.2   599    +
>> The mandatory fields are 
>> 
>> chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or 
>> scaffold (e.g. scaffold10671).
>> chromStart - The starting position of the feature in the chromosome or 
>> scaffold. The first base in a chromosome is numbered 0.
>> chromEnd - The ending position of the feature in the chromosome or scaffold. 
>> The chromEnd base is not included in the display of the feature. For 
>> example, the first 100 bases of a chromosome are defined as chromStart=0, 
>> chromEnd=100, and span the bases numbered 0-99.
>> There can be more data as described - 
>> https://genome.ucsc.edu/FAQ/FAQformat.html#format1
>> Typical use cases are:
>> 1. Find the features between given start and end positions.
>> 2. Find features which have overlapping start and end points with another 
>> feature.
>> 3. Read external (reference) data which will have a similar format 
>> (chr10  48514785  49604641  MAPK8  49514785  +) and find all the data 
>> points which overlap with the other .bed files.
>> 
>> The data is huge. .bed files can range from 0.5 GB to 5 GB (or more).
>> I was thinking of using Cassandra, but I am not sure if the overlapping 
>> queries can be supported and will be fast enough.
>> 
>> Thanks for the help
>> -Roni
>> 
>> On Sat, Jun 6, 2015 at 7:03 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>> Can you describe your use case in a bit more detail since not all people on 
>> this mailing list are familiar with gene sequencing alignments data ?
>> 
>> Thanks
>> 
>> On Fri, Jun 5, 2015 at 11:42 PM, roni <roni.epi...@gmail.com> wrote:
>> I want to use Spark to read compressed .bed files containing gene 
>> sequencing alignment data. 
>> I want to store the .bed file data in a database and then use external gene 
>> expression data to find overlaps etc. Which database is best for this?
>> Thanks
>> -Roni
>> 
>> 
>> 
> 
>
