Hi Roni,

We have a full suite of genomic feature parsers in ADAM that can read BED,
narrowPeak, GATK interval lists, and GTF/GFF into Spark RDDs. We also support
efficient overlap joins (query 3 in your email below). You can load the
genomic features with ADAMContext.loadFeatures. There are two tools for the
overlap computation: use a BroadcastRegionJoin if one of the datasets you want
to overlap is small, or a ShuffleRegionJoin if both datasets are large.
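
To make the broadcast case concrete, here is a minimal plain-Python sketch of
the idea. It is not ADAM's API: the function names (parse_bed_line, overlaps,
broadcast_overlap_join) and the gene.X reference record are made up for
illustration. It assumes BED-style half-open intervals, so a feature that
starts exactly where another ends does not overlap it.

```python
from collections import defaultdict


def parse_bed_line(line):
    """Parse the three mandatory BED fields plus an optional feature name."""
    fields = line.split()
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    name = fields[3] if len(fields) > 3 else None
    return chrom, start, end, name


def overlaps(a_start, a_end, b_start, b_end):
    # BED intervals are half-open [start, end): touching intervals do not overlap.
    return a_start < b_end and b_start < a_end


def broadcast_overlap_join(small, large):
    """Pair each feature in `large` with every overlapping feature in `small`.

    `small` plays the role of the dataset that would be copied to every
    worker; here it is just grouped by chromosome in local memory.
    """
    by_chrom = defaultdict(list)
    for feature in small:
        by_chrom[feature[0]].append(feature)

    pairs = []
    for feature in large:
        chrom, start, end, _ = feature
        for ref in by_chrom.get(chrom, []):
            if overlaps(start, end, ref[1], ref[2]):
                pairs.append((ref, feature))
    return pairs


# The two peaks from the email below, plus a hypothetical reference feature.
peaks = [
    parse_bed_line("chr1 713776 714375 peak.1 599 +"),
    parse_bed_line("chr1 752401 753000 peak.2 599 +"),
]
reference = [parse_bed_line("chr1 714000 752401 gene.X 0 +")]

result = broadcast_overlap_join(reference, peaks)
for ref, peak in result:
    print(ref[3], "overlaps", peak[3])
```

The linear scan per chromosome is only reasonable while the small side really
is small; when both sides are large, a sort- or partition-based approach is
what the shuffle variant is meant for.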

Regards,

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466

On Jun 8, 2015, at 9:39 PM, roni <roni.epi...@gmail.com> wrote:

> Sorry for the delay.
> The files (called .bed files) have a format like this:
> Chromosome start  end    feature score  strand 
> chr1   713776  714375  peak.1  599    +
> chr1   752401  753000  peak.2  599    +
> The mandatory fields are 
> 
> chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold 
> (e.g. scaffold10671).
> chromStart - The starting position of the feature in the chromosome or 
> scaffold. The first base in a chromosome is numbered 0.
> chromEnd - The ending position of the feature in the chromosome or scaffold. 
> The chromEnd base is not included in the display of the feature. For example, 
> the first 100 bases of a chromosome are defined as chromStart=0, 
> chromEnd=100, and span the bases numbered 0-99.
> There can be more data as described - 
> https://genome.ucsc.edu/FAQ/FAQformat.html#format1
> Many times the use cases are like 
> 1. find the features between given start and end positions
> 2. Find features which have overlapping start and end points with another 
> feature.
> 3. read external (reference) data which will have similar format (chr10       
> 48514785        49604641        MAPK8   49514785        +) and find all the 
> data points which are overlapping with the other  .bed files.
> 
> The data is huge. .bed files can range from 0.5 GB to 5 GB (or more).
> I was thinking of using Cassandra, but I am not sure if the overlapping
> queries can be supported and will be fast enough.
> 
> Thanks for the help
> -Roni
> 
> On Sat, Jun 6, 2015 at 7:03 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> Can you describe your use case in a bit more detail since not all people on 
> this mailing list are familiar with gene sequencing alignments data ?
> 
> Thanks
> 
> On Fri, Jun 5, 2015 at 11:42 PM, roni <roni.epi...@gmail.com> wrote:
> I want to use spark for reading compressed .bed file for reading gene 
> sequencing alignments data. 
> I want to store bed file data in db and then use external gene expression 
> data to find overlaps etc, which database is best for it ?
> Thanks
> -Roni
> 
> 
> 
