Hi Roni,

ADAM has a full suite of genomic feature parsers that can read BED, narrowPeak, GATK interval lists, and GTF/GFF into Spark RDDs. Additionally, we support efficient overlap joins (query 3 in your email below). You can load the genomic features with ADAMContext.loadFeatures. We have two tools for the overlap computation: use a BroadcastRegionJoin if one of the datasets you want to overlap is small, or a ShuffleRegionJoin if both datasets are large.
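To make the overlap query concrete, here is a minimal plain-Python sketch of what an overlap join computes on the half-open BED coordinates described in your email below. It is not ADAM code and does not use the ADAM API: `Feature`, `parse_bed_line`, `overlaps`, and `broadcast_style_join` are hypothetical names, and the `region.A` reference interval is made up for illustration. It only mirrors the rough idea of a broadcast-style join, assuming the small dataset fits in memory: index the small dataset by chromosome, then stream the large one past it.

```python
from collections import defaultdict
from typing import Iterable, Iterator, NamedTuple, Tuple


class Feature(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED chromStart)
    end: int    # exclusive (BED chromEnd)
    name: str


def parse_bed_line(line: str) -> Feature:
    """Parse the three mandatory BED fields plus an optional name column."""
    fields = line.split()
    name = fields[3] if len(fields) > 3 else ""
    return Feature(fields[0], int(fields[1]), int(fields[2]), name)


def overlaps(a: Feature, b: Feature) -> bool:
    """Half-open intervals overlap when each one starts before the other ends."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def broadcast_style_join(
    small: Iterable[Feature], large: Iterable[Feature]
) -> Iterator[Tuple[Feature, Feature]]:
    """Index the small dataset by chromosome, then scan the large one once."""
    by_chrom = defaultdict(list)
    for f in small:
        by_chrom[f.chrom].append(f)
    for g in large:
        for f in by_chrom.get(g.chrom, []):
            if overlaps(f, g):
                yield (f, g)


# The two peaks are the sample rows from the quoted email.
peaks = [
    parse_bed_line("chr1 713776 714375 peak.1 599 +"),
    parse_bed_line("chr1 752401 753000 peak.2 599 +"),
]
# Hypothetical reference interval, invented for this example only.
reference = [parse_bed_line("chr1 714000 752000 region.A 0 +")]

pairs = list(broadcast_style_join(reference, peaks))
for ref, peak in pairs:
    print(ref.name, "overlaps", peak.name)
```

Because chromEnd is exclusive, two features that merely touch (one ending at 100, the next starting at 100) are not reported as overlapping. When both inputs are too large to hold in memory, this in-memory index stops being practical, which is the case the ShuffleRegionJoin is meant for.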

Regards,

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466

On Jun 8, 2015, at 9:39 PM, roni <roni.epi...@gmail.com> wrote:

> Sorry for the delay.
> The files (called .bed files) have a format like:
>
> Chromosome  start   end     feature  score  strand
> chr1        713776  714375  peak.1   599    +
> chr1        752401  753000  peak.2   599    +
>
> The mandatory fields are:
>
> chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold
> (e.g. scaffold10671).
> chromStart - The starting position of the feature in the chromosome or
> scaffold. The first base in a chromosome is numbered 0.
> chromEnd - The ending position of the feature in the chromosome or scaffold.
> The chromEnd base is not included in the display of the feature. For example,
> the first 100 bases of a chromosome are defined as chromStart=0,
> chromEnd=100, and span the bases numbered 0-99.
>
> There can be more data, as described at
> https://genome.ucsc.edu/FAQ/FAQformat.html#format1
>
> Many times the use cases are like:
> 1. Find the features between given start and end positions.
> 2. Find features which have overlapping start and end points with another
> feature.
> 3. Read external (reference) data which will have a similar format (chr10
> 48514785 49604641 MAPK8 49514785 +) and find all the
> data points which overlap with the other .bed files.
>
> The data is huge. .bed files can range from 0.5 GB to 5 GB (or more).
> I was thinking of using Cassandra, but I am not sure if the overlapping queries can
> be supported and will be fast enough.
>
> Thanks for the help
> -Roni
>
> On Sat, Jun 6, 2015 at 7:03 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> Can you describe your use case in a bit more detail, since not all people on
> this mailing list are familiar with gene sequencing alignment data?
>
> Thanks
>
> On Fri, Jun 5, 2015 at 11:42 PM, roni <roni.epi...@gmail.com> wrote:
> I want to use Spark for reading compressed .bed files containing gene
> sequencing alignment data.
> I want to store the .bed file data in a database and then use external gene expression
> data to find overlaps etc. Which database is best for this?
> Thanks
> -Roni