Hi Frank,

Thanks for the reply. I downloaded ADAM and built it, but it does not seem to list these functions among the command-line options. Are they exposed as a public API that I can call from code?
Also, I need to save all my intermediate data. It seems that ADAM stores data in Parquet on HDFS. I want to save some of it in an external database so that multiple people can reuse the saved data in multiple ways. Any suggestions on the choice of database, or on keeping the data centralized for use by multiple distinct groups?

Thanks
-Roni

On Mon, Jun 8, 2015 at 12:47 PM, Frank Austin Nothaft <fnoth...@berkeley.edu> wrote:

> Hi Roni,
>
> We have a full suite of genomic feature parsers that can read BED,
> narrowPeak, GATK interval lists, and GTF/GFF into Spark RDDs in ADAM
> <https://github.com/bigdatagenomics/adam>. Additionally, we have support
> for efficient overlap joins (query 3 in your email below). You can load the
> genomic features with ADAMContext.loadFeatures
> <https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala#L438>.
> We have two tools for the overlap computation: you can use a
> BroadcastRegionJoin
> <https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/BroadcastRegionJoin.scala>
> if one of the datasets you want to overlap is small, or a ShuffleRegionJoin
> <https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ShuffleRegionJoin.scala>
> if both datasets are large.
>
> Regards,
>
> Frank Austin Nothaft
> fnoth...@berkeley.edu
> fnoth...@eecs.berkeley.edu
> 202-340-0466
>
> On Jun 8, 2015, at 9:39 PM, roni <roni.epi...@gmail.com> wrote:
>
> Sorry for the delay.
> The files (called .bed files) have a format like this:
>
> Chromosome start end feature score strand
>
> chr1 713776 714375 peak.1 599 +
> chr1 752401 753000 peak.2 599 +
>
> The mandatory fields are:
>
> 1. chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or
> scaffold (e.g. scaffold10671).
> 2. chromStart - The starting position of the feature in the chromosome or
> scaffold. The first base in a chromosome is numbered 0.
> 3. chromEnd - The ending position of the feature in the chromosome or
> scaffold. The *chromEnd* base is not included in the display of the feature.
> For example, the first 100 bases of a chromosome are defined as
> *chromStart=0, chromEnd=100*, and span the bases numbered 0-99.
>
> There can be more fields, as described at
> https://genome.ucsc.edu/FAQ/FAQformat.html#format1
>
> Typical use cases are:
> 1. Find the features between given start and end positions.
> 2. Find features whose start and end points overlap with another
> feature.
> 3. Read external (reference) data in a similar format (chr10
> 48514785 49604641 MAPK8 49514785 +) and find all the
> data points that overlap with the other .bed files.
>
> The data is huge. The .bed files can range from 0.5 GB to 5 GB (or more).
> I was thinking of using Cassandra, but I am not sure whether the overlap
> queries can be supported and will be fast enough.
>
> Thanks for the help
> -Roni
>
>
> On Sat, Jun 6, 2015 at 7:03 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Can you describe your use case in a bit more detail, since not all people
>> on this mailing list are familiar with gene sequencing alignment data?
>>
>> Thanks
>>
>> On Fri, Jun 5, 2015 at 11:42 PM, roni <roni.epi...@gmail.com> wrote:
>>
>>> I want to use Spark to read compressed .bed files containing gene
>>> sequencing alignment data.
>>> I want to store the .bed file data in a database and then use external
>>> gene expression data to find overlaps etc. Which database is best for it?
>>> Thanks
>>> -Roni
>>>
>>>
>>
>
>
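
For reference, the BED coordinate convention and the overlap queries described in the thread above can be sketched in a few lines of plain Python. This is only a minimal brute-force illustration, not ADAM's BroadcastRegionJoin or ShuffleRegionJoin and not a Spark job. The names Feature, parse_bed_line, overlaps, features_in_range and overlap_join are hypothetical, and the reference feature "geneA" is made-up example data. The two peaks are the ones quoted in the thread.

```python
from typing import List, NamedTuple, Tuple


class Feature(NamedTuple):
    """One BED record (hypothetical helper type, first four columns only)."""
    chrom: str
    start: int  # chromStart: 0-based, included in the feature
    end: int    # chromEnd: not included in the feature (half-open interval)
    name: str


def parse_bed_line(line: str) -> Feature:
    """Parse a whitespace-separated BED line; extra columns are ignored."""
    fields = line.split()
    name = fields[3] if len(fields) > 3 else ""
    return Feature(fields[0], int(fields[1]), int(fields[2]), name)


def overlaps(a: Feature, b: Feature) -> bool:
    """Half-open intervals overlap when each starts before the other ends."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def features_in_range(features: List[Feature], chrom: str,
                      start: int, end: int) -> List[Feature]:
    """Use case 1: features overlapping the half-open range [start, end)."""
    query = Feature(chrom, start, end, "query")
    return [f for f in features if overlaps(f, query)]


def overlap_join(left: List[Feature],
                 right: List[Feature]) -> List[Tuple[Feature, Feature]]:
    """Use cases 2 and 3: all overlapping pairs, by brute force (O(n * m))."""
    return [(l, r) for l in left for r in right if overlaps(l, r)]


if __name__ == "__main__":
    peaks = [
        parse_bed_line("chr1 713776 714375 peak.1 599 +"),
        parse_bed_line("chr1 752401 753000 peak.2 599 +"),
    ]
    # Made-up reference feature that ends exactly where peak.2 starts.
    reference = [Feature("chr1", 714000, 752401, "geneA")]
    for peak, ref in overlap_join(peaks, reference):
        print(peak.name, "overlaps", ref.name)
```

Because chromEnd is excluded, a feature that ends at 752401 does not overlap one that starts at 752401, so only peak.1 pairs with the made-up reference feature here. The pairwise loop is fine for small inputs; for .bed files in the gigabyte range, a sorted or distributed join such as the ones Frank mentions would be the more realistic choice.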