Hi Frank,
Thanks for the reply. I downloaded and built ADAM, but it does not seem to
list these functions among its command-line options.
Are they exposed as a public API that I can call from code?

Also, I need to save all my intermediate data. It seems that ADAM stores
data in Parquet on HDFS.
I want to save the data in an external database so that multiple people
can reuse it in multiple ways.
Any suggestions on the choice of DB, or on keeping data centralized for use
by multiple distinct groups?
Thanks
-Roni



On Mon, Jun 8, 2015 at 12:47 PM, Frank Austin Nothaft <fnoth...@berkeley.edu> wrote:

> Hi Roni,
>
> We have a full suite of genomic feature parsers in ADAM
> <https://github.com/bigdatagenomics/adam> that can read BED, narrowPeak,
> GATK interval lists, and GTF/GFF into Spark RDDs. Additionally, we have
> support for efficient overlap joins (query 3 in your email below). You can
> load the genomic features with ADAMContext.loadFeatures
> <https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala#L438>.
> We have two tools for the overlap computation: you can use a
> BroadcastRegionJoin
> <https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/BroadcastRegionJoin.scala>
> if one of the datasets you want to overlap is small, or a ShuffleRegionJoin
> <https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ShuffleRegionJoin.scala>
> if both datasets are large.
>
> Regards,
>
> Frank Austin Nothaft
> fnoth...@berkeley.edu
> fnoth...@eecs.berkeley.edu
> 202-340-0466
>
> On Jun 8, 2015, at 9:39 PM, roni <roni.epi...@gmail.com> wrote:
>
> Sorry for the delay.
> The files (called .bed files) have a format like this:
>
> Chromosome start  end    feature score  strand
>
> chr1   713776  714375  peak.1  599    +
> chr1   752401  753000  peak.2  599    +
>
> The mandatory fields are
>
>
>    1. chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or 
> scaffold (e.g. scaffold10671).
>    2. chromStart - The starting position of the feature in the chromosome or 
> scaffold. The first base in a chromosome is numbered 0.
>    3. chromEnd - The ending position of the feature in the chromosome or 
> scaffold. The *chromEnd* base is not included in the display of the feature. 
> For example, the first 100 bases of a chromosome are defined as 
> *chromStart=0, chromEnd=100*, and span the bases numbered 0-99.
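
The half-open convention described above can be checked with a small sketch. This is only an illustration in plain Python, not part of ADAM or Spark; the names BedFeature, parse_bed_line, feature_length and last_base are hypothetical, and it assumes whitespace-separated data lines with no header or track lines.

```python
from typing import NamedTuple


class BedFeature(NamedTuple):
    """One BED record: the three mandatory fields plus an optional name."""
    chrom: str
    start: int      # chromStart: 0-based, inclusive
    end: int        # chromEnd: exclusive (not part of the feature)
    name: str = ""


def parse_bed_line(line: str) -> BedFeature:
    """Parse one whitespace-separated BED data line (assumes no header line)."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 BED fields, got {len(fields)}")
    name = fields[3] if len(fields) > 3 else ""
    return BedFeature(fields[0], int(fields[1]), int(fields[2]), name)


def feature_length(feature: BedFeature) -> int:
    """Number of bases covered; with half-open coordinates this is end - start."""
    return feature.end - feature.start


def last_base(feature: BedFeature) -> int:
    """0-based index of the last base that is actually included."""
    return feature.end - 1


# The "first 100 bases" example from the field description (chrom name is arbitrary).
first_100 = parse_bed_line("chrN 0 100")
# The first sample row from the file excerpt above.
peak = parse_bed_line("chr1   713776  714375  peak.1  599    +")
```

With these definitions, first_100 covers 100 bases numbered 0 through 99, and no "+ 1" correction is needed when computing lengths.
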
>
> There can be more fields, as described at
> https://genome.ucsc.edu/FAQ/FAQformat.html#format1
> Often the use cases are like:
> 1. Find the features between given start and end positions.
> 2. Find features whose start and end points overlap those of another
> feature.
> 3. Read external (reference) data in a similar format (e.g.
> chr10  48514785  49604641  MAPK8  49514785  +) and find all the data
> points that overlap with the other .bed files.
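
To make the three queries above concrete, here is a minimal in-memory sketch in plain Python. It is not how ADAM or any database implements them; the function names are hypothetical, "between" in query 1 is assumed to mean fully contained in the window, and the join is a naive nested loop that would not scale to multi-GB files.

```python
from typing import List, Tuple

# (chrom, start, end, name) using BED-style 0-based, half-open coordinates.
Feature = Tuple[str, int, int, str]


def overlaps(a: Feature, b: Feature) -> bool:
    """True if a and b are on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def features_within(features: List[Feature], chrom: str,
                    start: int, end: int) -> List[Feature]:
    """Query 1 (assumed meaning): features fully inside [start, end) on chrom."""
    return [f for f in features
            if f[0] == chrom and f[1] >= start and f[2] <= end]


def overlap_join(left: List[Feature],
                 right: List[Feature]) -> List[Tuple[Feature, Feature]]:
    """Queries 2 and 3: every (left, right) pair that overlaps. Naive O(n*m)."""
    return [(l, r) for l in left for r in right if overlaps(l, r)]


# Sample rows taken from this thread.
peaks: List[Feature] = [
    ("chr1", 713776, 714375, "peak.1"),
    ("chr1", 752401, 753000, "peak.2"),
]
reference: List[Feature] = [("chr10", 48514785, 49604641, "MAPK8")]

in_window = features_within(peaks, "chr1", 700000, 720000)
joined = overlap_join(peaks, reference)
```

Because the coordinates are half-open, two features that merely touch (one ends at 100, the next starts at 100) are not reported as overlapping. For query 2 within a single file, overlap_join(peaks, peaks) would also pair each feature with itself, so those pairs need filtering out.
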
>
> The data is huge. .bed files can range from 0.5 GB to 5 GB (or more).
> I was thinking of using Cassandra, but I am not sure whether overlap queries
> can be supported and will be fast enough.
>
> Thanks for the help
> -Roni
>
>
> On Sat, Jun 6, 2015 at 7:03 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Can you describe your use case in a bit more detail, since not everyone
>> on this mailing list is familiar with gene sequencing alignment data?
>>
>> Thanks
>>
>> On Fri, Jun 5, 2015 at 11:42 PM, roni <roni.epi...@gmail.com> wrote:
>>
>>> I want to use Spark to read compressed .bed files containing gene
>>> sequencing alignment data.
>>> I want to store the .bed file data in a database and then use external gene
>>> expression data to find overlaps etc. Which database is best for this?
>>> Thanks
>>> -Roni
>>>
>>>
>>
>
>
