Sorry for the delay.
The files (called .bed files) have a format like this:

chromosome  start   end     feature  score  strand

chr1        713776  714375  peak.1   599    +
chr1        752401  753000  peak.2   599    +

The mandatory fields are:

   1. chrom - The name of the chromosome (e.g. chr3, chrY,
      chr2_random) or scaffold (e.g. scaffold10671).
   2. chromStart - The starting position of the feature in the
      chromosome or scaffold. The first base in a chromosome is numbered 0.
   3. chromEnd - The ending position of the feature in the chromosome
      or scaffold. The *chromEnd* base is not included in the display of the
      feature. For example, the first 100 bases of a chromosome are defined
      as *chromStart=0, chromEnd=100*, and span the bases numbered 0-99.
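
To make the coordinate convention concrete, here is a minimal Python sketch of reading one line into the three mandatory fields. The names BedRecord, parse_bed_line and feature_length are made up for illustration, and the sketch assumes whitespace-separated columns with no header, "track" or "browser" lines.

```python
from typing import NamedTuple, Tuple


class BedRecord(NamedTuple):
    """One BED feature; start is 0-based, end is exclusive (half-open)."""
    chrom: str
    start: int
    end: int
    extra: Tuple[str, ...]  # optional columns (name, score, strand, ...), kept as text


def parse_bed_line(line: str) -> BedRecord:
    # Assumption: columns are separated by tabs or spaces, and the line is a
    # data line (not a header, "track" or "browser" line).
    parts = line.split()
    if len(parts) < 3:
        raise ValueError("a BED line needs at least chrom, chromStart, chromEnd")
    return BedRecord(parts[0], int(parts[1]), int(parts[2]), tuple(parts[3:]))


def feature_length(rec: BedRecord) -> int:
    # Because chromEnd is excluded, the length is simply end - start.
    return rec.end - rec.start


rec = parse_bed_line("chr1     713776  714375  peak.1  599    +")
print(rec.chrom, rec.start, rec.end, feature_length(rec))
```

Since the end position is excluded, the length of a feature is end minus start, so the first sample line above covers 599 bases.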

More optional fields can follow, as described at
https://genome.ucsc.edu/FAQ/FAQformat.html#format1

Typical use cases are:
1. Find the features between given start and end positions.
2. Find features whose start and end points overlap with another
   feature.
3. Read external (reference) data in a similar format
   (chr10  48514785        49604641        MAPK8   49514785        +)
   and find all the data points that overlap with the other .bed files.
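
The three use cases all reduce to the same interval-overlap test, which can be sketched in plain Python as below. This is only an illustration of the query semantics, not a proposal for how to store the data: the function names are made up, features are assumed to be (chrom, start, end, name) tuples, and "between start and end" is read here as "overlapping the range". The linear scans would not scale to multi-GB files without some kind of index.

```python
def overlaps(a, b):
    """True if two half-open intervals on the same chromosome share a base.

    a and b are tuples whose first three items are (chrom, start, end).
    """
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def features_in_range(features, chrom, start, end):
    # Use case 1: features overlapping the query range [start, end).
    query = (chrom, start, end)
    return [f for f in features if overlaps(f, query)]


def overlap_join(features, reference):
    # Use cases 2 and 3: every (feature, reference) pair that overlaps.
    # Grouping the reference by chromosome avoids comparing across chromosomes.
    by_chrom = {}
    for r in reference:
        by_chrom.setdefault(r[0], []).append(r)
    return [(f, r)
            for f in features
            for r in by_chrom.get(f[0], [])
            if overlaps(f, r)]


peaks = [("chr1", 713776, 714375, "peak.1"),
         ("chr1", 752401, 753000, "peak.2")]
genes = [("chr10", 48514785, 49604641, "MAPK8")]

print(features_in_range(peaks, "chr1", 714000, 752401))
print(overlap_join(peaks, genes))
```

Note that a query ending exactly at 752401 does not match peak.2, because end positions are excluded.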

The data is huge: .bed files can range from 0.5 GB to 5 GB (or more).
I was thinking of using Cassandra, but I am not sure whether overlap
queries can be supported and will be fast enough.

Thanks for the help
-Roni


On Sat, Jun 6, 2015 at 7:03 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Can you describe your use case in a bit more detail since not all people
> on this mailing list are familiar with gene sequencing alignments data ?
>
> Thanks
>
> On Fri, Jun 5, 2015 at 11:42 PM, roni <roni.epi...@gmail.com> wrote:
>
>> I want to use spark for reading compressed .bed file for reading gene
>> sequencing alignments data.
>> I want to store bed file data in db and then use external gene expression
>> data to find overlaps etc, which database is best for it ?
>> Thanks
>> -Roni
>>
>>
>
