Thanks for your suggestion, Daniel. I was already using SequenceFile, but my format was poor: I was storing file contents as Text in my SeqFile, so all my map jobs did repeated conversion from Text to double. I resolved this by correcting the SequenceFile format. Now I store a serialized Java object in the SeqFile, and my map jobs are faster.

-- Madhav Sharan

On Wed, Aug 17, 2016 at 11:07 PM, Daniel Haviv <danielru...@gmail.com> wrote:
> Store them within a sequencefile
>
> On Thursday, 18 August 2016, Madhav Sharan <msha...@usc.edu> wrote:
>> Hi, can someone please recommend a fast way in Hadoop to store and
>> retrieve a matrix of double values?
>>
>> As of now we store values in text files and then read them in Java using
>> an HDFS InputStream and Scanner. *[0]* These files are actually vectors
>> representing a video file. Each vector is 883 x 200, and for one map job we
>> read 4 such vectors, so *the job is to convert 706,400 values to double*.
>>
>> With this approach it takes ~1.5 seconds to convert all these values. I
>> could use an external cache server to avoid repeated conversion, but I am
>> looking for a better solution.
>>
>> [0] - https://github.com/USCDataScience/hadoop-pot/blob/master/src/main/java/org/pooledtimeseries/PoT.java#L596
>>
>> --
>> Madhav Sharan
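The core of the fix above is avoiding text parsing: reading a binary-encoded matrix is much cheaper than running Scanner/Double.parseDouble over 706,400 tokens. Writing the actual SequenceFile requires the Hadoop libraries, but the encoding step can be sketched with only java.io. This is an illustrative sketch, not the hadoop-pot code; the class and method names are made up, and the byte layout (row count, column count, then raw doubles) is one plausible choice for the serialized value stored in the SeqFile.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class MatrixIO {

    // Encode a matrix as binary: row count, column count, then the raw
    // doubles. This is the kind of payload a SequenceFile value (e.g. a
    // BytesWritable) could carry instead of a Text representation.
    public static byte[] writeMatrix(double[][] m) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(m.length);
        out.writeInt(m[0].length);
        for (double[] row : m)
            for (double v : row)
                out.writeDouble(v);
        out.flush();
        return bos.toByteArray();
    }

    // Decode it back with fixed-size binary reads -- no Scanner and no
    // per-token Double.parseDouble, which is where the ~1.5 s went.
    public static double[][] readMatrix(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        int rows = in.readInt();
        int cols = in.readInt();
        double[][] m = new double[rows][cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                m[i][j] = in.readDouble();
        return m;
    }
}
```

In a map job, the value bytes read from the SequenceFile would be passed straight to readMatrix, so each 883 x 200 vector is recovered with a handful of bulk reads rather than 176,600 string conversions.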