I am trying to investigate problems reading a data file over 4 GB, and as part of my testing I am trying to create such a file. My plan is to create a FASTA file (a simple text format used in biology) that looks like:

>1
TCCTTACGGAGTTCGGGTGTTTATCTTACTTATCGCGGTTCGCTGCCGCTCCGGGAGCCCGGATAGGCTGCGTTAATACCTAAGGAGCGCGTATTGAAAA
>2
GTCTGATCTAAATGCGACGACGTCTTTAGTGCTAAGTGGAACCCAATCTTAAGACCCAGGCTCTTAAGCAGAAACAGACCGTCCCTGCCTCCTGGAGTAT
>3
...

I create a list of 5,000 structures, use flatMap to expand each entry into 5,000 records, and then either call saveAsTextFile or call dnaFragmentIterator = mySet.toLocalIterator(); and write to HDFS myself.
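For context, here is a minimal sketch of the kind of record generator I mean, without the Spark plumbing. The class name `FastaGenerator` and its parameters are my own for illustration; in the real job each element of the driver-side list would be flatMapped into 5,000 such records before writing them out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper: builds FASTA records of the form ">id\nSEQUENCE"
// with random bases, for generating large synthetic test files.
public class FastaGenerator {
    private static final char[] BASES = {'A', 'C', 'G', 'T'};

    // Generate `count` records, each with a `seqLength`-base random sequence.
    // A fixed seed keeps the output reproducible across runs.
    public static List<String> makeRecords(int count, int seqLength, long seed) {
        Random rng = new Random(seed);
        List<String> records = new ArrayList<>(count);
        for (int i = 1; i <= count; i++) {
            StringBuilder sb = new StringBuilder(seqLength + 16);
            sb.append('>').append(i).append('\n');
            for (int j = 0; j < seqLength; j++) {
                sb.append(BASES[rng.nextInt(BASES.length)]);
            }
            records.add(sb.toString());
        }
        return records;
    }
}
```

In the Spark version, a `flatMap` over a parallelized list of 5,000 seeds, each producing 5,000 records like these, yields the 25 million records mentioned below.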
Then I try to read the file back with JavaRDD<String> lines = ctx.textFile(hdfsFileName); and on a 16-node cluster I get:

14/12/06 01:49:21 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(pltrd007.labs.uninett.no,50119)
java.nio.channels.ClosedChannelException

14/12/06 01:49:35 ERROR BlockManagerMasterActor: Got two different block manager registrations on 20140711-081617-711206558-5050-2543-13

The full code is at the link below; I did not want to spam the group, although it is only a couple of pages. I am baffled: there are no issues when I create a few thousand records, but things blow up when I try 25 million records, a file of about 6 GB. Can someone take a look? It is not a lot of code.
https://drive.google.com/file/d/0B4cgoSGuA4KWUmo3UzBZRmU5M3M/view?usp=sharing