RE: How to Take the whole file as a partition

2015-09-03 Thread Ewan Leith
Have a look at sparkContext.binaryFiles; it works like wholeTextFiles but returns a PortableDataStream per file. It might be a workable solution, though you'll need to handle the binary-to-UTF-8 (or equivalent) conversion yourself. Thanks, Ewan
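A minimal sketch of what Ewan is suggesting, assuming an existing SparkContext `sc` and a hypothetical HDFS input path; each record of `binaryFiles` is a (path, PortableDataStream) pair, and the stream can be materialized and decoded as UTF-8:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.input.PortableDataStream
import java.nio.charset.StandardCharsets

// Hypothetical path; one (path, stream) record per input file,
// so each whole file lands in a single partition element.
val files = sc.binaryFiles("hdfs:///data/input/*")

// Materialize each file's bytes and decode them as UTF-8 text.
val texts = files.map { case (path, stream: PortableDataStream) =>
  (path, new String(stream.toArray(), StandardCharsets.UTF_8))
}
```

Note that `toArray()` pulls the entire file into memory on one executor, so with 500M~2G files the executors need heap sized accordingly.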

Re: How to Take the whole file as a partition

2015-09-03 Thread Tao Lu
Your situation is special; it seems to me Spark may not fit well in your case. You want to process the individual files (500M~2G) as a whole, and you want good performance. You may want to write your own Scala/Java programs, distribute them along with those files across your cluster, and run them in

RE: How to Take the whole file as a partition

2015-09-03 Thread Shuai Zheng
Hi, Is there any way to change the default split size when loading data in Spark? By default it is 64M. I know how to change this in Hadoop MapReduce, but I am not sure how to do it in Spark. Regards, Shuai
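One way to do this in Spark is to set the Hadoop input-format split properties on the SparkContext's Hadoop configuration before reading; a minimal sketch, assuming an existing SparkContext `sc` and a hypothetical input path (the 2 GB value is illustrative, chosen to exceed the file sizes in question so each file becomes a single split):

```scala
import org.apache.spark.SparkContext

// Raise the minimum split size (in bytes) above the largest file,
// so the Hadoop input format does not break files into 64M splits.
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.split.minsize",
  (2L * 1024 * 1024 * 1024).toString)

val lines = sc.textFile("hdfs:///data/input")
```

For text input there is also the simpler knob of the `minPartitions` argument to `sc.textFile(path, minPartitions)`, though that only increases the partition count; forcing fewer, larger splits goes through the Hadoop configuration as above.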