Have a look at sparkContext.binaryFiles; it works like wholeTextFiles but
returns a PortableDataStream per file. It might be a workable solution, though
you'll need to handle the binary-to-UTF-8 (or equivalent) conversion yourself.
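A minimal sketch of that conversion step, assuming the files contain UTF-8 text (the encoding is an assumption about the input; in PySpark, binaryFiles yields (path, bytes) pairs, so the same decode applies per file):

```python
def stream_to_text(raw: bytes) -> str:
    # binaryFiles hands back each file's raw bytes (via
    # PortableDataStream.toArray() in the Scala API); decoding
    # recovers the text. "utf-8" is an assumption about the data.
    return raw.decode("utf-8")

# In PySpark this would apply per file, e.g. (hypothetical path):
#   sc.binaryFiles("hdfs:///data/in").mapValues(lambda b: b.decode("utf-8"))
print(stream_to_text("héllo wörld".encode("utf-8")))  # prints "héllo wörld"
```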
Thanks,
Ewan
From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: 03
Your situation is unusual. It seems to me that Spark may not fit well in your
case.
You want to process each individual file (500MB~2GB) as a whole, and you want
good performance.
You may want to write your own Scala/Java programs, distribute them along
with those files across your cluster, and run them in
Hi,
Is there any way to change the default split size when loading data in Spark?
By default it is 64MB. I know how to change this in Hadoop MapReduce, but I'm
not sure how to do it in Spark.
Regards,
Shuai
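For context on the question above: Spark's input splits come from the underlying Hadoop input format, where FileInputFormat computes each split as max(minSize, min(maxSize, blockSize)), and 64MB is the classic HDFS default block size. A small sketch of that rule (the function name is illustrative, not a Spark API):

```python
def compute_split_size(block_size: int, min_size: int = 1,
                       max_size: int = 2**63 - 1) -> int:
    # Hadoop FileInputFormat's rule: a split defaults to the block size,
    # clamped between the configured minimum and maximum split sizes.
    return max(min_size, min(max_size, block_size))

MB = 1024 * 1024
# Default: split size equals the 64MB block size.
print(compute_split_size(64 * MB) // MB)  # prints 64
# Raising the minimum split size (e.g. via
# mapreduce.input.fileinputformat.split.minsize on the Hadoop
# configuration) yields larger splits:
print(compute_split_size(64 * MB, min_size=256 * MB) // MB)  # prints 256
```

Going the other way, passing a larger minPartitions to sc.textFile(path, minPartitions) asks Spark for more (smaller) partitions.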
From: Tao Lu [mailto:taolu2...@gmail.com]
Sent: Thursday, September 03,