Have a look at sparkContext.binaryFiles; it works like wholeTextFiles but
returns a PortableDataStream per file. It might be a workable solution, though
you'll need to handle the binary-to-UTF-8 (or equivalent) conversion yourself.
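As a rough illustration of that suggestion, a spark-shell sketch (sc is provided by the shell; the input path is hypothetical, and UTF-8 is assumed as the file encoding):

```scala
import java.nio.charset.StandardCharsets

// binaryFiles yields one (path, PortableDataStream) pair per file,
// regardless of file size -- no FileSplit is applied.
val files = sc.binaryFiles("hdfs:///data/input")

// The binary-to-UTF-8 conversion mentioned above: materialize the
// stream as bytes, then decode.
val texts = files.mapValues(stream =>
  new String(stream.toArray(), StandardCharsets.UTF_8))
```

Note that toArray() pulls the whole file into executor memory, so with 1-2 GB files the executors need to be sized accordingly.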
Thanks,
Ewan
From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: 03
Hi All,
I have 1,000 files, from 500 MB to 1-2 GB at this moment, and I want Spark
to read them partitioned at the file level, which means I want the FileSplit
behaviour turned off.
I know there are some solutions, but they don't work well in my case:
1. I can't use the wholeTextFiles method, because my file is
Your situation is special. It seems to me that Spark may not fit well in your
case.
You want to process the individual files (500 MB~2 GB) each as a whole, and you
want good performance.
You may want to write your own Scala/Java programs, distribute them along
with those files across your cluster, and run them in
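The "process each file as a whole" part of this advice can also be approximated inside Spark itself, by parallelizing the list of paths so each file becomes one task. A spark-shell sketch (sc provided by the shell; the paths and the processFile body are placeholders, and the files are assumed reachable from every executor, e.g. on a shared filesystem):

```scala
// Hypothetical per-file processing: here, just count lines.
def processFile(path: String): Long =
  scala.io.Source.fromFile(path, "UTF-8").getLines().length

val paths = Seq("/shared/data/f1.txt", "/shared/data/f2.txt") // placeholders

// One partition per path => one task per whole file, no splitting.
val counts = sc.parallelize(paths, paths.size).map(processFile).collect()
```

The driver only ships the path strings; each executor opens its file locally, which is why shared storage (NFS, HDFS via a local mount, etc.) is assumed.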
Hi,
Is there any way to change the default split size when loading data in Spark?
By default it is 64 MB. I know how to change this in Hadoop MapReduce, but I'm
not sure how to do it in Spark.
Regards,
Shuai
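For what it's worth, two knobs that can influence this, sketched for the spark-shell (sc provided; the path and sizes are hypothetical):

```scala
// 1. Ask for a minimum number of partitions directly when loading:
val rdd1 = sc.textFile("hdfs:///data/input", minPartitions = 128)

// 2. Or tune the underlying Hadoop input-format split size (in bytes),
//    which Spark's Hadoop-based readers pick up:
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024)
val rdd2 = sc.textFile("hdfs:///data/input")
```

Option 1 only increases the partition count (more, smaller splits); to get fewer, larger splits the Hadoop split-size properties are the lever.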
From: Tao Lu [mailto:taolu2...@gmail.com]
Sent: Thursday, September 03,