Hi All,

 

I have 1000 files, ranging from 500MB to 1-2GB at the moment, and I want
Spark to read them with one partition per file, i.e. with file splitting
turned off.

 

I know there are a few possible solutions, but none of them is very good in
my case:

1. I can't use the wholeTextFiles method, because my files are too big and I
don't want to risk the performance.
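
(For context, wholeTextFiles returns one (filePath, fileContent) pair per
file, so each 1-2GB file would come back as a single huge string in memory:)

// Each record holds an entire file's content as one value.
JavaPairRDD<String, String> files = ctx.wholeTextFiles(inputPath);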

2. I tried using newAPIHadoopFile and turning off the file split:

 

lines = ctx.newAPIHadoopFile(inputPath, NonSplitableTextInputFormat.class,
        LongWritable.class, Text.class, hadoopConf)
    .values()
    .map(new Function<Text, String>() {
        @Override
        public String call(Text arg0) throws Exception {
            return arg0.toString();
        }
    });

 

This works in some cases, but it truncates some lines (I am not sure why,
but it looks like there is a limit on how much of a file gets read). I have
a feeling that Spark truncates the file at around 2GB. Whatever the cause,
it does happen (the same data has no issue when I read it with MapReduce):
Spark sometimes truncates a very big file when asked to read the whole
thing.
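
For reference, the NonSplitableTextInputFormat above is essentially just a
TextInputFormat with splitting disabled, something along these lines:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// A text input format that never splits a file, so each file becomes exactly
// one input split (and therefore one Spark partition).
public class NonSplitableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}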

 

3. Another option is to distribute the file names as the input to Spark and,
inside a function, open a stream and read each file directly, roughly as
sketched below. This is what I am planning to do, but I think it is ugly.
Does anyone have a better solution?
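
What I have in mind for option 3 is roughly the following sketch (assuming
the Spark 1.x Java API, where the flatMap function returns an Iterable;
fileNames would be a List<String> holding the 1000 paths):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

// Rough sketch: parallelize the file names, one partition per file, and open
// each file directly on the executors via the Hadoop FileSystem API.
JavaRDD<String> lines = ctx.parallelize(fileNames, fileNames.size())
    .flatMap(new FlatMapFunction<String, String>() {
        @Override
        public Iterable<String> call(String pathStr) throws Exception {
            Path path = new Path(pathStr);
            FileSystem fs = path.getFileSystem(new Configuration());
            BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path)));
            // Read the whole file line by line on the executor.
            List<String> result = new ArrayList<String>();
            String line;
            while ((line = reader.readLine()) != null) {
                result.add(line);
            }
            reader.close();
            return result;
        }
    });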

 

BTW: the files are currently in text format, but they might become Parquet
later, which is another reason I don't like my third option.

 

Regards,

 

Shuai
