RE: How to Take the whole file as a partition
Have a look at sparkContext.binaryFiles. It works like wholeTextFiles but returns a PortableDataStream per file. It might be a workable solution, though you'll need to handle the conversion from binary to UTF-8 or an equivalent encoding.

Thanks,
Ewan

From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: 03 September 2015 15:22
To: user@spark.apache.org
Subject: How to Take the whole file as a partition

Hi All,

I have 1000 files, ranging from 500 MB to 1-2 GB at the moment, and I want Spark to read each of them as a partition at the file level. In other words, I want the FileSplit behavior turned off.

I know of some solutions, but none works well in my case:

1. I can't use the wholeTextFiles method, because my files are too big and I don't want to risk the performance hit.

2. I tried newAPIHadoopFile with file splitting turned off:

    lines = ctx.newAPIHadoopFile(inputPath, NonSplitableTextInputFormat.class,
                                 LongWritable.class, Text.class, hadoopConf)
               .values()
               .map(new Function<Text, String>() {
                   @Override
                   public String call(Text arg0) throws Exception {
                       return arg0.toString();
                   }
               });

This works in some cases, but it truncates some lines (I am not sure why, but it looks like there is a limit on this file read). I have a feeling that Spark truncates the file at 2 GB. In any case it happens (the same data has no issue when I read it with MapReduce): Spark sometimes truncates a very big file when it tries to read all of it.

3. Alternatively, I can distribute the file names as the input to Spark and open a stream inside the function to read each file directly. This is what I am planning to do, but I think it is ugly. Does anyone have a better solution?

BTW: the files are currently in text format, but they might be in Parquet format later; that is another reason I don't like my third option.

Regards,
Shuai
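Ewan's binaryFiles suggestion could be sketched roughly as below. This is a minimal sketch, not the poster's actual job: the input path and app name are placeholders, it assumes Java 8+ and a local master, and the UTF-8 decoding helper stands in for whatever "binary to UTF-8 or equivalent" conversion the real data needs. binaryFiles yields one (path, PortableDataStream) pair per file, and a file is never split across partitions.

```java
import java.nio.charset.StandardCharsets;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.input.PortableDataStream;

public class WholeFileRead {

    // Decode one file's raw bytes as UTF-8 text (the binary-to-text step Ewan mentions).
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("whole-file-read").setMaster("local[*]");
        try (JavaSparkContext ctx = new JavaSparkContext(conf)) {
            // One (path, stream) pair per file; no file is ever split across partitions.
            JavaPairRDD<String, PortableDataStream> files = ctx.binaryFiles("/data/input"); // placeholder path
            // toArray() pulls the whole file into memory, so executors need headroom for 2 GB files.
            JavaRDD<String> contents = files.map(t -> decodeUtf8(t._2().toArray()));
            System.out.println("files read: " + contents.count());
        }
    }
}
```

Note that this loads each file fully into one executor's memory, so for 2 GB files the executor memory setting has to leave room for that.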
Re: How to Take the whole file as a partition
Your situation is special, and it seems to me Spark may not fit your case well: you want to process each individual file (500 MB - 2 GB) as a whole, and you want good performance. You may want to write your own Scala/Java program, distribute it along with those files across your cluster, and run the copies in parallel. If you insist on using Spark, your option 3 is probably the closest fit.

Cheers,
Tao

On Thu, Sep 3, 2015 at 10:22 AM, Shuai Zheng wrote:
[quoted message above]
--
Thanks!
Tao
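Option 3 (distributing the file names and opening streams inside the tasks) could be sketched as follows. This is a hedged sketch with placeholder paths; it assumes every executor can see the files (a shared or local filesystem), and for HDFS you would open the stream through Hadoop's FileSystem API instead of java.nio.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileNamesAsInput {

    // Read one whole file into a String on whichever executor runs the task.
    static String readWholeFile(String path) throws Exception {
        return new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("/data/f1.txt", "/data/f2.txt"); // placeholder paths
        SparkConf conf = new SparkConf().setAppName("filenames-as-input").setMaster("local[*]");
        try (JavaSparkContext ctx = new JavaSparkContext(conf)) {
            // One partition per path, so each file is processed as a whole by exactly one task.
            JavaRDD<String> contents = ctx.parallelize(paths, paths.size())
                                          .map(FileNamesAsInput::readWholeFile);
            System.out.println("files read: " + contents.count());
        }
    }
}
```

One drawback, as the original poster notes, is that this bypasses Spark's input formats entirely, so it would not carry over cleanly to Parquet input later.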
RE: How to Take the whole file as a partition
Hi,

Is there any way to change the default split size when loading data in Spark? By default it is 64 MB. I know how to change this in Hadoop MapReduce, but I am not sure how to do it in Spark.

Regards,
Shuai

From: Tao Lu [mailto:taolu2...@gmail.com]
Sent: Thursday, September 03, 2015 11:07 AM
To: Shuai Zheng
Cc: user
Subject: Re: How to Take the whole file as a partition
[quoted message above]
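On the split-size question: with newAPIHadoopFile the splits come from the Hadoop Configuration object that is passed in, so one way (a sketch, assuming the Hadoop 2 / new-API property name; the old API uses mapred.min.split.size instead) is to raise the minimum split size above the largest file, which makes each file a single split and hence a single partition:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SplitSizeDemo {

    // Build a Configuration asking new-API input formats for splits of at least `bytes`.
    static Configuration minSplitConf(long bytes) {
        Configuration hadoopConf = new Configuration();
        hadoopConf.setLong("mapreduce.input.fileinputformat.split.minsize", bytes);
        return hadoopConf;
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("split-size").setMaster("local[*]");
        try (JavaSparkContext ctx = new JavaSparkContext(conf)) {
            // With a 4 GB minimum, any file up to 4 GB becomes one split (one partition).
            Configuration hadoopConf = minSplitConf(4L * 1024 * 1024 * 1024);
            JavaRDD<String> lines = ctx.newAPIHadoopFile("/data/input", // placeholder path
                        TextInputFormat.class, LongWritable.class, Text.class, hadoopConf)
                    .values()
                    .map(Text::toString); // toString() copies, since Text objects are reused
        }
    }
}
```

Note that textFile's minPartitions argument only pushes the split size down (more partitions), not up, so the minsize property is the relevant knob here.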