RE: How to Take the whole file as a partition

2015-09-03 Thread Ewan Leith
Have a look at sparkContext.binaryFiles; it works like wholeTextFiles but 
returns a PortableDataStream per file. It might be a workable solution, though 
you'll need to handle the binary-to-UTF-8 (or equivalent) conversion yourself.
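
A rough, untested sketch of what that could look like in Java (assuming ctx is a 
JavaSparkContext, inputPath points at your input directory, and the files are 
UTF-8 text):

    import java.nio.charset.StandardCharsets;

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.input.PortableDataStream;

    // One record per file: the key is the file path, the value a stream over that file's bytes.
    JavaPairRDD<String, PortableDataStream> files = ctx.binaryFiles(inputPath);

    // Decode each file as UTF-8. Note toArray() pulls the whole file into memory,
    // so files approaching 2 GB sit right at the Java byte[] size limit.
    JavaRDD<String> contents = files.values().map(new Function<PortableDataStream, String>() {
        @Override
        public String call(PortableDataStream stream) throws Exception {
            return new String(stream.toArray(), StandardCharsets.UTF_8);
        }
    });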

Thanks,
Ewan

From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: 03 September 2015 15:22
To: user@spark.apache.org
Subject: How to Take the whole file as a partition

Hi All,

I have 1000 files, ranging from 500 MB to 1-2 GB at the moment, and I want Spark 
to read each file as one partition, i.e. with the FileSplit behaviour turned off.

I know there are some solutions, but none of them works well in my case:
1, I can't use the wholeTextFiles method, because my files are too big and I 
don't want to risk the performance.
2, I tried to use newAPIHadoopFile with the file split turned off:

lines = ctx.newAPIHadoopFile(inputPath, NonSplitableTextInputFormat.class,
        LongWritable.class, Text.class, hadoopConf)
    .values()
    .map(new Function<Text, String>() {
        @Override
        public String call(Text arg0) throws Exception {
            return arg0.toString();
        }
    });

This works in some cases, but it truncates some lines (I am not sure why, but it 
looks like there is a limit on this kind of file reading). I have a feeling that 
Spark truncates the file at 2 GB. In any case it happens (the same data has no 
issue when I use MapReduce for the input): Spark sometimes truncates a very big 
file when it tries to read it whole.

3, Another way is to distribute the file names as the input to Spark and open a 
stream inside the function to read each file directly. This is what I am 
planning to do, but I think it is ugly. Does anyone have a better solution?
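
A minimal sketch of that third option (the file names below are placeholders; it 
assumes ctx is a JavaSparkContext, the files sit on an HDFS-compatible 
filesystem, and line counting stands in for whatever per-file processing is 
needed, streaming through each file rather than materialising it):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;

    // Placeholder file names; in practice these would be listed from the input directory.
    List<String> fileNames = Arrays.asList("hdfs:///data/file-0001.txt",
                                           "hdfs:///data/file-0002.txt");

    // One partition per file name, so each file is handled by exactly one task.
    JavaRDD<Long> lineCounts = ctx.parallelize(fileNames, fileNames.size())
        .map(new Function<String, Long>() {
            @Override
            public Long call(String name) throws Exception {
                Path path = new Path(name);
                FileSystem fs = path.getFileSystem(new Configuration());
                long lines = 0;
                // Stream through the file instead of loading it whole into memory.
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
                    while (reader.readLine() != null) {
                        lines++;
                    }
                }
                return lines;
            }
        });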

BTW: the files are currently in text format, but they might move to Parquet 
later; that is another reason I don't like my third option.

Regards,

Shuai


Re: How to Take the whole file as a partition

2015-09-03 Thread Tao Lu
Your situation is special. It seems to me that Spark may not fit well in your
case.

You want to process each individual file (500 MB~2 GB) as a whole, and you want
good performance.

You may want to write your own Scala/Java program, distribute it along
with those files across your cluster, and run the copies in parallel.

If you insist on using Spark, option 3 is probably the closest fit.

Cheers,
Tao


-- 

Thanks!
Tao


RE: How to Take the whole file as a partition

2015-09-03 Thread Shuai Zheng
Hi, 

 

Is there any way to change the default split size when loading data in Spark? 
By default it is 64 MB. I know how to change this in Hadoop MapReduce, but I am 
not sure how to do it in Spark.
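
One possibility, assuming the input is read via newAPIHadoopFile so the standard 
Hadoop property is honoured (an untested sketch; ctx is assumed to be a 
JavaSparkContext and inputPath the input directory):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.spark.api.java.JavaPairRDD;

    Configuration hadoopConf = new Configuration();
    // Raise the minimum split size (here to roughly 4 GB) so that files smaller
    // than this are never split; this is the new-API Hadoop property read by
    // FileInputFormat when it computes split sizes.
    hadoopConf.setLong("mapreduce.input.fileinputformat.split.minsize",
            4L * 1024 * 1024 * 1024);

    JavaPairRDD<LongWritable, Text> records = ctx.newAPIHadoopFile(
            inputPath, TextInputFormat.class, LongWritable.class, Text.class, hadoopConf);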

 

Regards,

 

Shuai

 
