subject:"RE\: Control default partition when load a RDD from HDFS"

RE: Control default partition when load a RDD from HDFS

2014-12-18 Thread Shuai Zheng

Hmmm, how to do that? You mean for each file create a RDD? Then I will have
tons of RDD.

And my calculation need to rely on other input, not just the file itself

 

Can you show some pseudo code for that logic?

 

Regards,

 

Shuai

 

From: Diego García Valverde [mailto:dgarci...@agbar.es] 
Sent: Wednesday, December 17, 2014 11:04 AM
To: Shuai Zheng; 'Sun, Rui'; user@spark.apache.org
Subject: RE: Control default partition when load a RDD from HDFS

 

Why not is a good option to create a RDD per each 200Mb file and then apply
the pre-calculations before merging them? I think the partitions per RDD
must be transparent to the pre-calculations, and not to set them fixed to
optimize the spark maps/reduces processes.

 

De: Shuai Zheng [mailto:szheng.c...@gmail.com] 
Enviado el: miércoles, 17 de diciembre de 2014 16:01
Para: 'Sun, Rui'; user@spark.apache.org
Asunto: RE: Control default partition when load a RDD from HDFS

 

Nice, that is the answer I want. 

Thanks!

 

From: Sun, Rui [mailto:rui@intel.com] 
Sent: Wednesday, December 17, 2014 1:30 AM
To: Shuai Zheng; user@spark.apache.org
Subject: RE: Control default partition when load a RDD from HDFS

 

Hi, Shuai,

 

How did you turn off the file split in Hadoop? I guess you might have
implemented a customized FileInputFormat which overrides isSplitable() to
return FALSE. If you do have such FileInputFormat, you can simply pass it as
a constructor parameter to HadoopRDD or NewHadoopRDD in Spark.

 

From: Shuai Zheng [mailto:szheng.c...@gmail.com] 
Sent: Wednesday, December 17, 2014 4:16 AM
To: user@spark.apache.org
Subject: Control default partition when load a RDD from HDFS

 

Hi All,

 

My application load 1000 files, each file from 200M   a few GB, and combine
with other data to do calculation. 

Some pre-calculation must be done on each file level, then after that, the
result need to combine to do further calculation. 

In Hadoop, it is simple because I can turn-off the file split for input
format (to enforce each file will go to same mapper), then I will do the
file level calculation in mapper and pass result to reducer. But in spark,
how can I do it?

Basically I want to make sure after I load these files into RDD, it is
partitioned by file (not split file and also no merge there), so I can call
mapPartitions. Is it any way I can control the default partition when I load
the RDD? 

This might be the default behavior that spark do the partition (partitioned
by file when first time load the RDD), but I cant find any document to
support my guess, if not, can I enforce this kind of partition? Because the
total file size is bigger, I dont want to re-partition in the code. 

 

Regards,

 

Shuai

 

  _  

Disclaimer: http://disclaimer.agbar.com

RE: Control default partition when load a RDD from HDFS

2014-12-17 Thread Shuai Zheng

Nice, that is the answer I want. 

Thanks!

From: Sun, Rui [mailto:rui@intel.com] 
Sent: Wednesday, December 17, 2014 1:30 AM
To: Shuai Zheng; user@spark.apache.org
Subject: RE: Control default partition when load a RDD from HDFS

Hi, Shuai,

How did you turn off the file split in Hadoop? I guess you might have
implemented a customized FileInputFormat which overrides isSplitable() to
return FALSE. If you do have such FileInputFormat, you can simply pass it as
a constructor parameter to HadoopRDD or NewHadoopRDD in Spark.

From: Shuai Zheng [mailto:szheng.c...@gmail.com] 
Sent: Wednesday, December 17, 2014 4:16 AM
To: user@spark.apache.org
Subject: Control default partition when load a RDD from HDFS

Hi All,

My application load 1000 files, each file from 200M -  a few GB, and combine
with other data to do calculation. 

Some pre-calculation must be done on each file level, then after that, the
result need to combine to do further calculation. 

In Hadoop, it is simple because I can turn-off the file split for input
format (to enforce each file will go to same mapper), then I will do the
file level calculation in mapper and pass result to reducer. But in spark,
how can I do it?

Basically I want to make sure after I load these files into RDD, it is
partitioned by file (not split file and also no merge there), so I can call
mapPartitions. Is it any way I can control the default partition when I load
the RDD? 

This might be the default behavior that spark do the partition (partitioned
by file when first time load the RDD), but I can't find any document to
support my guess, if not, can I enforce this kind of partition? Because the
total file size is bigger, I don't want to re-partition in the code. 

Regards,

Shuai

RE: Control default partition when load a RDD from HDFS

2014-12-17 Thread Diego García Valverde

Why not is a good option to create a RDD per each 200Mb file and then apply the 
pre-calculations before merging them? I think the partitions per RDD must be 
transparent to the pre-calculations, and not to set them fixed to optimize the 
spark maps/reduces processes.

De: Shuai Zheng [mailto:szheng.c...@gmail.com]
Enviado el: miércoles, 17 de diciembre de 2014 16:01
Para: 'Sun, Rui'; user@spark.apache.org
Asunto: RE: Control default partition when load a RDD from HDFS

Nice, that is the answer I want.
Thanks!

From: Sun, Rui [mailto:rui@intel.com]
Sent: Wednesday, December 17, 2014 1:30 AM
To: Shuai Zheng; user@spark.apache.org
Subject: RE: Control default partition when load a RDD from HDFS

Hi, Shuai,

How did you turn off the file split in Hadoop? I guess you might have 
implemented a customized FileInputFormat which overrides isSplitable() to 
return FALSE. If you do have such FileInputFormat, you can simply pass it as a 
constructor parameter to HadoopRDD or NewHadoopRDD in Spark.

From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: Wednesday, December 17, 2014 4:16 AM
To: user@spark.apache.orgmailto:user@spark.apache.org
Subject: Control default partition when load a RDD from HDFS

Hi All,

My application load 1000 files, each file from 200M -  a few GB, and combine 
with other data to do calculation.
Some pre-calculation must be done on each file level, then after that, the 
result need to combine to do further calculation.
In Hadoop, it is simple because I can turn-off the file split for input format 
(to enforce each file will go to same mapper), then I will do the file level 
calculation in mapper and pass result to reducer. But in spark, how can I do it?
Basically I want to make sure after I load these files into RDD, it is 
partitioned by file (not split file and also no merge there), so I can call 
mapPartitions. Is it any way I can control the default partition when I load 
the RDD?
This might be the default behavior that spark do the partition (partitioned by 
file when first time load the RDD), but I can't find any document to support my 
guess, if not, can I enforce this kind of partition? Because the total file 
size is bigger, I don't want to re-partition in the code.

Regards,

Shuai


Disclaimer: http://disclaimer.agbar.com

RE: Control default partition when load a RDD from HDFS

2014-12-16 Thread Sun, Rui

Hi, Shuai,

How did you turn off the file split in Hadoop? I guess you might have 
implemented a customized FileInputFormat which overrides isSplitable() to 
return FALSE. If you do have such FileInputFormat, you can simply pass it as a 
constructor parameter to HadoopRDD or NewHadoopRDD in Spark.

From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: Wednesday, December 17, 2014 4:16 AM
To: user@spark.apache.org
Subject: Control default partition when load a RDD from HDFS

Hi All,

My application load 1000 files, each file from 200M -  a few GB, and combine 
with other data to do calculation.
Some pre-calculation must be done on each file level, then after that, the 
result need to combine to do further calculation.
In Hadoop, it is simple because I can turn-off the file split for input format 
(to enforce each file will go to same mapper), then I will do the file level 
calculation in mapper and pass result to reducer. But in spark, how can I do it?
Basically I want to make sure after I load these files into RDD, it is 
partitioned by file (not split file and also no merge there), so I can call 
mapPartitions. Is it any way I can control the default partition when I load 
the RDD?
This might be the default behavior that spark do the partition (partitioned by 
file when first time load the RDD), but I can't find any document to support my 
guess, if not, can I enforce this kind of partition? Because the total file 
size is bigger, I don't want to re-partition in the code.

Regards,

Shuai

RE: Control default partition when load a RDD from HDFS

RE: Control default partition when load a RDD from HDFS

RE: Control default partition when load a RDD from HDFS

RE: Control default partition when load a RDD from HDFS

4 matches

Site Navigation

Mail list logo

Footer information