RE: Control default partition when load a RDD from HDFS
Hmmm, how to do that? You mean for each file create a RDD? Then I will have tons of RDD. And my calculation need to rely on other input, not just the file itself Can you show some pseudo code for that logic? Regards, Shuai From: Diego García Valverde [mailto:dgarci...@agbar.es] Sent: Wednesday, December 17, 2014 11:04 AM To: Shuai Zheng; 'Sun, Rui'; user@spark.apache.org Subject: RE: Control default partition when load a RDD from HDFS Why not is a good option to create a RDD per each 200Mb file and then apply the pre-calculations before merging them? I think the partitions per RDD must be transparent to the pre-calculations, and not to set them fixed to optimize the spark maps/reduces processes. De: Shuai Zheng [mailto:szheng.c...@gmail.com] Enviado el: miércoles, 17 de diciembre de 2014 16:01 Para: 'Sun, Rui'; user@spark.apache.org Asunto: RE: Control default partition when load a RDD from HDFS Nice, that is the answer I want. Thanks! From: Sun, Rui [mailto:rui@intel.com] Sent: Wednesday, December 17, 2014 1:30 AM To: Shuai Zheng; user@spark.apache.org Subject: RE: Control default partition when load a RDD from HDFS Hi, Shuai, How did you turn off the file split in Hadoop? I guess you might have implemented a customized FileInputFormat which overrides isSplitable() to return FALSE. If you do have such FileInputFormat, you can simply pass it as a constructor parameter to HadoopRDD or NewHadoopRDD in Spark. From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: Wednesday, December 17, 2014 4:16 AM To: user@spark.apache.org Subject: Control default partition when load a RDD from HDFS Hi All, My application load 1000 files, each file from 200M a few GB, and combine with other data to do calculation. Some pre-calculation must be done on each file level, then after that, the result need to combine to do further calculation. In Hadoop, it is simple because I can turn-off the file split for input format (to enforce each file will go to same mapper), then I will do the file level calculation in mapper and pass result to reducer. But in spark, how can I do it? Basically I want to make sure after I load these files into RDD, it is partitioned by file (not split file and also no merge there), so I can call mapPartitions. Is it any way I can control the default partition when I load the RDD? This might be the default behavior that spark do the partition (partitioned by file when first time load the RDD), but I cant find any document to support my guess, if not, can I enforce this kind of partition? Because the total file size is bigger, I dont want to re-partition in the code. Regards, Shuai _ Disclaimer: http://disclaimer.agbar.com
RE: Control default partition when load a RDD from HDFS
Nice, that is the answer I want. Thanks! From: Sun, Rui [mailto:rui@intel.com] Sent: Wednesday, December 17, 2014 1:30 AM To: Shuai Zheng; user@spark.apache.org Subject: RE: Control default partition when load a RDD from HDFS Hi, Shuai, How did you turn off the file split in Hadoop? I guess you might have implemented a customized FileInputFormat which overrides isSplitable() to return FALSE. If you do have such FileInputFormat, you can simply pass it as a constructor parameter to HadoopRDD or NewHadoopRDD in Spark. From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: Wednesday, December 17, 2014 4:16 AM To: user@spark.apache.org Subject: Control default partition when load a RDD from HDFS Hi All, My application load 1000 files, each file from 200M - a few GB, and combine with other data to do calculation. Some pre-calculation must be done on each file level, then after that, the result need to combine to do further calculation. In Hadoop, it is simple because I can turn-off the file split for input format (to enforce each file will go to same mapper), then I will do the file level calculation in mapper and pass result to reducer. But in spark, how can I do it? Basically I want to make sure after I load these files into RDD, it is partitioned by file (not split file and also no merge there), so I can call mapPartitions. Is it any way I can control the default partition when I load the RDD? This might be the default behavior that spark do the partition (partitioned by file when first time load the RDD), but I can't find any document to support my guess, if not, can I enforce this kind of partition? Because the total file size is bigger, I don't want to re-partition in the code. Regards, Shuai
RE: Control default partition when load a RDD from HDFS
Why not is a good option to create a RDD per each 200Mb file and then apply the pre-calculations before merging them? I think the partitions per RDD must be transparent to the pre-calculations, and not to set them fixed to optimize the spark maps/reduces processes. De: Shuai Zheng [mailto:szheng.c...@gmail.com] Enviado el: miércoles, 17 de diciembre de 2014 16:01 Para: 'Sun, Rui'; user@spark.apache.org Asunto: RE: Control default partition when load a RDD from HDFS Nice, that is the answer I want. Thanks! From: Sun, Rui [mailto:rui@intel.com] Sent: Wednesday, December 17, 2014 1:30 AM To: Shuai Zheng; user@spark.apache.org Subject: RE: Control default partition when load a RDD from HDFS Hi, Shuai, How did you turn off the file split in Hadoop? I guess you might have implemented a customized FileInputFormat which overrides isSplitable() to return FALSE. If you do have such FileInputFormat, you can simply pass it as a constructor parameter to HadoopRDD or NewHadoopRDD in Spark. From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: Wednesday, December 17, 2014 4:16 AM To: user@spark.apache.orgmailto:user@spark.apache.org Subject: Control default partition when load a RDD from HDFS Hi All, My application load 1000 files, each file from 200M - a few GB, and combine with other data to do calculation. Some pre-calculation must be done on each file level, then after that, the result need to combine to do further calculation. In Hadoop, it is simple because I can turn-off the file split for input format (to enforce each file will go to same mapper), then I will do the file level calculation in mapper and pass result to reducer. But in spark, how can I do it? Basically I want to make sure after I load these files into RDD, it is partitioned by file (not split file and also no merge there), so I can call mapPartitions. Is it any way I can control the default partition when I load the RDD? This might be the default behavior that spark do the partition (partitioned by file when first time load the RDD), but I can't find any document to support my guess, if not, can I enforce this kind of partition? Because the total file size is bigger, I don't want to re-partition in the code. Regards, Shuai Disclaimer: http://disclaimer.agbar.com
RE: Control default partition when load a RDD from HDFS
Hi, Shuai, How did you turn off the file split in Hadoop? I guess you might have implemented a customized FileInputFormat which overrides isSplitable() to return FALSE. If you do have such FileInputFormat, you can simply pass it as a constructor parameter to HadoopRDD or NewHadoopRDD in Spark. From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: Wednesday, December 17, 2014 4:16 AM To: user@spark.apache.org Subject: Control default partition when load a RDD from HDFS Hi All, My application load 1000 files, each file from 200M - a few GB, and combine with other data to do calculation. Some pre-calculation must be done on each file level, then after that, the result need to combine to do further calculation. In Hadoop, it is simple because I can turn-off the file split for input format (to enforce each file will go to same mapper), then I will do the file level calculation in mapper and pass result to reducer. But in spark, how can I do it? Basically I want to make sure after I load these files into RDD, it is partitioned by file (not split file and also no merge there), so I can call mapPartitions. Is it any way I can control the default partition when I load the RDD? This might be the default behavior that spark do the partition (partitioned by file when first time load the RDD), but I can't find any document to support my guess, if not, can I enforce this kind of partition? Because the total file size is bigger, I don't want to re-partition in the code. Regards, Shuai