Re: recombining split files after data is processed

2015-02-23 Thread Alexander Alten-Lorenz
You could attach the hadoop dfs command via a bootstrap action, or as a custom step:
http://stackoverflow.com/questions/12055595/emr-how-to-join-files-into-one
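
A minimal sketch of attaching the merge as a custom step with the AWS CLI, assuming an EMR release that ships command-runner.jar (older AMIs use script-runner.jar instead); the cluster id and paths below are placeholders:

  aws emr add-steps --cluster-id j-XXXXXXXX \
    --steps 'Type=CUSTOM_JAR,Name=MergeOutput,Jar=command-runner.jar,Args=[bash,-c,"hadoop fs -getmerge /job/output /tmp/merged.txt && hadoop fs -put /tmp/merged.txt /job/merged.txt"]'

Steps run sequentially on the master node by default, so the merge only starts once the processing step has finished.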

BR,
 Alex


 On 23 Feb 2015, at 08:10, Jonathan Aquilina jaquil...@eagleeyet.net wrote:
 
 Thanks Alex. Where would that command be placed: in a mapper, in a reducer,
 or run as a standalone command? Here at work we are looking to use Amazon EMR
 to do our number crunching, and we have access to the master node but not
 really the rest of the cluster. Can this be added as a step to run after the
 initial processing?
 
  
 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T
 On 2015-02-23 08:05, Alexander Alten-Lorenz wrote:
 
 Hi,
  
 You can use a single reducer
 (http://wiki.apache.org/hadoop/HowManyMapsAndReduces) for smaller datasets,
 or 'getmerge': hadoop dfs -getmerge /hdfs/path local_file_name
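 
 For the single-reducer route, a minimal sketch, assuming the job driver goes
 through ToolRunner so the generic -D option is parsed (jar, class and paths
 are placeholders):
 
   hadoop jar my-job.jar com.example.MyDriver -D mapreduce.job.reduces=1 /input /output
 
 (On Hadoop 1.x clusters the property is mapred.reduce.tasks.) With one
 reducer the job emits a single part-r-00000 file, at the cost of funneling
 all reduce work through one task, which is why it only suits smaller
 datasets.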
 
  
 BR,
  Alex
  
 
 On 23 Feb 2015, at 08:00, Jonathan Aquilina jaquil...@eagleeyet.net wrote:
 
 Hey all,
 
 I understand that the purpose of splitting files is to distribute the data
 to multiple core and task nodes in a cluster. My question: after the
 output is complete, is there a way to combine all the parts into a
 single file?
 
  
 -- 
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T



Re: recombining split files after data is processed

2015-02-22 Thread Jonathan Aquilina
 

Thanks Alex. Where would that command be placed: in a mapper, in a reducer,
or run as a standalone command? Here at work we are looking to use Amazon
EMR to do our number crunching, and we have access to the master node but
not really the rest of the cluster. Can this be added as a step to run
after the initial processing? 

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-02-23 08:05, Alexander Alten-Lorenz wrote: 

 Hi, 
 
 You can use a single reducer 
 (http://wiki.apache.org/hadoop/HowManyMapsAndReduces [1]) for smaller 
 datasets, or 'getmerge': hadoop dfs -getmerge /hdfs/path local_file_name 
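 
 On EMR specifically, another option is s3-dist-cp, which can concatenate
 part files that match a regex into one file per capture group; the jar
 path, buckets and pattern here are illustrative placeholders:
 
   hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
     --src s3://my-bucket/job-output/ \
     --dest s3://my-bucket/merged/ \
     --groupBy '.*/(part)-.*'
 
 All matched files sharing the same capture-group value are combined into
 one output file, and the command can be submitted as a regular step after
 the main job.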
 
 BR, 
 Alex 
 
 On 23 Feb 2015, at 08:00, Jonathan Aquilina jaquil...@eagleeyet.net wrote: 
 
 Hey all, 
 
 I understand that the purpose of splitting files is to distribute the data 
 to multiple core and task nodes in a cluster. My question: after the 
 output is complete, is there a way to combine all the parts into a 
 single file? 
 
 -- 
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T
 

Links:
--
[1] http://wiki.apache.org/hadoop/HowManyMapsAndReduces