Hello,
I am working on a very small project for my university and I have a small
cluster with 2 worker nodes and 1 master node. I'm using Pig to do some
calculations and I have a question regarding small files.
I have a UDF that reads a small input file (around 200 KB) and correlates it
with the data coming from HDFS. My first approach was to upload the small file
to HDFS and then access it in my UDF via getCacheFiles().
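For reference, the distributed-cache version of the UDF looks roughly like
this (simplified; the path, symlink name and lookup logic are just
placeholders for what I actually do):
=== UDF, getCacheFiles() version (simplified) ===
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class MyUDF extends EvalFunc<String> {
    private Map<String, String> lookup;

    @Override
    public List<String> getCacheFiles() {
        // Ship the small HDFS file to each task's working dir as "myfile"
        // (path is a placeholder)
        List<String> files = new ArrayList<String>();
        files.add("/user/anastasis/myfile.txt#myfile");
        return files;
    }

    @Override
    public String exec(Tuple input) throws IOException {
        if (lookup == null) {
            // Lazily load the cached file once per task
            lookup = new HashMap<String, String>();
            BufferedReader in = new BufferedReader(new FileReader("./myfile"));
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t");
                lookup.put(parts[0], parts[1]);
            }
            in.close();
        }
        // Placeholder for the real correlation logic
        return lookup.get((String) input.get(0));
    }
}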
Later, though, I needed to change things in this small file, which meant
deleting it from HDFS, re-uploading it and re-running Pig. Since I need to
change this small file frequently, I wanted to bypass HDFS (all that
read + write + read again in Pig is very slow over multiple iterations of my
script), so what I did was:
=== pig script ===
%declare MYFILE `cat myfile.txt | awk 'BEGIN {ORS="|"; RS="\r\n"} {print $0}'`
.... MyUDF( line, '$MYFILE') .....
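On the UDF side, '$MYFILE' then arrives as the second argument: the whole
file as one '|'-joined string that I split back into lines inside exec().
Roughly (again simplified, the correlation logic is just a placeholder):
=== UDF, '$MYFILE' version (simplified) ===
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class MyUDF extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        String line = (String) input.get(0);
        // '$MYFILE' is the whole file, joined with '|' by the awk command
        String[] fileLines = ((String) input.get(1)).split("\\|");
        for (String fileLine : fileLines) {
            // Placeholder for the real correlation logic
            if (line.contains(fileLine)) {
                return fileLine;
            }
        }
        return null;
    }
}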
In the beginning, it worked great. But later, once the file grew past about
100 KB, Pig would hang and I had to kill it:
2013-09-27 16:14:47,722 [main] INFO  org.apache.pig.tools.parameters.PreprocessorContext - Executing command : cat myfile.txt | awk 'BEGIN {ORS="|"; RS="\r\n"} {print $0}'
^C2013-09-27 16:15:28,102 [main] ERROR org.apache.pig.Main - ERROR 2999: Unexpected internal error. Error executing shell command: cat myfile.txt | awk 'BEGIN {ORS="|"; RS="\r\n"} {print $0}'. Command exit with exit code of 130
(By the way, is this a bug? Should it hang like that?)
How can I manage small files in cases like this, so that I don't need to
re-upload everything to HDFS every time and my iterations get faster?
Thanks,
Anastasis