Hello again, any comments on this?
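For context, my UDF currently follows the getCacheFiles() pattern mentioned in the quoted mail below; it looks roughly like this (just a sketch, the HDFS path, class name and field names are placeholders, not my real code):

=== java udf (sketch) ===
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class MyUDF extends EvalFunc<String> {
    private Map<String, String> lookup;

    // Ask Pig to ship the small HDFS file to every task via the
    // distributed cache. The part after '#' is the local symlink name.
    // (the /user/... path here is a placeholder)
    @Override
    public List<String> getCacheFiles() {
        return Arrays.asList("/user/anastasis/myfile.txt#myfile.txt");
    }

    @Override
    public String exec(Tuple input) throws IOException {
        if (lookup == null) {
            // Load the cached file once per task, via its symlink name.
            lookup = new HashMap<String, String>();
            BufferedReader r = new BufferedReader(new FileReader("./myfile.txt"));
            String line;
            while ((line = r.readLine()) != null) {
                String[] parts = line.split("\\|");
                lookup.put(parts[0], parts.length > 1 ? parts[1] : "");
            }
            r.close();
        }
        String key = (String) input.get(0);
        return lookup.get(key);
    }
}
=== end sketch ===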
Thanks,
Anastasis

On 27 Sep 2013, at 5:36 PM, Anastasis Andronidis <[email protected]> wrote:

> Hello,
>
> I am working on a very small project for my university and I have a small
> cluster with 2 worker nodes and 1 master node. I'm using Pig to do some
> calculations and I have a question regarding small files.
>
> I have a UDF that reads a small input (around 200k) and correlates it with
> data from HDFS. My first approach was to upload the small file to HDFS and
> later, using getCacheFiles(), access it from my UDF.
>
> Later, though, I needed to change things in this small file, which meant
> deleting the file on HDFS, re-uploading it and re-running Pig. Since I need
> to change this small file frequently, I wanted to bypass HDFS (because all
> those read + write + read in Pig again are very, very slow over multiple
> iterations of my script), so what I did was:
>
> === pig script ===
> %declare MYFILE `cat myfile.txt | awk 'BEGIN {ORS="|"; RS="\r\n"} {print $0}'`
>
> .... MyUDF( line, '$MYFILE') .....
>
> In the beginning, it worked great. But later (when my file grew larger than
> 100KB) Pig got stuck and I had to kill it:
>
> 2013-09-27 16:14:47,722 [main] INFO
> org.apache.pig.tools.parameters.PreprocessorContext - Executing command : cat
> myfile.txt | awk 'BEGIN {ORS="|"; RS="\r\n"} {print $0}'
> ^C2013-09-27 16:15:28,102 [main] ERROR org.apache.pig.Main - ERROR 2999:
> Unexpected internal error. Error executing shell command: cat myfile.txt |
> awk 'BEGIN {ORS="|"; RS="\r\n"} {print $0}'. Command exit with exit code of
> 130
>
> (btw, is this a bug or something? Should it hang like that?)
>
> How can I manage small files in such cases, so that I don't need to re-upload
> everything to HDFS every time and can make my iterations faster?
>
> Thanks,
> Anastasis
