Hello again, any comments on this?
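For context, my UDF currently follows the getCacheFiles() pattern mentioned in the quoted mail below; it looks roughly like this (just a sketch, the HDFS path, class name and field names are placeholders, not my real code):

=== java udf (sketch) ===
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class MyUDF extends EvalFunc<String> {
    private Map<String, String> lookup;

    // Ask Pig to ship the small HDFS file to every task via the
    // distributed cache. The part after '#' is the local symlink name.
    // (the /user/... path here is a placeholder)
    @Override
    public List<String> getCacheFiles() {
        return Arrays.asList("/user/anastasis/myfile.txt#myfile.txt");
    }

    @Override
    public String exec(Tuple input) throws IOException {
        if (lookup == null) {
            // Load the cached file once per task, via its symlink name.
            lookup = new HashMap<String, String>();
            BufferedReader r = new BufferedReader(new FileReader("./myfile.txt"));
            String line;
            while ((line = r.readLine()) != null) {
                String[] parts = line.split("\\|");
                lookup.put(parts[0], parts.length > 1 ? parts[1] : "");
            }
            r.close();
        }
        String key = (String) input.get(0);
        return lookup.get(key);
    }
}
=== end sketch ===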
Thanks,
Anastasis

On 27 Sep 2013, at 5:36 PM, Anastasis Andronidis <[email protected]> wrote:

> Hello,
>
> I am working on a very small project for my university and I have a small
> cluster with 2 worker nodes and 1 master node. I'm using Pig to do some
> calculations and I have a question regarding small files.
>
> I have a UDF that reads a small input (around 200k) and correlates it with
> data from HDFS. My first approach was to upload the small file to HDFS and
> later, using getCacheFiles(), access it from my UDF.
>
> Later, though, I needed to change things in this small file, which meant
> deleting the file on HDFS, re-uploading it and re-running Pig. Since I need
> to change this small file frequently, I wanted to bypass HDFS (because all
> those read + write + read in Pig again are very, very slow over multiple
> iterations of my script), so what I did was:
>
> === pig script ===
> %declare MYFILE `cat myfile.txt | awk 'BEGIN {ORS="|"; RS="\r\n"} {print $0}'`
>
> .... MyUDF( line, '$MYFILE') .....
>
> In the beginning, it worked great. But later (when my file grew larger than
> 100KB) Pig got stuck and I had to kill it:
>
> 2013-09-27 16:14:47,722 [main] INFO
> org.apache.pig.tools.parameters.PreprocessorContext - Executing command : cat
> myfile.txt | awk 'BEGIN {ORS="|"; RS="\r\n"} {print $0}'
> ^C2013-09-27 16:15:28,102 [main] ERROR org.apache.pig.Main - ERROR 2999:
> Unexpected internal error. Error executing shell command: cat myfile.txt |
> awk 'BEGIN {ORS="|"; RS="\r\n"} {print $0}'. Command exit with exit code of
> 130
>
> (btw, is this a bug or something? Should it hang like that?)
>
> How can I manage small files in such cases, so that I don't need to re-upload
> everything to HDFS every time and can make my iterations faster?
>
> Thanks,
> Anastasis
