Hi, I'm using Pig 0.8.1-cdh3u5. Is there any way to use the distributed cache inside Pig?
My problem is this: I have lots of small files in HDFS, say 10 of them. Each file contains more than one row, but I need only a single row from each, and there is no key relating the files to one another. So I filter each file down to the row I need and then combine the results with a CROSS (since there is no key to join on). This is my current workaround:

a = LOAD 'smallFile1';   -- e.g. 1,000 rows
b = FILTER a BY myrow == 'filter by exp1';
c = LOAD 'smallFile2';   -- e.g. 30,000 rows
d = FILTER c BY myrow2 == 'filter by exp2';
e = CROSS b, d;
...
f = LOAD 'bigFile';      -- e.g. 50 million rows
g = CROSS e, f;

But its performance isn't good enough. If I could use the distributed cache from a Pig script, I could hold the small filtered files in memory and look them up there instead. What do you suggest? Is there a more efficient way to do this?

Thanks, best regards...
--
*BURAK ISIKLI* | http://burakisikli.wordpress.com
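(For reference, Pig does have a built-in feature that uses the distributed cache: the fragment-replicate join, requested with USING 'replicated', which loads the small right-hand relation into memory on every mapper. Since there is no natural join key here, one possible workaround is to attach a constant key to both sides and replace the final CROSS with a replicated join. This is only a sketch; the relation names k, e2, f2 are illustrative, and the FOREACH syntax may need adjusting for Pig 0.8:

e  = CROSS b, d;                                 -- still tiny after filtering
e2 = FOREACH e GENERATE 1 AS k, *;               -- add a constant join key
f  = LOAD 'bigFile';
f2 = FOREACH f GENERATE 1 AS k, *;
g  = JOIN f2 BY k, e2 BY k USING 'replicated';   -- small side shipped via distributed cache

The replicated side must fit in each mapper's memory, which should hold easily for a handful of filtered single-row relations.)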
