RE: Loading data from ranges of ordered subdirs

2013-06-10 Thread Rodrick Megraw
Thank you for the suggestions. Writing a custom LoadFunc seems like a valid solution for me, given that I don't currently have Hive or HCatalog installed and I'm working on more of an ad-hoc problem at this point. HCatalog seems like a good solution for doing this type of thing on a repeated

Re: running pig from eclipse on hadoop cluster

2013-06-10 Thread Weiping Qu
Hi, forget the question raised before; it's solved. Hi, I am currently running Pig from Eclipse on a Hadoop cluster. I added the Hadoop conf location to the runtime configuration, but the MapReduce jobs failed as the built class files of Pig cannot be called by Hadoop. I added the class file locat

Re: Loading data from ranges of ordered subdirs

2013-06-10 Thread Pradeep Gollakota
There are two possibilities that come to mind. 1. Write a custom LoadFunc in which you can handle these regular expressions. *Not the most ideal solution* 2. Use HCatalog. The example they have in their documentation seems to fit your use case perfectly. (http://incubator.apache.org/hcatalog/docs/r0.

running pig from eclipse on hadoop cluster

2013-06-10 Thread Weiping Qu
Hi, I am currently running Pig from Eclipse on a Hadoop cluster. I added the Hadoop conf location to the runtime configuration, but the MapReduce jobs failed as the built class files of Pig cannot be called by Hadoop. I added the class file location to the classpath, but it did not work. Any hints?

Loading data from ranges of ordered subdirs

2013-06-10 Thread Rodrick Megraw
Let's say I have my input data from the past 12 months organized into subdirs by date: /data/2012-06-10 /data/2012-06-11 ... /data/2013-06-09 And now say that I want to run a Pig script to process data from a range of dates within the last 12 months, say 2012-11-07 through 2013-05-26. The regex
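For reference, a minimal sketch of expressing this particular range with Hadoop glob syntax in the LOAD path instead of a custom loader, assuming your Pig/Hadoop version accepts {} alternation and [] character classes there; the field names and delimiter are made up for illustration:

    -- hypothetical: load 2012-11-07 through 2013-05-26 from the /data/YYYY-MM-DD subdirs
    raw = LOAD '/data/{2012-11-0[7-9],2012-11-[12][0-9],2012-11-30,2012-12-*,2013-0[1-4]-*,2013-05-0[1-9],2013-05-1[0-9],2013-05-2[0-6]}'
          USING PigStorage('\t')
          AS (f1:chararray, f2:chararray);

The glob has to be hand-built for each range, which is why a custom LoadFunc or HCatalog partitions are the cleaner options for repeated use.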

Re: problems with .gz

2013-06-10 Thread Niels Basjes
Bzip2 is only splittable in newer versions of hadoop. On Jun 10, 2013 10:28 PM, "Alan Crosswell" wrote: > Ignore what I said and see > https://forums.aws.amazon.com/thread.jspa?threadID=51232 > > bzip2 was documented somewhere as being splittable but this appears to not > actually be implemented

Re: problems with .gz

2013-06-10 Thread Alan Crosswell
Ignore what I said and see https://forums.aws.amazon.com/thread.jspa?threadID=51232 bzip2 was documented somewhere as being splittable, but this appears not to actually be implemented, at least in AWS S3. /a On Mon, Jun 10, 2013 at 12:41 PM, Alan Crosswell wrote: > Suggest that if you have a cho

Re: GROUP BY Issue

2013-06-10 Thread Gourav Sengupta
Hi Shahab, It would be great if someone could delete this email from the PIG group. I am aware of this mistake and had posted the issue to the HIVE group almost immediately. Regards, Gourav On Mon, Jun 10, 2013 at 5:28 PM, Shahab Yunus wrote: > Gourav, this is not a HIVE mailing list. It is PIG's. > > R

Re: problems with .gz

2013-06-10 Thread Alan Crosswell
Dunno, I'm guessing it would since each file is a different mapper. On Mon, Jun 10, 2013 at 1:12 PM, William Oberman wrote: > I'm using gzip as I had a huge S3 bucket of uncompressed files, and > s3distcp only supported {gz, lzo, snappy}. > > I haven't ever done this, but can I mix/match files?

Re: problems with .gz

2013-06-10 Thread William Oberman
I'm using gzip as I had a huge S3 bucket of uncompressed files, and s3distcp only supported {gz, lzo, snappy}. I haven't ever done this, but can I mix/match files? My backup processes add files to these buckets, so I could upload new files as *.bz. But then I'd have some files as *.gz, and other

Re: problems with .gz

2013-06-10 Thread Alan Crosswell
I suggest that, if you have a choice, you use bzip2 compression instead of gzip: bzip2 is block-based, so Pig can split the reading of a large bzipped file across multiple mappers, while gzip can't be split that way. On Mon, Jun 10, 2013 at 12:06 PM, William Oberman wrote: > I still don't fully understan
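If recompressing is an option, here is a rough sketch of a one-time Pig pass-through job that rewrites the gzipped input as bzip2 so later jobs can split it; the paths are made up, and it assumes plain-text lines and that the mapred.output.compress* job properties are honored by your Pig/Hadoop version:

    -- hypothetical one-off recompression job: .gz in, .bz2 out
    SET mapred.output.compress true;
    SET mapred.output.compression.codec org.apache.hadoop.io.compress.BZip2Codec;
    lines = LOAD 's3://my-bucket/raw/*.gz' USING TextLoader() AS (line:chararray);
    STORE lines INTO 's3://my-bucket/raw-bz2' USING PigStorage();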

Re: GROUP BY Issue

2013-06-10 Thread Shahab Yunus
Gourav, this is not a HIVE mailing list. It is PIG's. Regards, Shahab On Mon, Jun 10, 2013 at 10:39 AM, Gourav Sengupta wrote: > Hi, > > On running the following query I am getting multiple records with same > value of F1 > > SELECT F1, COUNT(*) > FROM > ( > SELECT F1, F2, COUNT(*) > FROM TABLE

Re: problems with .gz

2013-06-10 Thread William Oberman
I still don't fully understand (and am still debugging), but I have a "problem file" and a theory. The file has a "corrupt line" that is a huge block of null characters followed by a "\n" (other lines are json followed by "\n"). I'm thinking that's a problem with my cassandra -> s3 process, but i
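If that theory holds, one possible workaround on the Pig side (the path and the exact corruption pattern are assumptions) is to load the raw lines and drop anything empty or made entirely of NUL characters before the JSON parsing step:

    -- hypothetical: skip blank lines and lines that are nothing but NUL characters
    raw   = LOAD 's3://my-bucket/data/*.gz' USING TextLoader() AS (line:chararray);
    clean = FILTER raw BY (line IS NOT NULL) AND (SIZE(line) > 0) AND (NOT line MATCHES '\\u0000+');
    -- ... continue with the usual JSON parsing on 'clean'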

GROUP BY Issue

2013-06-10 Thread Gourav Sengupta
Hi, on running the following query I am getting multiple records with the same value of F1: SELECT F1, COUNT(*) FROM ( SELECT F1, F2, COUNT(*) FROM TABLE1 GROUP BY F1, F2 ) a GROUP BY F1; As far as I understand, the number of duplicate records depends on the number of reducers. Replicating the test

Re: save several 64MB files in Pig Latin

2013-06-10 Thread Bertrand Dechoux
The Hadoop wiki gives a brief explanation: http://wiki.apache.org/hadoop/HowManyMapsAndReduces The logic is indeed the same for Pig because, under the hood, Pig will generate and optimize a workflow of MapReduce jobs. With a splittable 500MB file, if the provided number of maps (10) is accepted,

Re: save several 64MB files in Pig Latin

2013-06-10 Thread Ruslan Al-Fakikh
Hi Pedro, Yes, Pig Latin is always compiled to MapReduce. Usually you don't have to specify the number of mappers (I am not sure whether you really can). If you have a file of 500MB and it is splittable, then the number of mappers is automatically equal to 500MB / 64MB (the block size), which is around
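To make the arithmetic concrete: 500MB / 64MB is roughly 8, so a splittable 500MB input typically becomes about 8 map tasks with no setting on your part. What you can set from a Pig script is the reduce-side parallelism; a small sketch, with a hypothetical path and field names:

    SET default_parallel 8;                                    -- default reducer count for the script
    data    = LOAD '/some/500mb/input.txt' AS (k:chararray, v:chararray);  -- ~8 map tasks from ~8 splits
    grouped = GROUP data BY k PARALLEL 8;                      -- or override per operator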

Re: save several 64MB files in Pig Latin

2013-06-10 Thread Pedro Sá da Costa
Yes, I understand the previous answers now. The reason for my question is that I was trying to "split" a file with Pig Latin by loading the file and writing portions of it back to HDFS. From both replies, it seems that Pig Latin uses MapReduce to compute the scripts, correct? And in map r
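(Yes, Pig scripts compile down to MapReduce jobs, as noted above.) That also suggests a very simple way to get the "split" effect, sketched here assuming splittable plain-text input and made-up paths: a plain load-and-store runs as a map-only job, so each input split becomes its own part file of roughly one block:

    -- hypothetical: rewrite a ~500MB text file as ~8 part files of about one block each
    big = LOAD '/data/big_file.txt' USING TextLoader() AS (line:chararray);
    STORE big INTO '/data/big_file_parts' USING PigStorage();
    -- map-only: ~8 input splits -> part-m-00000 .. part-m-00007 under /data/big_file_parts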

Re: save several 64MB files in Pig Latin

2013-06-10 Thread Bertrand Dechoux
I wasn't clear. Specifying the size of the files is not your real aim, I guess, but you think that's what is needed in order to solve a problem that we don't know about. 500MB is not a really big file in itself and is not an issue for HDFS and MapReduce. There is no absolute way to know how muc