Thank you for the suggestions. Writing a custom LoadFunc seems like a valid solution for me, given that I don't currently have Hive or HCatalog installed and I'm working on more of an ad-hoc problem at this point.
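For the ad-hoc case, the date range can also be expanded outside of Pig before the script runs. The following is only a minimal sketch: the helper names `date_range_paths` and `as_load_location` are hypothetical, the `/data/YYYY-MM-DD` layout is taken from the original question, and it assumes that Pig's LOAD accepts a comma-separated list of paths, which should be verified against the Pig version in use.

```python
from datetime import date, timedelta


def date_range_paths(start, end, root="/data"):
    """Return one directory path per day from start to end, inclusive.

    Assumes the layout from the question: <root>/YYYY-MM-DD.
    """
    if end < start:
        raise ValueError("end date precedes start date")
    days = (end - start).days + 1
    return [
        "{}/{}".format(root, (start + timedelta(days=i)).isoformat())
        for i in range(days)
    ]


def as_load_location(paths):
    """Join paths with commas (assumed to be accepted by Pig's LOAD)."""
    return ",".join(paths)


if __name__ == "__main__":
    # The range from the original question: 2012-11-07 through 2013-05-26.
    paths = date_range_paths(date(2012, 11, 7), date(2013, 5, 26))
    print(as_load_location(paths))
```

The printed string could then be handed to the script through parameter substitution, for example `pig -param INPUT="$(python paths.py)" script.pig` with `LOAD '$INPUT'` inside the script (both file names are hypothetical). This does not check that each directory actually exists, so missing days would need to be filtered out first.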
HCatalog seems like a good solution for doing this type of thing on a repeated basis and with many data sets. I will read up on it for sure.

> Date: Mon, 10 Jun 2013 17:02:37 -0400
> Subject: Re: Loading data from ranges of ordered subdirs
> From: pradeep...@gmail.com
> To: user@pig.apache.org
>
> There are two possibilities that come to mind.
>
> 1. Write a custom LoadFunc in which you can handle these regular
> expressions. *Not the most ideal solution*
> 2. Use HCatalog. The example they have in their documentation seems to fit
> your use case perfectly. (http://incubator.apache.org/hcatalog/docs/r0.5.0/).
>
> There might be other ways to do this, but I'm not aware of them.
>
> Hope this helps.
>
>
> On Mon, Jun 10, 2013 at 4:54 PM, Rodrick Megraw <remeg...@hotmail.com> wrote:
>
> > Let's say I have my input data from the past 12 months organized into
> > subdirs by date:
> >
> > /data/2012-06-10
> > /data/2012-06-11
> > ...
> > /data/2013-06-09
> >
> > And now say that I want to run a Pig script to process data from a range
> > of dates within the last 12 months, say 2012-11-07 through 2013-05-26. The
> > regex that I could specify for this date range is going to get quite
> > complicated.
> >
> > Is there a way that I can get my Pig script to load data from such a range
> > without a regex?
> >
> > I could load all the data in /data/*, and then FILTER by the date field in
> > each record, but this is not desirable if the range of dates is small
> > compared to the entire dataset.