RE: Loading data from ranges of ordered subdirs

2013-06-10 Thread Rodrick Megraw
Thank you for the suggestions.

Writing a custom LoadFunc seems like a valid solution for me, given that I 
don't currently have Hive or HCatalog installed and I'm working on more of an 
ad-hoc problem at this point. 

HCatalog seems like a good solution for doing this type of thing on a repeated 
basis and with many data sets. I will read up on it for sure.

> Date: Mon, 10 Jun 2013 17:02:37 -0400
> Subject: Re: Loading data from ranges of ordered subdirs
> From: pradeep...@gmail.com
> To: user@pig.apache.org
> 
> There are two possibilities that come to mind.
> 
> 1. Write a custom LoadFunc in which you can handle these regular
> expressions. *Not the most ideal solution*
> 2. Use HCatalog. The example they have in their documentation seems to fit
> your use case perfectly. (http://incubator.apache.org/hcatalog/docs/r0.5.0/).
> 
> There might be other ways to do this, but I'm not aware of them.
> 
> Hope this helps.
> 
> 
> On Mon, Jun 10, 2013 at 4:54 PM, Rodrick Megraw wrote:
> 
> > Let's say I have my input data from the past 12 months organized into
> > subdirs by date:
> >
> > /data/2012-06-10
> > /data/2012-06-11
> > ...
> > /data/2013-06-09
> >
> > And now say that I want to run a Pig script to process data from a range
> > of dates within the last 12 months, say 2012-11-07 through 2013-05-26. The
> > regex that I could specify for this date range is going to get quite
> > complicated.
> >
> > Is there a way that I can get my Pig script to load data from such a range
> > without a regex?
> >
> > I could load all the data in /data/*, and then FILTER by the date field in
> > each record, but this is not desirable if the range of dates is small
> > compared to the entire dataset.
> >
  

Re: Loading data from ranges of ordered subdirs

2013-06-10 Thread Pradeep Gollakota
There are two possibilities that come to mind.

1. Write a custom LoadFunc in which you can handle these regular
expressions. *Not the most ideal solution*
2. Use HCatalog. The example they have in their documentation seems to fit
your use case perfectly. (http://incubator.apache.org/hcatalog/docs/r0.5.0/).

There might be other ways to do this, but I'm not aware of them.

Hope this helps.


On Mon, Jun 10, 2013 at 4:54 PM, Rodrick Megraw wrote:

> Let's say I have my input data from the past 12 months organized into
> subdirs by date:
>
> /data/2012-06-10
> /data/2012-06-11
> ...
> /data/2013-06-09
>
> And now say that I want to run a Pig script to process data from a range
> of dates within the last 12 months, say 2012-11-07 through 2013-05-26. The
> regex that I could specify for this date range is going to get quite
> complicated.
>
> Is there a way that I can get my Pig script to load data from such a range
> without a regex?
>
> I could load all the data in /data/*, and then FILTER by the date field in
> each record, but this is not desirable if the range of dates is small
> compared to the entire dataset.
>
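
For the date range in the question quoted above, one approach that was not raised in this thread is to generate the list of directory names outside Pig and hand the script a single brace glob. The sketch below is only an illustration under a few assumptions: that the LOAD path accepts Hadoop-style `{a,b,c}` alternation, that the subdirs are named exactly YYYY-MM-DD as in the question, and that every day in the range actually has a directory (how missing directories behave has not been checked here). The function name `date_range_glob` is hypothetical.

```python
from datetime import date, timedelta


def date_range_glob(base_dir, start, end):
    """Build a brace glob naming every daily subdir from start to end, inclusive.

    Assumes subdirs are named YYYY-MM-DD directly under base_dir.
    """
    if end < start:
        raise ValueError("end date precedes start date")
    days = (end - start).days + 1
    # date.isoformat() yields YYYY-MM-DD, matching the layout in the question.
    names = [(start + timedelta(days=i)).isoformat() for i in range(days)]
    return "%s/{%s}" % (base_dir.rstrip("/"), ",".join(names))


if __name__ == "__main__":
    # Date range taken from the question in this thread.
    print(date_range_glob("/data", date(2012, 11, 7), date(2013, 5, 26)))
```

The printed string could then be passed to the script as a parameter (for example with `pig -param`) and used in the LOAD statement in place of a hand-written pattern. Compared with loading /data/* and filtering, this keeps the input limited to the requested days, at the cost of a long path string for wide ranges.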