Re: Filtering out files in a bucket (update on HIVE-951)

Edward Capriolo Mon, 24 Jan 2011 15:03:42 -0800

On Mon, Jan 24, 2011 at 5:58 PM, Avram Aelony <avramael...@eharmony.com> wrote:
> Hi,
>
> I really like the virtual column feature in 0.7 that allows me to request 
> INPUT__FILE__NAME and see the names of files that are being acted on.
>
> Because I can see the files that are being read, I see that I am spending 
> time querying many, many very large files, most of which I do not need to 
> process because these extra files are in the same s3 bucket location that 
> contains the files I need.
>
> The files I do need to process represent only a subset of all files in the 
> bucket. Nevertheless, the files I am interested in are quite large, and large 
> enough to make copying to hdfs unwieldy.
>
> Since I know the files I want to process by name before the scan of all 
> files, can I be more efficient and only process a selection of files from a 
> bucket avoiding those I don't?
>
>
> I guess I am still looking for something like
> https://issues.apache.org/jira/browse/HIVE-951
> I tried sending this message to the dev list initially, but since I haven't 
> seen a response yet, perhaps this list is more appropriate.
>
> Any suggestions or updates on HIVE-951?
>
>
> Thanks,
> Avram


We do have the symlink input format (SymlinkTextInputFormat). It is a
little more work than HIVE-951 but accomplishes roughly the same thing.
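
A rough sketch of the extra work involved, in Python. With the symlink
input format, the table location holds small text files that list the real
data files, one path per line, and only the listed files are read. The
bucket, file names, table name, column, and manifest directory below are
all hypothetical, and whether s3 paths are accepted in the manifest may
depend on the Hive version and filesystem configuration, so treat this as
an illustration rather than a tested recipe.

```python
# Sketch only: every bucket, file, table, and directory name here is hypothetical.

def build_manifest(prefix, wanted_names):
    """Return symlink-file text: one full data-file path per line."""
    base = prefix.rstrip("/")
    return "".join(f"{base}/{name}\n" for name in wanted_names)


def build_ddl(table, columns, manifest_dir):
    """Return a CREATE EXTERNAL TABLE statement whose LOCATION is the
    directory holding the manifest file(s), not the data itself."""
    cols = ", ".join(f"{name} {type_}" for name, type_ in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} ({cols})\n"
        "STORED AS\n"
        "  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'\n"
        "  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'\n"
        f"LOCATION '{manifest_dir}';"
    )


# Only the files named here would be scanned; the rest of the bucket is ignored.
manifest = build_manifest(
    "s3n://example-bucket/logs/",
    ["events-a.log", "events-b.log"],
)
ddl = build_ddl(
    "wanted_logs",
    [("line", "STRING")],
    "/user/hive/symlinks/wanted_logs",
)

print(manifest, end="")  # save this text as a file inside the manifest directory
print(ddl)               # run this statement in Hive once the manifest is in place
```

The manifest text would be written to a file under the table's LOCATION
before querying; changing the selection of files then only means rewriting
that small file, not moving the data.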

Edward
