Re: Filtering out files in a bucket (update on HIVE-951)

Avram Aelony Mon, 24 Jan 2011 15:12:58 -0800

Hmmm, I've seen SymLink mentioned, but I don't yet grasp how it works or how 
it applies to selecting files to process.  Also, I don't have much control 
over how the data gets to the bucket I end up reading from, hence the need for 
a powerful way to select files.


Could you point me to some SymLink documentation or an example so I can give it 
a try?

Many thanks,
Avram


On Jan 24, 2011, at 3:03 PM, Edward Capriolo wrote:

> On Mon, Jan 24, 2011 at 5:58 PM, Avram Aelony <avramael...@eharmony.com> 
> wrote:
>> Hi,
>> 
>> I really like the virtual column feature in 0.7 that allows me to request 
>> INPUT__FILE__NAME and see the names of files that are being acted on.
>> 
>> Because I can see the files that are being read, I see that I am spending 
>> time querying many, many very large files, most of which I do not need to 
>> process because these extra files are in the same s3 bucket location that 
>> contains the files I need.
>> 
>> The files I do need to process represent only a subset of all files in the 
>> bucket. Nevertheless, the files I am interested in are quite large, 
>> large enough to make copying to hdfs unwieldy.
>> 
>> Since I know the files I want to process by name before the scan of all 
>> files, can I be more efficient and only process a selection of files from a 
>> bucket avoiding those I don't?
>> 
>> 
>> I guess I am still looking for something like 
>> https://issues.apache.org/jira/browse/HIVE-951
>> I tried sending this message to the dev list initially, but since I haven't 
>> seen a response yet, perhaps this list is more appropriate.
>> 
>> Any suggestions or updates on HIVE-951?
>> 
>> 
>> Thanks,
>> Avram
> 
> We do have the SymLink input format. It is a little more work than
> HIVE-951 but accomplishes roughly the same thing.
> 
> Edward
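
For anyone following along, here is a rough sketch of how the SymLink input format mentioned above might be applied to the "select files by name" problem. The idea, as I understand it, is that the table's LOCATION holds small text files listing one data-file path per line, and Hive reads only the listed files instead of scanning the whole directory. The bucket URI, file names, table name, columns and manifest directory below are all hypothetical placeholders, not taken from this thread, and whether s3 paths resolve correctly in a given Hive version is untested here.

```python
# Sketch only: every bucket, file, table and column name here is a
# hypothetical placeholder chosen for illustration.


def build_symlink_manifest(bucket_uri, wanted_files):
    """Return the text of a symlink file: one full data-file path per line."""
    base = bucket_uri.rstrip("/")
    return "".join(f"{base}/{name}\n" for name in wanted_files)


def build_table_ddl(table, columns, manifest_dir):
    """Return HiveQL for an external table read through SymlinkTextInputFormat.

    manifest_dir is the directory that holds the symlink file(s),
    not the directory that holds the data itself.
    """
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} ({cols})\n"
        "STORED AS\n"
        "  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'\n"
        "  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'\n"
        f"LOCATION '{manifest_dir}';"
    )


# Only the files known by name in advance go into the manifest.
manifest = build_symlink_manifest(
    "s3n://example-bucket/logs/",
    ["events-2011-01-20.log", "events-2011-01-21.log"],
)

ddl = build_table_ddl(
    "selected_events",
    [("ts", "STRING"), ("payload", "STRING")],
    "/user/hive/symlinks/selected_events",
)

print(manifest)
print(ddl)
```

The manifest text would then be saved as a file inside the manifest directory before querying the table. Selecting on INPUT__FILE__NAME afterwards should be a quick way to check that only the listed files are being read.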
