Hmm, I've seen mention of SymLink, but I don't yet grasp how it works or how it applies to selecting files to process. Also, I don't have much control over how the data gets to the bucket I end up reading from, hence the need for powerful file selection.
Could you point me to some SymLink documentation or an example so I can give it a try?

Many thanks,
Avram

On Jan 24, 2011, at 3:03 PM, Edward Capriolo wrote:

> On Mon, Jan 24, 2011 at 5:58 PM, Avram Aelony <avramael...@eharmony.com>
> wrote:
>> Hi,
>>
>> I really like the virtual column feature in 0.7 that allows me to request
>> INPUT__FILE__NAME and see the names of files that are being acted on.
>>
>> Because I can see the files that are being read, I see that I am spending
>> time querying many, many very large files, most of which I do not need to
>> process because these extra files are in the same s3 bucket location that
>> contains the files I need.
>>
>> The files I do need to process represent only a subset of all files in the
>> bucket. Nevertheless, the files I am interested in are quite large, and
>> large enough to make copying to hdfs unwieldy.
>>
>> Since I know the files I want to process by name before the scan of all
>> files, can I be more efficient and only process a selection of files from a
>> bucket, avoiding those I don't?
>>
>>
>> I guess I am still looking for something
>> like https://issues.apache.org/jira/browse/HIVE-951
>> I tried sending this message to the dev list initially, but since I haven't
>> seen a response yet, perhaps this list is more appropriate.
>>
>> Any suggestions or update on HIVE-951?
>>
>>
>> Thanks,
>> Avram
>
> We do have the SymLink input format. It is a little more work than
> HIVE-951 but accomplishes roughly the same thing.
>
> Edward