Hello,

Please forgive what I'm sure is a frequent question, but I have not been able to find a reasonable solution to what must be a very standard issue. My pipelines follow a common pattern: one element retrieves a large data set, performs an expensive computation to create one or more new columns, and then wants to save the expanded data set for downstream elements, which consume a mix of the new columns and the old.
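Concretely, the pattern looks something like the following. This is only a sketch using pyarrow and Parquet; the file names and the "raw_value"/"score" columns are invented for illustration:

    # Minimal sketch of the pattern; names are invented.
    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    table = pq.read_table("features.parquet")   # large persisted data set

    # Stand-in for the expensive computation producing a new column.
    score = pc.multiply(table.column("raw_value"), 2.0)
    expanded = table.append_column("score", score)

    # The only option I see today: write a complete new copy, which
    # duplicates every pre-existing column on disk.
    pq.write_table(expanded, "features_expanded.parquet")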
As I understand it, there is no way to alter a data set once it has been persisted. Is that correct? If so, how do others address this situation? The obvious answer is to write an entirely new data set, but that wastes space and encourages duplication. Alternatively, one could write only the new columns to a second data set, but then the links between the two data sets must be managed somehow (see the P.S. for a sketch of what I mean). Does Arrow manage such links? If not, are there standard extensions for managing them, or is there a better approach altogether?

Thanks,
Bill

William F. Smith
Bioinformatician
BCforward
Lilly Biotechnology Center
10290 Campus Point Dr.
San Diego, CA 92121
[email protected]
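P.S. To make the second option concrete, here is roughly what I have in mind. Again, this is only a sketch, and the "row_id" join key is something I would have to add to the base data set myself:

    # Side-file idea: persist only a join key plus the new column,
    # leaving the large base data set untouched. "row_id" is hypothetical.
    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    base = pq.read_table("features.parquet")
    score = pc.multiply(base.column("raw_value"), 2.0)  # expensive step

    side = pa.table({"row_id": base.column("row_id"), "score": score})
    pq.write_table(side, "features_score.parquet")

    # A downstream element reassembles the expanded view itself:
    combined = base.join(pq.read_table("features_score.parquet"),
                         keys="row_id")

This avoids the duplication, but every downstream consumer now has to know about the side file and perform the join, which is exactly the link-management burden I am asking about.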
