I’m got terabytes of analytical data which need to be recomputed with 
additional layers of other analytical data and logical transformations..

If I were to express this using SQL concepts it would look something like:

Create a logical view on table what applies substring, filtering and other data 
cleanup transformations.
Create a logical view on a logical view that applies computations like applying 
fx_rates (another dataset), remapping + consolidating categories.
Other logical transformations..
Final SQL statement to select data out (materialization) and dump it to a file 
or insert into a another table..

It seems that pyarrow.dataset which is logical representation which is schema + 
source only supports filtering..

pyarrow.scanner supports filtering and projection. It even has attributes for 
dataset_schema and projection_schema, but this is an operation to start 
physical materialization.

I don’t want to create persistent temporary storage in the terabytes to capture 
filtered and projected results after each step in my processing pipeline.


From: Chang She <ch...@lancedb.com>
Sent: Wednesday, October 25, 2023 11:19 AM
To: user@arrow.apache.org
Subject: Re: Is it possible to add computed columns to a pyarrow dataset


External Email: Use caution with links and attachments
Do you already have a storage layer to persist these views or do you only need 
ephemeral views? Sounds interesting curious to find out more about your use case

On Wed, Oct 25, 2023 at 2:00 PM Lee, David (PAG) 
<david....@blackrock.com<mailto:david....@blackrock.com>> wrote:
Here's my ideal use case scenario..

Create multiple datasets mapped to different file directories.
Create more datasets by logically generating additional computed columns using 
expressions
Create joined dataset by joining datasets
Finally run a Scanner on the joined dataset to start materialization..

Pyarrow.Dataset.filter supports adding a @filter, but it doesn't have a 
@columns argument.
Pyarrow.Dataset.Scanner supports both @filter and @columns, but I don't want to 
create interim copies of data in memory.

Simplified example:
Give a table that captures local values like 'en-US', 'en-GB', 'fr-CA', etc..
I want to use a pyarrow logical expression to split this into language and 
country so I end up with:
Language: 'en', 'en', 'fr', ..
Country: 'US', 'GB', 'CA', ..
I then want to join Country to a Country dataset which contains Country and 
Country_Name
Language: 'en', 'en', 'fr', ..
Country: 'US', 'GB', 'CA', ..
Country_Name: 'USA', 'Great Britain', 'Cananda', ..

Basically can a dataset handle "logical" column projection to avoid physical 
materialization in memory?


This message may contain information that is confidential or privileged. If you 
are not the intended recipient, please advise the sender immediately and delete 
this message. See 
http://www.blackrock.com/corporate/compliance/email-disclaimers for further 
information.  Please refer to 
http://www.blackrock.com/corporate/compliance/privacy-policy for more 
information about BlackRock’s Privacy Policy.


For a list of BlackRock's office addresses worldwide, see 
http://www.blackrock.com/corporate/about-us/contacts-locations.

© 2023 BlackRock, Inc. All rights reserved.

Reply via email to