Hi David,

For use cases like this on large data, solutions that compose PyArrow
and DuckDB are often effective. Recent collaboration between DuckDB
Labs and Voltron Data has yielded excellent performance for exchanging
data between the two components, and recent work by the DuckDB
developers has improved DuckDB's out-of-core performance. I'm not sure
if there are any existing examples of PyArrow + DuckDB applied in
exactly the way you describe; I will see if we can develop an example
like that.
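
As a rough sketch of the shape this composition can take (the paths,
table names, and column names below are made up):

import duckdb
import pyarrow.dataset as ds

con = duckdb.connect()
# DuckDB can scan a registered PyArrow dataset directly, without copying it.
con.register("raw", ds.dataset("/data/raw", format="parquet"))
con.register("fx_rates", ds.dataset("/data/fx_rates", format="parquet"))

# Views are purely logical; filters and projections get pushed down into
# the Arrow dataset scan, and nothing is materialized yet.
con.execute("""
    CREATE VIEW cleaned AS
    SELECT substr(locale, 1, 2) AS language,
           substr(locale, 4, 2) AS country,
           amount, ccy
    FROM raw
    WHERE locale IS NOT NULL
""")
con.execute("""
    CREATE VIEW converted AS
    SELECT c.language, c.country, c.amount * f.rate AS amount_usd
    FROM cleaned c JOIN fx_rates f USING (ccy)
""")

# Materialization happens once, streamed to the output file.
con.execute("COPY (SELECT * FROM converted) TO 'result.parquet' (FORMAT PARQUET)")

Only the final COPY executes the pipeline, and DuckDB streams it rather
than holding each intermediate result in memory.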

Ian

On Thu, Oct 26, 2023 at 3:37 PM Lee, David (PAG)
<david....@blackrock.com> wrote:
>
> I’ve got terabytes of analytical data which need to be recomputed with
> additional layers of other analytical data and logical transformations.
>
>
>
> If I were to express this using SQL concepts it would look something like:
>
>
>
> Create a logical view on a table that applies substring, filtering, and
> other data cleanup transformations.
>
> Create a logical view on that view which applies computations such as
> applying fx_rates (another dataset) and remapping and consolidating
> categories.
>
> Apply other logical transformations as needed.
>
> Run a final SQL statement to select the data out (materialization) and
> dump it to a file or insert it into another table.
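>
> In pseudocode, the pipeline I'm after has roughly this shape (the
> chained project/join calls below are hypothetical; as far as I can
> tell nothing like them exists on Dataset today):
>
> import pyarrow.dataset as ds
> import pyarrow.compute as pc
>
> raw = ds.dataset("/data/raw", format="parquet")
> fx_rates = ds.dataset("/data/fx_rates", format="parquet")
>
> # Hypothetical: a lazily projected dataset with a computed column.
> cleaned = raw.project({
>     "id": pc.utf8_slice_codepoints(ds.field("id"), start=0, stop=8),
>     "ccy": ds.field("ccy"),
>     "amount": ds.field("amount"),
> })
>
> # Hypothetical: a lazily joined dataset.
> converted = cleaned.join(fx_rates, keys="ccy")
>
> # Only here would anything actually be read and computed.
> result = converted.scanner().to_table()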
>
>
>
> It seems that pyarrow.dataset, which is a logical representation (schema
> + source), only supports filtering.
>
>
>
> pyarrow.scanner supports filtering and projection. It even has
> attributes for dataset_schema and projection_schema, but a scanner is an
> operation that starts physical materialization.
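>
> For example, this is as far as I can get today (column names made up).
> The columns argument does accept named expressions, but consuming the
> scanner kicks off execution:
>
> import pyarrow.dataset as ds
> import pyarrow.compute as pc
>
> dataset = ds.dataset("/data/locales", format="parquet")
> scanner = dataset.scanner(
>     columns={
>         "language": pc.utf8_slice_codepoints(ds.field("locale"), start=0, stop=2),
>         "country": pc.utf8_slice_codepoints(ds.field("locale"), start=3, stop=5),
>     },
>     filter=ds.field("locale").is_valid(),
> )
> table = scanner.to_table()  # materializes the projected result in memory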
>
>
>
> I don’t want to create persistent temporary storage in the terabytes to 
> capture filtered and projected results after each step in my processing 
> pipeline.
>
>
>
>
>
> From: Chang She <ch...@lancedb.com>
> Sent: Wednesday, October 25, 2023 11:19 AM
> To: user@arrow.apache.org
> Subject: Re: Is it possible to add computed columns to a pyarrow dataset
>
>
>
> Do you already have a storage layer to persist these views, or do you
> only need ephemeral views? Sounds interesting; I'm curious to find out
> more about your use case.
>
>
>
> On Wed, Oct 25, 2023 at 2:00 PM Lee, David (PAG) <david....@blackrock.com> 
> wrote:
>
> Here's my ideal use-case scenario:
>
> Create multiple datasets mapped to different file directories.
> Create more datasets by logically generating additional computed columns
> using expressions.
> Create a joined dataset by joining datasets.
> Finally, run a Scanner on the joined dataset to start materialization.
>
> pyarrow.Dataset.filter supports a filter argument, but it doesn't have a
> columns argument.
> pyarrow.Dataset.Scanner supports both filter and columns, but I don't
> want to create interim copies of the data in memory.
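>
> For instance, filtering can stay lazy, but computed columns force me
> through a scanner (path and column names made up):
>
> import pyarrow.dataset as ds
>
> d = ds.dataset("/data/prices", format="parquet")
> d2 = d.filter(ds.field("price") > 0)  # still lazy, still a Dataset
> # Projection is only available on the scanner, which starts execution:
> tbl = d2.scanner(columns=["symbol", "price"]).to_table()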
>
> Simplified example:
> Given a table that captures locale values like 'en-US', 'en-GB', 'fr-CA',
> etc., I want to use a pyarrow logical expression to split this into
> language and country so I end up with:
> Language: 'en', 'en', 'fr', ...
> Country: 'US', 'GB', 'CA', ...
> I then want to join Country to a Country dataset which contains Country
> and Country_Name:
> Language: 'en', 'en', 'fr', ...
> Country: 'US', 'GB', 'CA', ...
> Country_Name: 'USA', 'Great Britain', 'Canada', ...
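>
> With in-memory tables this already works today; the sketch below uses
> split_pattern, list_element, and Table.join, but every step materializes
> in memory:
>
> import pyarrow as pa
> import pyarrow.compute as pc
>
> locales = pa.table({"locale": ["en-US", "en-GB", "fr-CA"]})
> split = pc.split_pattern(locales["locale"], pattern="-")
> t = locales.append_column("Language", pc.list_element(split, 0))
> t = t.append_column("Country", pc.list_element(split, 1))
>
> countries = pa.table({
>     "Country": ["US", "GB", "CA"],
>     "Country_Name": ["USA", "Great Britain", "Canada"],
> })
> joined = t.join(countries, keys="Country")  # in-memory join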
>
> Basically, can a dataset handle "logical" column projection to avoid
> physical materialization in memory?