lllangWV opened a new issue, #45113:
URL: https://github.com/apache/arrow/issues/45113
### Describe the usage question you have. Please include as many useful details as possible.
**Title:** Improving Deserialization Speed for PyArrow to Python Objects
Hello,
I am working with materials data stored in Parquet files, where a column
`structure` contains serialized dictionaries representing structures from the
`Structure` class in the `pymatgen` package. This class stores site and lattice
information and provides a `.to_dict()` method for serialization.
I have a dataset of ~80,000 structures. To deserialize these into
`Structure` objects, I use the following process:
```python
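import pyarrow.dataset as ds
from pymatgen.core import Structure

dataset = ds.dataset(dataset_dir, format="parquet")  # separate name, so the `ds` module is not shadowed
table = dataset.to_table(columns=["structure"])
df = table.to_pandas()  # ~8.20 seconds
df["structure_py"] = df["structure"].map(Structure.from_dict)  # ~116 seconds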
```
The majority of the time is spent mapping the dictionaries to `Structure`
objects via `Structure.from_dict`. I attempted using `pa.ExtensionArray` and
`pa.ExtensionType` to optimize this process but achieved similar performance,
as the bottleneck appears to be in the `Structure.from_dict` calls.
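To illustrate how the per-row construction step can be timed on its own, here is a minimal sketch. It rests on assumptions: `FakeStructure` and `timed` are hypothetical stand-ins (neither is part of `pymatgen` or PyArrow), and the rows are synthetic Python dicts rather than data read from Parquet, so the printed numbers are not comparable to the timings above.

```python
import time
from dataclasses import dataclass


@dataclass
class FakeStructure:
    """Hypothetical stand-in for pymatgen's Structure (assumption)."""
    charge: float
    n_sites: int

    @classmethod
    def from_dict(cls, d):
        # Stands in for the per-row Python call reported as the bottleneck.
        return cls(charge=d["charge"], n_sites=len(d["sites"]))


def timed(label, func):
    """Call func(), print the elapsed wall-clock time, and return its result."""
    start = time.perf_counter()
    result = func()
    print(f"{label}: {time.perf_counter() - start:.4f} s")
    return result


# Synthetic rows shaped loosely like entries of the `structure` column.
rows = [{"charge": 0.0, "sites": [{}] * 4} for _ in range(1000)]

# Time only the object-construction stage, separately from any Arrow/pandas conversion.
structures = timed(
    "from_dict over all rows",
    lambda: [FakeStructure.from_dict(r) for r in rows],
)
```

Wrapping the `to_pandas()` call and the `.map(Structure.from_dict)` call in the same helper would give one timing per stage on the real data.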
Here's an example of my `ExtensionType` implementation:
```python
import pyarrow as pa
from pymatgen.core import Structure


class StructureType(pa.ExtensionType):
    def __init__(self, data_type: pa.DataType):
        if not pa.types.is_struct(data_type):
            raise TypeError(f"data_type must be a struct type, not {data_type}")
        super().__init__(data_type, "matgraphdb.structure")

    def __arrow_ext_serialize__(self) -> bytes:
        # The type has no parameters beyond its storage type, so nothing is serialized.
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        assert pa.types.is_struct(storage_type)
        return StructureType(storage_type)

    def __arrow_ext_class__(self):
        return StructureArray


class StructureArray(pa.ExtensionArray):
    def to_structure(self):
        # Still one Python-level Structure.from_dict call per row.
        return self.storage.to_pandas().map(Structure.from_dict)
```
Despite these efforts, the deserialization time remains substantial. Below
is the type of the `structure` column:
```python
struct<
  @class: string,
  @module: string,
  charge: double,
  lattice: struct<a: double, alpha: double, b: double, beta: double, c: double, gamma: double, ...>,
  sites: list<element: struct<...>>
>
```
Is there a recommended approach within PyArrow to speed up deserialization
of such complex structured data into Python objects?
Best regards,
Logan Lang
### Component(s)
Parquet, Python