[ https://issues.apache.org/jira/browse/ARROW-5144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rok Mihevc updated ARROW-5144:
------------------------------
    External issue URL: https://github.com/apache/arrow/issues/21625

> [Python] ParquetDataset and ParquetPiece not serializable
> ---------------------------------------------------------
>
>                 Key: ARROW-5144
>                 URL: https://issues.apache.org/jira/browse/ARROW-5144
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.13.0
>         Environment: osx python36/conda cloudpickle 0.8.1
>                      arrow-cpp 0.13.0 py36ha71616b_0 conda-forge
>                      pyarrow 0.13.0 py36hb37e6aa_0 conda-forge
>            Reporter: Martin Durant
>            Assignee: Krisztian Szucs
>            Priority: Critical
>              Labels: parquet, pull-request-available
>             Fix For: 0.14.0
>
>         Attachments: part.0.parquet
>
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Since 0.13.0, Parquet instances are no longer serializable, which means that
> dask.distributed cannot pass them between processes in order to load Parquet
> files in parallel.
> Example:
> ```
> >>> import cloudpickle
> >>> import pyarrow.parquet as pq
> >>> pf = pq.ParquetDataset('nation.impala.parquet')
> >>> cloudpickle.dumps(pf)
> ~/anaconda/envs/py36/lib/python3.6/site-packages/cloudpickle/cloudpickle.py in dumps(obj, protocol)
>     893         try:
>     894             cp = CloudPickler(file, protocol=protocol)
> --> 895             cp.dump(obj)
>     896             return file.getvalue()
>     897         finally:
> ~/anaconda/envs/py36/lib/python3.6/site-packages/cloudpickle/cloudpickle.py in dump(self, obj)
>     266         self.inject_addons()
>     267         try:
> --> 268             return Pickler.dump(self, obj)
>     269         except RuntimeError as e:
>     270             if 'recursion' in e.args[0]:
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in dump(self, obj)
>     407         if self.proto >= 4:
>     408             self.framer.start_framing()
> --> 409         self.save(obj)
>     410         self.write(STOP)
>     411         self.framer.end_framing()
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
>     519
>     520         # Save the reduce() output and finally memoize the object
> --> 521         self.save_reduce(obj=obj, *rv)
>     522
>     523     def persistent_id(self, obj):
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save_reduce(self, func, args, state, listitems, dictitems, obj)
>     632
>     633         if state is not None:
> --> 634             save(state)
>     635             write(BUILD)
>     636
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
>     474         f = self.dispatch.get(t)
>     475         if f is not None:
> --> 476             f(self, obj)  # Call unbound method with explicit self
>     477             return
>     478
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save_dict(self, obj)
>     819
>     820         self.memoize(obj)
> --> 821         self._batch_setitems(obj.items())
>     822
>     823     dispatch[dict] = save_dict
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in _batch_setitems(self, items)
>     845                 for k, v in tmp:
>     846                     save(k)
> --> 847                     save(v)
>     848                 write(SETITEMS)
>     849             elif n:
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
>     494                 reduce = getattr(obj, "__reduce_ex__", None)
>     495                 if reduce is not None:
> --> 496                     rv = reduce(self.proto)
>     497                 else:
>     498                     reduce = getattr(obj, "__reduce__", None)
> ~/anaconda/envs/py36/lib/python3.6/site-packages/pyarrow/_parquet.cpython-36m-darwin.so in pyarrow._parquet.ParquetSchema.__reduce_cython__()
> TypeError: no default __reduce__ due to non-trivial __cinit__
> ```
> The indicated ParquetSchema instance is also referenced by the ParquetDatasetPiece objects.
> ref: https://github.com/dask/distributed/issues/2597

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
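A minimal workaround sketch until the 0.14.0 fix: instead of pickling the ParquetDataset (whose ParquetSchema is a Cython object with no default `__reduce__`), ship only plain strings to the workers and reopen the dataset inside each process. The `load_piece` helper and the dataset path below are illustrative, not part of pyarrow's API.

```
import pyarrow.parquet as pq

def load_piece(dataset_path, piece_path):
    # Hypothetical helper: reopen the dataset in the worker process so no
    # ParquetDataset/ParquetSchema object ever crosses the pickle boundary;
    # only plain strings are serialized.
    dataset = pq.ParquetDataset(dataset_path)
    for piece in dataset.pieces:
        if piece.path == piece_path:
            return piece.read(partitions=dataset.partitions).to_pandas()
    raise ValueError('piece not found: %s' % piece_path)

# Driver side: extract picklable piece paths, then submit tasks that
# receive strings instead of pyarrow objects.
dataset = pq.ParquetDataset('nation.impala.parquet')
piece_paths = [piece.path for piece in dataset.pieces]
# e.g. with dask.distributed:
#   futures = [client.submit(load_piece, 'nation.impala.parquet', p)
#              for p in piece_paths]
```

This trades one extra dataset open per worker for serializability; once ARROW-5144 is fixed, the objects themselves pickle and the indirection is unnecessary.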