[ https://issues.apache.org/jira/browse/ARROW-5144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rok Mihevc updated ARROW-5144:
------------------------------
    External issue URL: https://github.com/apache/arrow/issues/21625

> [Python] ParquetDataset and ParquetPiece not serializable
> ---------------------------------------------------------
>
>                 Key: ARROW-5144
>                 URL: https://issues.apache.org/jira/browse/ARROW-5144
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.13.0
>         Environment: osx python36/conda cloudpickle 0.8.1
>                      arrow-cpp 0.13.0 py36ha71616b_0 conda-forge
>                      pyarrow 0.13.0 py36hb37e6aa_0 conda-forge
>            Reporter: Martin Durant
>            Assignee: Krisztian Szucs
>            Priority: Critical
>              Labels: parquet, pull-request-available
>             Fix For: 0.14.0
>
>         Attachments: part.0.parquet
>
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Since 0.13.0, Parquet instances are no longer serializable, which means that
> dask.distributed cannot pass them between processes in order to load Parquet
> files in parallel.
> Example:
> ```
> >>> import cloudpickle
> >>> import pyarrow.parquet as pq
> >>> pf = pq.ParquetDataset('nation.impala.parquet')
> >>> cloudpickle.dumps(pf)
> ~/anaconda/envs/py36/lib/python3.6/site-packages/cloudpickle/cloudpickle.py in dumps(obj, protocol)
>     893         try:
>     894             cp = CloudPickler(file, protocol=protocol)
> --> 895             cp.dump(obj)
>     896             return file.getvalue()
>     897         finally:
> ~/anaconda/envs/py36/lib/python3.6/site-packages/cloudpickle/cloudpickle.py in dump(self, obj)
>     266         self.inject_addons()
>     267         try:
> --> 268             return Pickler.dump(self, obj)
>     269         except RuntimeError as e:
>     270             if 'recursion' in e.args[0]:
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in dump(self, obj)
>     407         if self.proto >= 4:
>     408             self.framer.start_framing()
> --> 409         self.save(obj)
>     410         self.write(STOP)
>     411         self.framer.end_framing()
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
>     519
>     520         # Save the reduce() output and finally memoize the object
> --> 521         self.save_reduce(obj=obj, *rv)
>     522
>     523     def persistent_id(self, obj):
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save_reduce(self, func, args, state, listitems, dictitems, obj)
>     632
>     633         if state is not None:
> --> 634             save(state)
>     635             write(BUILD)
>     636
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
>     474         f = self.dispatch.get(t)
>     475         if f is not None:
> --> 476             f(self, obj)  # Call unbound method with explicit self
>     477             return
>     478
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save_dict(self, obj)
>     819
>     820         self.memoize(obj)
> --> 821         self._batch_setitems(obj.items())
>     822
>     823     dispatch[dict] = save_dict
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in _batch_setitems(self, items)
>     845                 for k, v in tmp:
>     846                     save(k)
> --> 847                     save(v)
>     848                 write(SETITEMS)
>     849             elif n:
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
>     494                 reduce = getattr(obj, "__reduce_ex__", None)
>     495                 if reduce is not None:
> --> 496                     rv = reduce(self.proto)
>     497                 else:
>     498                     reduce = getattr(obj, "__reduce__", None)
> ~/anaconda/envs/py36/lib/python3.6/site-packages/pyarrow/_parquet.cpython-36m-darwin.so in pyarrow._parquet.ParquetSchema.__reduce_cython__()
> TypeError: no default __reduce__ due to non-trivial __cinit__
> ```
> The indicated ParquetSchema instance is also referenced by the ParquetDatasetPiece objects.
> ref: https://github.com/dask/distributed/issues/2597

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
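A minimal workaround sketch until the 0.14.0 fix: instead of pickling the ParquetDataset (whose ParquetSchema is a Cython object with no default `__reduce__`), ship only plain strings to the workers and reopen the dataset inside each process. The `load_piece` helper and the dataset path below are illustrative, not part of pyarrow's API.

```
import pyarrow.parquet as pq

def load_piece(dataset_path, piece_path):
    # Hypothetical helper: reopen the dataset in the worker process so no
    # ParquetDataset/ParquetSchema object ever crosses the pickle boundary;
    # only plain strings are serialized.
    dataset = pq.ParquetDataset(dataset_path)
    for piece in dataset.pieces:
        if piece.path == piece_path:
            return piece.read(partitions=dataset.partitions).to_pandas()
    raise ValueError('piece not found: %s' % piece_path)

# Driver side: extract picklable piece paths, then submit tasks that
# receive strings instead of pyarrow objects.
dataset = pq.ParquetDataset('nation.impala.parquet')
piece_paths = [piece.path for piece in dataset.pieces]
# e.g. with dask.distributed:
#   futures = [client.submit(load_piece, 'nation.impala.parquet', p)
#              for p in piece_paths]
```

This trades one extra dataset open per worker for serializability; once ARROW-5144 is fixed, the objects themselves pickle and the indirection is unnecessary.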