Hello! I'm currently trying to use pyarrow's hdfs library from within Hadoop Streaming, specifically in the reducer, with Python 3.6 (Anaconda), but the problem described below occurs either way. The pyarrow version is 0.9.0.
I'm starting the actual Python script via a wrapper sh script that sets LD_LIBRARY_PATH, since I found that setting it from within Python was not sufficient. When I just test the reducer by piping in data manually and try to save data (in this case a gensim model) that is roughly 3 GB, all I get is the following error message:

      File "reducer.py", line 104, in <module>
        save_model(model)
      File "reducer.py", line 65, in save_model
        model.save(model_fd, sep_limit=1024 * 1024, pickle_protocol=4)
      File "/opt/anaconda3/lib/python3.6/site-packages/gensim/models/word2vec.py", line 930, in save
        super(Word2Vec, self).save(*args, **kwargs)
      File "/opt/anaconda3/lib/python3.6/site-packages/gensim/models/base_any2vec.py", line 281, in save
        super(BaseAny2VecModel, self).save(fname_or_handle, **kwargs)
      File "/opt/anaconda3/lib/python3.6/site-packages/gensim/utils.py", line 688, in save
        _pickle.dump(self, fname_or_handle, protocol=pickle_protocol)
      File "io.pxi", line 220, in pyarrow.lib.NativeFile.write
      File "error.pxi", line 79, in pyarrow.lib.check_status
    pyarrow.lib.ArrowIOError: HDFS: Write failed

Files of about 700 MB seem to work fine, though. Our default block size is 128 MB.

The code to save the model is the following:

    model = word2vec.Word2Vec(size=300, workers=8, iter=1, sg=1)
    # building model here [removed]
    hdfs_client = hdfs.connect(active_master)
    with hdfs_client.open("/user/zab/w2v/%s_test.model" % key, 'wb') as model_fd:
        model.save(model_fd, sep_limit=1024 * 1024)

I would appreciate any help :-)

Best,
Jan

--
Leibniz Universität Hannover
Institut für Verteilte Systeme
Appelstrasse 4 - 30167 Hannover
Phone: +49 (0)511 762 - 17706
Tax ID/Steuernummer: DE811245527
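
P.S. In case it helps to narrow this down, here is a rough sketch of how I could check whether a plain chunked write of roughly 3 GB through pyarrow's HDFS client succeeds on its own, without gensim or pickle involved. The hostname and target path are just placeholders, and the 64 MB chunk size is an arbitrary choice of mine:

    from pyarrow import hdfs

    # Placeholder for our active namenode; in the real job this comes from the
    # cluster configuration (the "active_master" variable in the code above).
    active_master = "namenode.example.org"
    hdfs_client = hdfs.connect(active_master)

    chunk = b"\x00" * (64 * 1024 * 1024)      # 64 MB per write() call
    with hdfs_client.open("/user/zab/w2v/write_test.bin", 'wb') as fd:
        for _ in range(48):                   # 48 * 64 MB = 3 GB in total
            fd.write(chunk)

If that succeeds, the difference to the failing case would presumably lie in the (possibly much larger) individual write() calls that pickle.dump issues, but that is just a guess on my part.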