
I'm currently trying to use pyarrow's HDFS library from within Hadoop
streaming, specifically in the reducer, with Python 3.6 (Anaconda). The
problem described below also occurs outside of Hadoop streaming, though.
The pyarrow version is 0.9.0.

I'm starting the actual Python script via a wrapper sh script that sets
LD_LIBRARY_PATH, since I found that setting it from within Python was
not sufficient.

When I test the reducer by piping in data manually and try to save data
(in this case a gensim model) of roughly 3 GB, the only thing I get is
the following error message:

File "reducer.py", line 104, in <module>
File "reducer.py", line 65, in save_model
  model.save(model_fd, sep_limit=1024 * 1024, pickle_protocol=4)
File "/opt/anaconda3/lib/python3.6/site-packages/gensim/models/word2vec.py", line 930, in save
  super(Word2Vec, self).save(*args, **kwargs)
line 281, in save
  super(BaseAny2VecModel, self).save(fname_or_handle, **kwargs)
File "/opt/anaconda3/lib/python3.6/site-packages/gensim/utils.py", line 688, in 
  _pickle.dump(self, fname_or_handle, protocol=pickle_protocol)
File "io.pxi", line 220, in pyarrow.lib.NativeFile.write
File "error.pxi", line 79, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS: Write failed

Files of about 700 MB seem to work fine, though. Our default block
size is 128 MB.

The code to save the model is the following:

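# Imports inferred from the names used below (pyarrow 0.9.0, gensim)
from gensim.models import word2vec
from pyarrow import hdfs

model = word2vec.Word2Vec(size=300, workers=8, iter=1, sg=1)
# building model here [removed]
# active_master and key are set elsewhere in reducer.py
hdfs_client = hdfs.connect(active_master)
with hdfs_client.open("/user/zab/w2v/%s_test.model" % key, 'wb') as model_fd:
    model.save(model_fd, sep_limit=1024 * 1024)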
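
One thing that might be worth testing, given that 700 MB works and 3 GB
does not, is whether the failure depends on the size of a single write()
call rather than on the total file size. This is only a guess, not a
confirmed cause. The sketch below is a minimal, hypothetical helper
(ChunkedWriter and its chunk_size default are made up for illustration,
they are not part of pyarrow or gensim) that wraps any object with a
write() method and forwards the data in bounded pieces:

```python
import io


class ChunkedWriter:
    """Hypothetical helper: forwards each write() in bounded-size pieces.

    Assumption: the wrapped object only needs a write() method that
    accepts bytes, which is all that pickle.dump() requires.
    """

    def __init__(self, raw, chunk_size=64 * 1024 * 1024):
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        self._raw = raw
        self._chunk_size = chunk_size

    def write(self, data):
        # View the buffer as flat bytes without copying the whole payload.
        view = memoryview(data).cast("B")
        for start in range(0, len(view), self._chunk_size):
            # Only one chunk at a time is copied into a bytes object.
            self._raw.write(bytes(view[start:start + self._chunk_size]))
        return len(view)


if __name__ == "__main__":
    # Local demonstration with an in-memory file instead of HDFS.
    target = io.BytesIO()
    writer = ChunkedWriter(target, chunk_size=4)
    writer.write(b"abcdefghij")
    assert target.getvalue() == b"abcdefghij"
```

In the reducer this would presumably be used as
model.save(ChunkedWriter(model_fd), sep_limit=1024 * 1024), assuming
gensim only calls write() on the handle it is given. If the save then
succeeds, that would point at large single writes as the trigger.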

I would appreciate any help :-)


Leibniz Universität Hannover
Institut für Verteilte Systeme
Appelstrasse 4 - 30167 Hannover
Phone:  +49 (0)511 762 - 17706
Tax ID/Steuernummer: DE811245527
