Hi all,

I'm storing an RDD as a sequencefile with the following content: key = filename (string), value = Python str from numpy.savez (not unicode).
To make sure the whole numpy array gets stored, I first have to serialize it with:

    import io
    import numpy as np

    def serialize_numpy_array(numpy_array):
        # Write the array into an in-memory buffer in compressed .npz format
        output = io.BytesIO()
        np.savez_compressed(output, x=numpy_array)
        return output.getvalue()

    >> type(output.getvalue())
    str

The serialization returns a Python str, *not a unicode object*. After serialization I call

    my_dersialized_numpy_rdd.saveAsSequenceFile(path)

and all works well: the RDD gets stored successfully.

Now the problem starts when I want to read the sequencefile again:

    >> my_dersialized_numpy_rdd = sc.sequenceFile(path)
    >> first = my_dersialized_numpy_rdd.first()
    >> type(first[1])
    unicode

The previous str became a unicode object after being stored to a sequencefile and read back. Trying to convert it back with

    first[1].decode("ascii")

fails with:

    UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-3: ordinal not in range(128)

My expectation was that I would get the data back the way I stored it, i.e. as a str and not as unicode. Does anybody have a suggestion for how I can read back the original data? I will try converting the str to a bytearray before storing it to a sequencefile.

Thanks,
Sam Stoelinga
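
P.S. A rough sketch of the bytearray idea mentioned above, not yet tried against a real sequencefile. It assumes that PySpark stores a bytearray value as binary data and hands back a bytearray on read instead of decoding it to unicode. deserialize_numpy_array is a hypothetical helper name added here for illustration; only the local round trip is shown, the Spark calls would wrap around it.

```python
import io

import numpy as np


def serialize_numpy_array(numpy_array):
    # Same as before, but wrap the raw bytes in a bytearray so the value
    # is treated as binary data rather than as a text string.
    output = io.BytesIO()
    np.savez_compressed(output, x=numpy_array)
    return bytearray(output.getvalue())


def deserialize_numpy_array(payload):
    # Hypothetical counterpart: turn the bytearray back into a file-like
    # object and load the array stored under the key "x".
    buffer = io.BytesIO(bytes(payload))
    return np.load(buffer)["x"]


# Local round trip without Spark
original = np.arange(6).reshape(2, 3)
restored = deserialize_numpy_array(serialize_numpy_array(original))
print(np.array_equal(original, restored))
```

If the assumption holds, the values read back with sc.sequenceFile(path) could be passed straight to deserialize_numpy_array.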