Update: Using bytearray before storing to the RDD is not a solution either. This is what happens when trying to read the RDD back after the value was stored as a Python bytearray:
Traceback (most recent call last):
  File "/vagrant/python/kmeans.py", line 24, in <module>
    features = sc.sequenceFile(feature_sequencefile_path)
  File "/usr/local/spark/python/pyspark/context.py", line 490, in sequenceFile
    keyConverter, valueConverter, minSplits, batchSize)
  File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.sequenceFile.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:9000/tmp/feature-bytearray
	at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
	at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
	at org.apache.spark.rdd.RDD.take(RDD.scala:1156)
	at org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:205)
	at org.apache.spark.api.python.PythonRDD$.sequenceFile(PythonRDD.scala:447)
	at org.apache.spark.api.python.PythonRDD.sequenceFile(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
	at py4j.Gateway.invoke(Gateway.java:259)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:207)
	at java.lang.Thread.run(Thread.java:745)

On Tue, Jun 9, 2015 at 11:04 AM, Sam Stoelinga <sammiest...@gmail.com> wrote:

> Hi all,
>
> I'm storing an RDD as a sequence file with the following content:
> key = filename (string), value = Python str from numpy.savez (not unicode).
>
> In order to make sure the whole numpy array gets stored, I have to first
> serialize it with:
>
>     def serialize_numpy_array(numpy_array):
>         output = io.BytesIO()
>         np.savez_compressed(output, x=numpy_array)
>         return output.getvalue()
>
>     >>> type(output.getvalue())
>     str
>
> Serialization returns a Python str, *not a unicode object*. After
> serialization I call
>
>     my_dersialized_numpy_rdd.saveAsSequenceFile(path)
>
> and all works well; the RDD gets stored successfully. The problem starts
> when I want to read the sequence file again:
>
>     >>> my_dersialized_numpy_rdd = sc.sequenceFile(path)
>     >>> first = my_dersialized_numpy_rdd.first()
>     >>> type(first[1])
>     unicode
>
> The previous str became a unicode object after we stored it to a
> sequence file and read it back. Trying to convert it with
> first[1].decode("ascii") fails with UnicodeEncodeError: 'ascii' codec
> can't encode characters in position 1-3: ordinal not in range(128).
>
> My expectation was that I would get the data back exactly as I stored
> it, i.e. in str format and not in unicode format.
> Does anybody have a suggestion how I can read back the original data?
> I will try converting the str to a bytearray before storing it to a
> sequence file.
>
> Thanks,
> Sam Stoelinga
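For what it's worth, the decode failure in the thread is expected: a compressed .npz payload is arbitrary binary, so it is generally not valid ASCII/UTF-8, and any round-trip through a text (unicode) representation will either fail or corrupt it. A minimal stdlib-only illustration (the byte values here are made up, not taken from an actual .npz file):

```python
# Arbitrary binary data, like a compressed .npz payload, usually
# contains bytes outside the ASCII range, so decoding it as text fails.
blob = bytes(bytearray([0x50, 0x4B, 0x03, 0x04, 0x80, 0xFF]))  # illustrative bytes

try:
    blob.decode("ascii")
    decoded_ok = True
except UnicodeDecodeError:
    # 0x80 and 0xFF are outside the 0-127 ASCII range.
    decoded_ok = False

assert decoded_ok is False
```

This is why the value must be kept as a bytes-oriented writable end to end rather than being coerced through Hadoop's Text type.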
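Note also that the numpy serialization itself round-trips fine within a single process; the data loss happens only in the sequence-file conversion. A minimal sketch of the round-trip, reusing serialize_numpy_array from the thread (the deserialize helper is my own addition):

```python
import io

import numpy as np


def serialize_numpy_array(numpy_array):
    # Same approach as in the thread: serialize the array to a
    # compressed .npz byte string (a Python 2 str / Python 3 bytes).
    output = io.BytesIO()
    np.savez_compressed(output, x=numpy_array)
    return output.getvalue()


def deserialize_numpy_array(data):
    # np.load expects a file-like object, so wrap the raw bytes again.
    archive = np.load(io.BytesIO(data))
    return archive["x"]


original = np.arange(6, dtype=np.float64).reshape(2, 3)
payload = serialize_numpy_array(original)
restored = deserialize_numpy_array(payload)

assert isinstance(payload, bytes)
assert np.array_equal(original, restored)
```

If sequence files are not a hard requirement, one alternative that sidesteps the Text conversion entirely is rdd.saveAsPickleFile(path) with sc.pickleFile(path) to read it back, which pickles the Python objects as-is; I have not compared its performance against sequence files, though.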