Update: Using bytearray before storing to the RDD is not a solution either. This is what happens when trying to read the RDD back after the value was stored as a Python bytearray:
Traceback (most recent call last):
  File "/vagrant/python/kmeans.py", line 24, in <module>
    features = sc.sequenceFile(feature_sequencefile_path)
  File "/usr/local/spark/python/pyspark/context.py", line 490, in sequenceFile
    keyConverter, valueConverter, minSplits, batchSize)
  File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.sequenceFile.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:9000/tmp/feature-bytearray
	at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
	at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
	at org.apache.spark.rdd.RDD.take(RDD.scala:1156)
	at org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:205)
	at org.apache.spark.api.python.PythonRDD$.sequenceFile(PythonRDD.scala:447)
	at org.apache.spark.api.python.PythonRDD.sequenceFile(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
	at py4j.Gateway.invoke(Gateway.java:259)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:207)
	at java.lang.Thread.run(Thread.java:745)

On Tue, Jun 9, 2015 at 11:04 AM, Sam Stoelinga <sammiest...@gmail.com> wrote:

> Hi all,
>
> I'm storing an RDD as a sequence file with the following content:
> key = filename (string), value = Python str from numpy.savez (not unicode).
>
> In order to make sure the whole numpy array gets stored, I have to first
> serialize it with:
>
>     def serialize_numpy_array(numpy_array):
>         output = io.BytesIO()
>         np.savez_compressed(output, x=numpy_array)
>         return output.getvalue()
>
>     >>> type(output.getvalue())
>     str
>
> Serialization returns a Python str, *not a unicode object*. After
> serialization I call
>
>     my_dersialized_numpy_rdd.saveAsSequenceFile(path)
>
> and all works well; the RDD gets stored successfully. The problem starts
> when I want to read the sequence file again:
>
>     >>> my_dersialized_numpy_rdd = sc.sequenceFile(path)
>     >>> first = my_dersialized_numpy_rdd.first()
>     >>> type(first[1])
>     unicode
>
> The previous str became a unicode object after we stored it to a
> sequence file and read it back. Trying to convert it with
> first[1].decode("ascii") fails with UnicodeEncodeError: 'ascii' codec
> can't encode characters in position 1-3: ordinal not in range(128).
>
> My expectation was that I would get the data back exactly as I stored
> it, i.e. in str format and not in unicode format.
> Does anybody have a suggestion how I can read back the original data?
> I will try converting the str to a bytearray before storing it to a
> sequence file.
>
> Thanks,
> Sam Stoelinga
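For what it's worth, the decode failure in the thread is expected: a compressed .npz payload is arbitrary binary, so it is generally not valid ASCII/UTF-8, and any round-trip through a text (unicode) representation will either fail or corrupt it. A minimal stdlib-only illustration (the byte values here are made up, not taken from an actual .npz file):

```python
# Arbitrary binary data, like a compressed .npz payload, usually
# contains bytes outside the ASCII range, so decoding it as text fails.
blob = bytes(bytearray([0x50, 0x4B, 0x03, 0x04, 0x80, 0xFF]))  # illustrative bytes

try:
    blob.decode("ascii")
    decoded_ok = True
except UnicodeDecodeError:
    # 0x80 and 0xFF are outside the 0-127 ASCII range.
    decoded_ok = False

assert decoded_ok is False
```

This is why the value must be kept as a bytes-oriented writable end to end rather than being coerced through Hadoop's Text type.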
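Note also that the numpy serialization itself round-trips fine within a single process; the data loss happens only in the sequence-file conversion. A minimal sketch of the round-trip, reusing serialize_numpy_array from the thread (the deserialize helper is my own addition):

```python
import io

import numpy as np


def serialize_numpy_array(numpy_array):
    # Same approach as in the thread: serialize the array to a
    # compressed .npz byte string (a Python 2 str / Python 3 bytes).
    output = io.BytesIO()
    np.savez_compressed(output, x=numpy_array)
    return output.getvalue()


def deserialize_numpy_array(data):
    # np.load expects a file-like object, so wrap the raw bytes again.
    archive = np.load(io.BytesIO(data))
    return archive["x"]


original = np.arange(6, dtype=np.float64).reshape(2, 3)
payload = serialize_numpy_array(original)
restored = deserialize_numpy_array(payload)

assert isinstance(payload, bytes)
assert np.array_equal(original, restored)
```

If sequence files are not a hard requirement, one alternative that sidesteps the Text conversion entirely is rdd.saveAsPickleFile(path) with sc.pickleFile(path) to read it back, which pickles the Python objects as-is; I have not compared its performance against sequence files, though.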