I've verified that the issue lies in running the OpenCV code on Spark, not
in the sequence file's BytesWritable formatting.

The following code shows that Spark itself triggers the failure: it does not
use the sequence file as input at all, yet running the same function with the
same input on Spark still fails:

import cv2
import numpy as np

def extract_sift_features_opencv(imgfile_imgbytes):
    # The value from the sequence file is deliberately discarded; the image
    # bytes are read from local disk instead.
    imgfilename, discardsequencefile = imgfile_imgbytes
    imgbytes = bytearray(open("/tmp/img.jpg", "rb").read())
    nparr = np.fromstring(buffer(imgbytes), np.uint8)  # buffer() is Python 2
    img = cv2.imdecode(nparr, 1)  # decode as a 3-channel BGR image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    sift = cv2.xfeatures2d.SIFT_create()
    kp, descriptors = sift.detectAndCompute(gray, None)  # the crashing call
    return (imgfilename, "test")
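
Called directly in a plain Python interpreter with the same shape of
argument, e.g.

extract_sift_features_opencv(("img.jpg", None))

the function returns normally; it only crashes when the same call goes
through a Spark map (the invocation shown here is assumed; the actual
harness is in the tests.py linked below).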

And corresponding tests.py:
https://gist.github.com/samos123/d383c26f6d47d34d32d6
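
For context, a minimal driver for this repro could look roughly like the
following (a sketch only: the SparkContext setup and the dummy key/value
pairs are my assumptions, the real harness is the tests.py gist above):

from pyspark import SparkContext

sc = SparkContext(appName="sift-crash-repro")

# Dummy (filename, bytes) pairs: the mapped function discards the value and
# reads /tmp/img.jpg from local disk, so no sequence file is involved.
pairs = [("img%d.jpg" % i, None) for i in range(4)]

print(sc.parallelize(pairs).map(extract_sift_features_opencv).collect())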


On Sat, May 30, 2015 at 8:04 PM, Sam Stoelinga <sammiest...@gmail.com>
wrote:

> Thanks for the advice! The following line causes Spark to crash:
>
> kp, descriptors = sift.detectAndCompute(gray, None)
>
> But I do need this line to be executed, and the code does not crash when
> run outside of Spark with the same parameters. You're suggesting that the
> bytes from the sequence file may somehow have been transformed and no
> longer represent an image, causing OpenCV to crash the whole Python
> executor.
>
> On Fri, May 29, 2015 at 2:06 AM, Davies Liu <dav...@databricks.com> wrote:
>
>> Could you try to comment out some lines in
>> `extract_sift_features_opencv` to find which line causes the crash?
>>
>> If the bytes that come from sequenceFile() are broken, it's easy to
>> crash a C library in Python (OpenCV).
>>
>> On Thu, May 28, 2015 at 8:33 AM, Sam Stoelinga <sammiest...@gmail.com>
>> wrote:
>> > Hi sparkers,
>> >
>> > I am working on a PySpark application which uses the OpenCV library.
>> > It runs fine when running the code locally, but when I try to run it
>> > on Spark on the same machine it crashes the worker.
>> >
>> > The code can be found here:
>> > https://gist.github.com/samos123/885f9fe87c8fa5abf78f
>> >
>> > This is the error message taken from STDERR of the worker log:
>> > https://gist.github.com/samos123/3300191684aee7fc8013
>> >
>> > I would like pointers or tips on how to debug further. It would also
>> > be nice to know the reason why the worker crashed.
>> >
>> > Thanks,
>> > Sam Stoelinga
>> >
>> >
>> > org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
>> >     at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:172)
>> >     at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
>> >     at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
>> >     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> >     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> >     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>> >     at org.apache.spark.scheduler.Task.run(Task.scala:64)
>> >     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>> >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >     at java.lang.Thread.run(Thread.java:745)
>> > Caused by: java.io.EOFException
>> >     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>> >     at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:108)
