I've changed the SIFT feature extraction to SURF feature extraction and it works...
The following line was changed:

    sift = cv2.xfeatures2d.SIFT_create()

to

    sift = cv2.xfeatures2d.SURF_create()

Where should I file this as a bug? The same code works fine when it is not
running on Spark, so I believe this is a Spark bug.

On Fri, Jun 5, 2015 at 2:17 PM, Sam Stoelinga <sammiest...@gmail.com> wrote:

> Yeah, I should have emphasized that. I'm running the same code on the same
> VM. It's a VM with Spark in standalone mode, and I run the unit test
> directly on that same VM. So OpenCV is working correctly on that machine,
> but when I move the exact same OpenCV code to Spark it just crashes.
>
> On Tue, Jun 2, 2015 at 5:06 AM, Davies Liu <dav...@databricks.com> wrote:
>
>> Could you run the single-threaded version on the worker machine to make
>> sure that OpenCV is installed and configured correctly?
>>
>> On Sat, May 30, 2015 at 6:29 AM, Sam Stoelinga <sammiest...@gmail.com>
>> wrote:
>> > I've verified that the issue lies within Spark running OpenCV code and
>> > not within the sequence file BytesWritable formatting.
>> >
>> > The code below reproduces the failure without using the sequence file
>> > as input at all: the same function with the same input fails when it
>> > runs on Spark.
>> >
>> > def extract_sift_features_opencv(imgfile_imgbytes):
>> >     imgfilename, discardsequencefile = imgfile_imgbytes
>> >     imgbytes = bytearray(open("/tmp/img.jpg", "rb").read())
>> >     nparr = np.fromstring(buffer(imgbytes), np.uint8)
>> >     img = cv2.imdecode(nparr, 1)
>> >     gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
>> >     sift = cv2.xfeatures2d.SIFT_create()
>> >     kp, descriptors = sift.detectAndCompute(gray, None)
>> >     return (imgfilename, "test")
>> >
>> > And the corresponding tests.py:
>> > https://gist.github.com/samos123/d383c26f6d47d34d32d6
>> >
>> > On Sat, May 30, 2015 at 8:04 PM, Sam Stoelinga <sammiest...@gmail.com>
>> > wrote:
>> >>
>> >> Thanks for the advice!
>> >> The following line causes Spark to crash:
>> >>
>> >> kp, descriptors = sift.detectAndCompute(gray, None)
>> >>
>> >> But I do need this line to be executed, and the code does not crash
>> >> when run outside of Spark with the same parameters. You're saying that
>> >> maybe the bytes from the sequence file got transformed somehow and no
>> >> longer represent an image, causing OpenCV to crash the whole Python
>> >> executor.
>> >>
>> >> On Fri, May 29, 2015 at 2:06 AM, Davies Liu <dav...@databricks.com>
>> >> wrote:
>> >>>
>> >>> Could you try commenting out some lines in
>> >>> `extract_sift_features_opencv` to find which line causes the crash?
>> >>>
>> >>> If the bytes that came from sequenceFile() are broken, it's easy to
>> >>> crash a C library in Python (OpenCV).
>> >>>
>> >>> On Thu, May 28, 2015 at 8:33 AM, Sam Stoelinga <sammiest...@gmail.com>
>> >>> wrote:
>> >>> > Hi sparkers,
>> >>> >
>> >>> > I am working on a PySpark application which uses the OpenCV library.
>> >>> > It runs fine when I run the code locally, but when I try to run it
>> >>> > on Spark on the same machine it crashes the worker.
>> >>> >
>> >>> > The code can be found here:
>> >>> > https://gist.github.com/samos123/885f9fe87c8fa5abf78f
>> >>> >
>> >>> > This is the error message taken from STDERR of the worker log:
>> >>> > https://gist.github.com/samos123/3300191684aee7fc8013
>> >>> >
>> >>> > I would like pointers or tips on how to debug this further. It would
>> >>> > be nice to know the reason why the worker crashed.
>> >>> >
>> >>> > Thanks,
>> >>> > Sam Stoelinga
>> >>> >
>> >>> > org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
>> >>> >     at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:172)
>> >>> >     at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
>> >>> >     at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
>> >>> >     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> >>> >     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> >>> >     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>> >>> >     at org.apache.spark.scheduler.Task.run(Task.scala:64)
>> >>> >     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>> >>> >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >>> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >>> >     at java.lang.Thread.run(Thread.java:745)
>> >>> > Caused by: java.io.EOFException
>> >>> >     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>> >>> >     at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:108)
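
The thread asks how to find out why the worker crashed, and the trace above only shows the JVM side (an EOFException because the Python process went away). One possible way to see the Python side is to run the suspect code in a throwaway interpreter with the standard-library faulthandler enabled, so that a native crash leaves a Python traceback and an exit code. This is only a sketch and not something from the thread: it assumes Python 3 on a POSIX system, `run_isolated` is a hypothetical helper name, and `os.abort()` stands in for the crashing OpenCV call.

```python
import subprocess
import sys


def run_isolated(snippet):
    """Run Python source in a fresh interpreter with faulthandler enabled.

    Hypothetical helper (not part of Spark or OpenCV). Returns a
    (returncode, stderr) pair. On POSIX, a negative return code means the
    child process was terminated by the signal with that number.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.returncode, proc.stderr


if __name__ == "__main__":
    # os.abort() is a stand-in for a native crash inside a C library.
    code, err = run_isolated("import os; os.abort()")
    print("exit code:", code)
    print(err)
```

For the case in this thread, the snippet would instead import cv2 and call the extraction function on /tmp/img.jpg under the same user and environment as the Spark worker. A negative exit code such as -11 (SIGSEGV on Linux) would suggest the crash happens inside native code and not in the Python logic.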