I didn't realize I get a nice stack trace when I'm not running in debug mode.
Basically, I believe Document has to be serializable. But since the question
has already been asked: are there other requirements for objects within an
RDD that I should be aware of? Serializable is understandable enough. How
about clone, hashCode, etc.?
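
For reference, here's a minimal sketch of what I mean. The fields are made
up since I don't have the real Document class in front of me; only
getFeatures() comes from the code below:

import java.io.Serializable;
import java.util.HashSet;
import java.util.Set;

// Implementing java.io.Serializable is the key requirement: collect()
// serializes each element on the executors and deserializes it back
// on the driver.
public class Document implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String id;                              // hypothetical field
    private final Set<String> features = new HashSet<>(); // hypothetical field

    public Document(String id) {
        this.id = id;
    }

    public Set<String> getFeatures() {
        return features;
    }

    // Every field must itself be serializable (String and HashSet are).
    // As far as I can tell, hashCode()/equals() are not needed for
    // collect(); they only matter for key-based operations such as
    // distinct() or reduceByKey(). clone() isn't used by Spark at all.
}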

From: ronalday...@live.com
To: user@spark.apache.org
Subject: collecting fails - requirements for collecting (clone, hashCode etc?)
Date: Wed, 3 Dec 2014 07:48:53 -0600


The following code is failing on the collect. If I skip the collect and stay
with a JavaRDD<Document>, it works fine, except I really would like to
collect. At first I was getting an error regarding JDI threads and an index
being 0; then it just started locking up. I'm running the Spark context
locally on 8 cores.

long count = documents
        .filter(d -> d.getFeatures().size() > Parameters.MIN_CENTROID_FEATURES)
        .count();

List<Document> sampledDocuments = documents
        .filter(d -> d.getFeatures().size() > Parameters.MIN_CENTROID_FEATURES)
        .sample(false, samplingFraction(count))
        .collect();

