Thanks for the quick reply. I won't be able to collect more data until
Monday, but I will update the thread once I have it.
I am using Spark 1.4.0. Have any related issues been reported? I wasn't
able to find any, but I may have overlooked something. I have also updated
the original question to include the relevant Java files; maybe the issue
is hidden somewhere in there.
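
A minimal sketch of the collect()-and-log check suggested in the quoted reply below. It is plain Java with no Spark dependency; the class and method names (LeftoverReport, describe) are made up for illustration, and the list passed in is assumed to be the result of something like testSet.subtract(testSet2).collect(). Printing the hashCode next to each value should make it easier to spot records that look identical but hash differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LeftoverReport {

    // Formats each leftover element with its class name, hashCode and
    // toString() value, one log line per element.
    static List<String> describe(List<?> leftovers) {
        List<String> lines = new ArrayList<String>();
        for (Object o : leftovers) {
            lines.add(o.getClass().getSimpleName()
                    + " hash=" + o.hashCode()
                    + " value=" + o);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in for testSet.subtract(testSet2).collect() (assumption).
        List<String> leftovers = Arrays.asList("a", "b");
        for (String line : describe(leftovers)) {
            System.out.println(line);
        }
    }
}
```

For the real job, the same describe() call could be applied to the collected output of each of the three subtract calls, provided the leftover set is small enough to fit on the driver.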
Ted Yu wrote on Fri, 31 July 2015 at 18:09:
> Can you call collect() and log the output to get a better idea of what is left?
>
> Which Spark release are you using?
>
> Cheers
>
> On Fri, Jul 31, 2015 at 9:01 AM, Warfish wrote:
>
>> Hi everyone,
>>
>> I have been working with Spark for a little while now and have run into a
>> strange problem with the JavaRDD.subtract method that is giving me
>> headaches. Consider the following piece of code:
>>
>> public static void main(String[] args) {
>>     // context is of type JavaSparkContext; FILE is the file path to my input file
>>     JavaRDD<String> rawTestSet = context.textFile(FILE);
>>     JavaRDD<String> rawTestSet2 = context.textFile(FILE);
>>
>>     // Gives 0 every time -> correct
>>     System.out.println("rawTestSetMinusRawTestSet2 = "
>>             + rawTestSet.subtract(rawTestSet2).count());
>>
>>     // SearchData is a custom POJO that holds my data
>>     JavaRDD<SearchData> testSet = convert(rawTestSet);
>>     JavaRDD<SearchData> testSet2 = convert(rawTestSet);
>>     JavaRDD<SearchData> testSet3 = convert(rawTestSet2);
>>
>>     // These calls give numbers != 0 in cluster mode -> incorrect
>>     System.out.println("testSetMinusTestSet2 = "
>>             + testSet.subtract(testSet2).count());
>>     System.out.println("testSetMinusTestSet3 = "
>>             + testSet.subtract(testSet3).count());
>>     System.out.println("testSet2MinusTestSet3 = "
>>             + testSet2.subtract(testSet3).count());
>> }
>>
>> // Turns the raw input lines into SearchData objects
>> private static JavaRDD<SearchData> convert(JavaRDD<String> input) {
>>     return input.filter(new Matches("myRegex"))
>>             .map(new DoSomething())
>>             .map(new Split("mySplitParam"))
>>             .map(new ToMap())
>>             .map(new Clean())
>>             .map(new ToSearchData());
>> }
>>
>> In this code, I read a file (usually from HDFS, but the same applies to
>> local disk) and then convert the Strings into custom objects that hold the
>> data, using a chain of filter and map operations. These objects are simple
>> POJOs with overridden hashCode() and equals() methods. I then apply the
>> subtract method to several JavaRDDs that contain exactly the same data.
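
One thing worth checking here, as a hedged guess and not a confirmed diagnosis: when subtract() runs across executors it partitions records by hashCode(), so equal objects must produce the same hash in every JVM. A hashCode() built on the default Object.hashCode() of an array or an enum field does not guarantee that, while hashes derived from String contents do. The sketch below uses a hypothetical stand-in for SearchData; the real class was not posted, so the class name and both fields are invented for illustration.

```java
import java.util.Arrays;

public class SearchDataSketch {
    private final String query;    // hypothetical field
    private final String[] terms;  // hypothetical field

    public SearchDataSketch(String query, String[] terms) {
        this.query = query;
        this.terms = terms.clone();
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof SearchDataSketch)) return false;
        SearchDataSketch that = (SearchDataSketch) other;
        // Compare array contents, not array references.
        return query.equals(that.query) && Arrays.equals(terms, that.terms);
    }

    @Override
    public int hashCode() {
        // Arrays.hashCode hashes the contents; terms.hashCode() would be
        // identity-based and differ between equal arrays.
        return 31 * query.hashCode() + Arrays.hashCode(terms);
    }

    public static void main(String[] args) {
        SearchDataSketch a = new SearchDataSketch("q", new String[] {"x", "y"});
        SearchDataSketch b = new SearchDataSketch("q", new String[] {"x", "y"});
        // Equal contents give equal objects and equal hashes.
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode());
    }
}
```

If the real POJO holds arrays, enums, or nested objects without their own equals()/hashCode(), those fields would be the first place to look.
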
>>
>> Note: I have omitted the POJO code and the filter and map functions to
>> make the code more concise, but I can post them later if the need arises.
>>
>> The main method shown above contains several calls to the subtract method,
>> all of which should return empty RDDs because the data in all RDDs should
>> be exactly the same. This works for Spark in local mode. When the code
>> runs on a cluster, however, the second block of subtract calls does not
>> produce empty sets, which tells me that this is a more complicated issue.
>> The input data was exactly the same in local and cluster mode.
>>
>> Can someone shed some light on this issue? I feel like I'm overlooking
>> something rather obvious.
>>