Can you call collect() and log the output to get a better idea of what is left over? Which Spark release are you using?
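One guess, in case it helps (this is an assumption, not something your mail confirms): RDD.subtract hash-partitions elements by their hashCode(), so a hashCode() that is not derived purely from value fields can agree within a single local JVM but disagree across executors, which would match your local-vs-cluster symptom. A minimal plain-Java sketch of the contract subtract relies on, with a hypothetical SearchData standing in for yours:

```java
import java.util.*;

public class SubtractContract {
    // Hypothetical POJO standing in for the SearchData class in the thread.
    static final class SearchData {
        final String term;
        final Map<String, String> fields;

        SearchData(String term, Map<String, String> fields) {
            this.term = term;
            this.fields = fields;
        }

        @Override public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof SearchData)) return false;
            SearchData other = (SearchData) o;
            return term.equals(other.term) && fields.equals(other.fields);
        }

        @Override public int hashCode() {
            // Derived only from value fields, so it is identical in every JVM.
            // A default (identity) hashCode would differ per JVM and per run.
            return Objects.hash(term, fields);
        }
    }

    public static void main(String[] args) {
        // Two distinct instances holding equal values, as you would get when
        // parsing the same input file twice.
        SearchData a = new SearchData("foo", Map.of("k", "v"));
        SearchData b = new SearchData("foo", Map.of("k", "v"));

        // Plain set difference, which is what subtract computes per partition:
        // equal-valued elements must collide on hashCode to cancel out.
        Set<SearchData> left = new HashSet<>(List.of(a));
        left.removeAll(Set.of(b));
        System.out.println(left.size()); // prints 0 when the contract holds
    }
}
```

If your hashCode() reads anything non-deterministic (identity hashes, iteration order of an unordered structure, a transient field left null after deserialization), the per-partition sets stop matching on the cluster while still matching locally.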
Cheers

On Fri, Jul 31, 2015 at 9:01 AM, Warfish <sebastian.ka...@gmail.com> wrote:
> Hi everyone,
>
> I have worked with Spark for a little while now and have encountered a
> strange problem that gives me headaches, which has to do with the
> JavaRDD.subtract method. Consider the following piece of code:
>
> public static void main(String[] args) {
>     // context is of type JavaSparkContext; FILE is the path to my input file
>     JavaRDD<String> rawTestSet  = context.textFile(FILE);
>     JavaRDD<String> rawTestSet2 = context.textFile(FILE);
>
>     // Gives 0 every time -> correct
>     System.out.println("rawTestSetMinusRawTestSet2 = "
>             + rawTestSet.subtract(rawTestSet2).count());
>
>     // SearchData is a custom POJO that holds my data
>     JavaRDD<SearchData> testSet  = convert(rawTestSet);
>     JavaRDD<SearchData> testSet2 = convert(rawTestSet);
>     JavaRDD<SearchData> testSet3 = convert(rawTestSet2);
>
>     // These calls give numbers != 0 in cluster mode -> incorrect
>     System.out.println("testSetMinusTestSet2 = "
>             + testSet.subtract(testSet2).count());
>     System.out.println("testSetMinusTestSet3 = "
>             + testSet.subtract(testSet3).count());
>     System.out.println("testSet2MinusTestSet3 = "
>             + testSet2.subtract(testSet3).count());
> }
>
> private static JavaRDD<SearchData> convert(JavaRDD<String> input) {
>     return input.filter(new Matches("myRegex"))
>                 .map(new DoSomething())
>                 .map(new Split("mySplitParam"))
>                 .map(new ToMap())
>                 .map(new Clean())
>                 .map(new ToSearchData());
> }
>
> In this code, I read a file (usually from HDFS, but this applies to local
> disk as well) and then convert the Strings into custom objects that hold
> the data, using a chain of filter and map operations. These objects are
> simple POJOs with overridden hashCode() and equals() methods. I then apply
> the subtract method to several JavaRDDs that contain exactly equal data.
>
> Note: I have omitted the POJO code and the filter and map functions to
> keep the code concise, but I can post them later if the need arises.
>
> The main method shown above contains several calls to the subtract method,
> all of which should yield empty RDDs as results, because the data in all
> RDDs should be exactly the same. This works for Spark in local mode;
> however, when the code is executed on a cluster, the second block of
> subtract calls does not produce empty sets, which tells me that it is a
> more complicated issue. The input data in local and cluster mode was
> exactly the same.
>
> Can someone shed some light on this issue? I feel like I'm overlooking
> something rather obvious.
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Issues-with-JavaRDD-subtract-JavaRDD-method-in-local-vs-cluster-mode-tp24099.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.