Re: Issues with JavaRDD.subtract(JavaRDD) method in local vs. cluster mode

2015-07-31 Thread Sebastian Kalix
Thanks for the quick reply. I will be unable to collect more data until
Monday, but I will update the thread accordingly.

I am using Spark 1.4.0. Were there any related issues reported? I wasn't
able to find any, but I may have overlooked something. I have also updated
the original question to include the relevant Java files; maybe the issue
is hidden in there somewhere.

Ted Yu wrote on Fri, Jul 31, 2015 at 18:09:

> Can you call collect() and log the output to get a better idea of what is
> left over?
>
> Which Spark release are you using?
>
> Cheers
>
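A minimal sketch of the collect-and-log step suggested above, assuming the
testSet/testSet2 variables from the code quoted further down, and that
SearchData has a useful toString() (neither is confirmed in the thread):

    // Inside main(), after testSet and testSet2 are defined
    // (requires java.util.List): materialize the leftover elements
    // instead of only counting them, so they can be inspected in the
    // driver's log.
    List<SearchData> leftover = testSet.subtract(testSet2).collect();
    for (SearchData d : leftover) {
        System.out.println("leftover element: " + d);
    }
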
> On Fri, Jul 31, 2015 at 9:01 AM, Warfish wrote:
>
>> Hi everyone,
>>
>> I have been working with Spark for a little while now and have run into a
>> strange problem that is giving me headaches. It has to do with the
>> JavaRDD.subtract method. Consider the following piece of code:
>>
>> public static void main(String[] args) {
>>     // context is of type JavaSparkContext; FILE is the path to my input file
>>     JavaRDD<String> rawTestSet  = context.textFile(FILE);
>>     JavaRDD<String> rawTestSet2 = context.textFile(FILE);
>>
>>     // Gives 0 every time -> correct
>>     System.out.println("rawTestSetMinusRawTestSet2 = "
>>         + rawTestSet.subtract(rawTestSet2).count());
>>
>>     // SearchData is a custom POJO that holds my data
>>     JavaRDD<SearchData> testSet  = convert(rawTestSet);
>>     JavaRDD<SearchData> testSet2 = convert(rawTestSet);
>>     JavaRDD<SearchData> testSet3 = convert(rawTestSet2);
>>
>>     // These calls give numbers != 0 in cluster mode -> incorrect
>>     System.out.println("testSetMinusTestSet2  = "
>>         + testSet.subtract(testSet2).count());
>>     System.out.println("testSetMinusTestSet3  = "
>>         + testSet.subtract(testSet3).count());
>>     System.out.println("testSet2MinusTestSet3 = "
>>         + testSet2.subtract(testSet3).count());
>> }
>>
>> private static JavaRDD<SearchData> convert(JavaRDD<String> input) {
>>     return input.filter(new Matches("myRegex"))
>>                 .map(new DoSomething())
>>                 .map(new Split("mySplitParam"))
>>                 .map(new ToMap())
>>                 .map(new Clean())
>>                 .map(new ToSearchData());
>> }
>>
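The filter and map classes are not shown in the post. Purely for
illustration, a Spark 1.x filter function such as Matches might look like
the following sketch (implementation assumed, not taken from the original
code):

    import org.apache.spark.api.java.function.Function;

    // Hypothetical sketch of the Matches filter used in convert() above.
    // Spark's Function interface is already Serializable, so instances
    // can be shipped to executors.
    public class Matches implements Function<String, Boolean> {
        private final String regex;

        public Matches(String regex) {
            this.regex = regex;
        }

        @Override
        public Boolean call(String line) {
            return line.matches(regex);
        }
    }
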
>> In this code, I read a file (usually from HDFS, but the same applies to a
>> local disk) and then convert the Strings into custom objects that hold the
>> data, using a chain of filter and map operations. These objects are simple
>> POJOs with overridden hashCode() and equals() methods. I then apply the
>> subtract method to several JavaRDDs that contain exactly equal data.
>>
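Because subtract hash-partitions both RDDs before matching elements,
equals() and hashCode() have to be mutually consistent and deterministic
across JVMs. A sketch of a safe SearchData along those lines (the fields
are assumed for illustration; the real POJO was not posted):

    import java.io.Serializable;
    import java.util.Map;
    import java.util.Objects;

    // Hypothetical SearchData POJO. The important property for subtract():
    // hashCode() must be a pure function of the value fields, so that
    // equal objects hash identically on every executor JVM.
    public class SearchData implements Serializable {
        private final Map<String, String> fields;

        public SearchData(Map<String, String> fields) {
            this.fields = fields;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof SearchData)) return false;
            return Objects.equals(fields, ((SearchData) o).fields);
        }

        @Override
        public int hashCode() {
            return Objects.hash(fields);
        }
    }

If hashCode() instead depends on something that is only stable within a
single JVM (an enum constant or any field that falls back to identity
hashing, for example), equal objects hash identically in local mode but
differently across executor JVMs, so subtract's hash partitioning never
brings them together. That would produce exactly this local-vs-cluster
discrepancy.
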
>> Note: I have omitted the POJO code and the filter and map functions to keep
>> the code concise, but I can post them later if the need arises.
>>
>> The main method shown above makes several calls to the subtract method, all
>> of which should yield empty RDDs, because the data in all of the RDDs
>> should be exactly the same. This works in Spark's local mode; however, when
>> the code is executed on a cluster, the second block of subtract calls does
>> not produce empty sets, which tells me the issue is more complicated. The
>> input data in local and cluster mode was exactly the same.
>>
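A hypothetical way to tell a partitioning problem apart from a genuine data
difference, reusing the variable names from the code above:

    // Compare the contents of both RDDs on the driver, bypassing
    // subtract's hash partitioning entirely (requires java.util.List;
    // only sensible for data small enough to collect).
    List<SearchData> a = testSet.collect();
    List<SearchData> b = testSet2.collect();
    System.out.println("same contents: "
        + (a.containsAll(b) && b.containsAll(a)));

If this prints true while the subtract counts are non-zero, the data itself
is equal and the hashing is the likely suspect.
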
>> Can someone shed some light on this issue? I feel like I'm overlooking
>> something rather obvious.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Issues-with-JavaRDD-subtract-JavaRDD-method-in-local-vs-cluster-mode-tp24099.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>

