Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/760#issuecomment-43460411
  
    @witgo your approach is similar to how Range is partitioned (i.e., using the ```step``` value to recalculate some of the elements in the sequence). One issue with applying this approach to Double is that the recalculated elements may not have exactly the same values as those in the original sequence, due to rounding error (see below). The original approach used to partition NumericRange doesn't recalculate elements; it simply slices the range using ```take``` and ```drop```. While the implementations of ```take``` and ```drop``` may, in turn, involve calculations on ```step```, I think it is tricky to get the precision right, and that's why we have this bug in Scala. IMHO, if we were to fix it, we should fix it in Scala. @mateiz?
    
    Partitioning a Double sequence using the approach above:
    ```
    scala> (1D to 2D).by(0.2)
    res0: scala.collection.immutable.NumericRange[Double] = NumericRange(1.0, 1.2, 1.4, 1.5999999999999999, 1.7999999999999998, 1.9999999999999998)
    
    scala> sc.parallelize((1D to 2D).by(0.2),4).collect 
    res1: Array[Double] = Array(1.0, 1.2, 1.4, 1.6, 1.8, 2.0)
    ```
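    The divergence can be reproduced without Spark. Here is a minimal sketch in plain Scala (the object and value names are hypothetical, not from either implementation), comparing repeated addition of ```step``` (which matches the ```res0``` values above) against recalculating each element as ```start + index * step```:

    ```scala
    // Hypothetical demo: compare the elements a Double range holds when built
    // by repeated addition versus recalculated from the step value.
    object StepRoundingDemo {
      def main(args: Array[String]): Unit = {
        val start = 1.0
        val step  = 0.2

        // Repeated addition: rounding error accumulates as elements are
        // generated one after another (note 1.5999999999999999, as in res0).
        val accumulated = Iterator.iterate(start)(_ + step).take(6).toList

        // Per-index recalculation, as a partition-side recompute would do;
        // it rounds differently (1.6000000000000001 at index 3).
        val recalculated = (0 until 6).map(i => start + i * step).toList

        println(accumulated)
        println(recalculated)
        println(accumulated == recalculated)  // false: the two strategies disagree
      }
    }
    ```

    Slicing with ```take``` and ```drop``` would hand each partition the ```accumulated``` values unchanged, which is why it avoids this mismatch.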

