[GitHub] spark pull request #21805: [SPARK-24850][SQL] fix str representation of Cach...

onursatici Wed, 18 Jul 2018 08:53:05 -0700

GitHub user onursatici opened a pull request:

    https://github.com/apache/spark/pull/21805


    [SPARK-24850][SQL] fix str representation of CachedRDDBuilder

    ## What changes were proposed in this pull request?
    As of https://github.com/apache/spark/pull/21018, InMemoryRelation includes 
its cacheBuilder when logging query plans. This PR changes the string 
representation of CachedRDDBuilder so that it no longer includes the cached 
Spark plan.
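
    A minimal sketch of the idea, not the actual patch: a case class's default
    toString prints every field, so a builder that holds a plan prints the
    whole plan too. Overriding toString keeps only the scalar settings. The
    names PlanStub, VerboseBuilder and ConciseBuilder are hypothetical
    stand-ins, and the field names are assumptions rather than Spark's API.

    ```scala
    object CachedBuilderToStringSketch {
      // Hypothetical stand-in for a physical plan node; prints itself and its children.
      final case class PlanStub(name: String, children: Seq[PlanStub] = Nil) {
        override def toString: String =
          (name +: children.map(_.toString)).mkString("\n")
      }

      // Default case-class toString: every field is printed, including the plan.
      final case class VerboseBuilder(
          useCompression: Boolean,
          batchSize: Int,
          storageLevel: String,
          cachedPlan: PlanStub,
          tableName: Option[String])

      // Same fields, but toString is overridden to leave the plan out.
      final case class ConciseBuilder(
          useCompression: Boolean,
          batchSize: Int,
          storageLevel: String,
          cachedPlan: PlanStub,
          tableName: Option[String]) {
        override def toString: String =
          s"CachedRDDBuilder($useCompression, $batchSize, $storageLevel)"
      }

      def main(args: Array[String]): Unit = {
        val plan = PlanStub("Project", Seq(PlanStub("FileScan csv")))
        val level = "StorageLevel(disk, memory, deserialized, 1 replicas)"
        println(VerboseBuilder(true, 10000, level, plan, None)) // includes the plan text
        println(ConciseBuilder(true, 10000, level, plan, None)) // settings only
      }
    }
    ```

    With nested caches the verbose form repeats the plan at every level, which
    is the growth visible in the master output below.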
    
    ## How was this patch tested?
    
    spark-shell, query:
    ```
    var df_cached = spark.read.format("csv").option("header", "true").load("test.csv").cache()
    0 to 1 foreach { _ =>
      df_cached = df_cached.join(spark.read.format("csv").option("header", "true").load("test.csv"), "A").cache()
    }
    df_cached.explain
    ```
    On master, this results in:
    ```
    == Physical Plan ==
    InMemoryTableScan [A#10, B#11, B#35, B#87]
    +- InMemoryRelation [A#10, B#11, B#35, B#87], 
CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 
replicas),*(2) Project [A#10, B#11, B#35, B#87]
    +- *(2) BroadcastHashJoin [A#10], [A#86], Inner, BuildRight
    :- *(2) Filter isnotnull(A#10)
    : +- InMemoryTableScan [A#10, B#11, B#35], [isnotnull(A#10)]
    : +- InMemoryRelation [A#10, B#11, B#35], 
CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 
replicas),*(2) Project [A#10, B#11, B#35]
    +- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight
    :- *(2) Filter isnotnull(A#10)
    : +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)]
    : +- InMemoryRelation [A#10, B#11], 
CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 
replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<A:string,B:string>
    ,None)
    : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<A:string,B:string>
    +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, 
false]))
    +- *(1) Filter isnotnull(A#34)
    +- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)]
    +- InMemoryRelation [A#34, B#35], 
CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 
replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<A:string,B:string>
    ,None)
    +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<A:string,B:string>
    ,None)
    : +- *(2) Project [A#10, B#11, B#35]
    : +- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight
    : :- *(2) Filter isnotnull(A#10)
    : : +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)]
    : : +- InMemoryRelation [A#10, B#11], 
CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 
replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<A:string,B:string>
    ,None)
    : : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<A:string,B:string>
    : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, 
false]))
    : +- *(1) Filter isnotnull(A#34)
    : +- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)]
    : +- InMemoryRelation [A#34, B#35], 
CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 
replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<A:string,B:string>
    ,None)
    : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<A:string,B:string>
    +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, 
false]))
    +- *(1) Filter isnotnull(A#86)
    +- InMemoryTableScan [A#86, B#87], [isnotnull(A#86)]
    +- InMemoryRelation [A#86, B#87], 
CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 
replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<A:string,B:string>
    ,None)
    +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<A:string,B:string>
    ,None)
    +- *(2) Project [A#10, B#11, B#35, B#87]
    +- *(2) BroadcastHashJoin [A#10], [A#86], Inner, BuildRight
    :- *(2) Filter isnotnull(A#10)
    : +- InMemoryTableScan [A#10, B#11, B#35], [isnotnull(A#10)]
    : +- InMemoryRelation [A#10, B#11, B#35], 
CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 
replicas),*(2) Project [A#10, B#11, B#35]
    +- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight
    :- *(2) Filter isnotnull(A#10)
    : +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)]
    : +- InMemoryRelation [A#10, B#11], 
CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 
replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<A:string,B:string>
    ,None)
    : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<A:string,B:string>
    +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, 
false]))
    +- *(1) Filter isnotnull(A#34)
    +- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)]
    +- InMemoryRelation [A#34, B#35], 
CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 
replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<A:string,B:string>
    ,None)
    +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<A:string,B:string>
    ,None)
    : +- *(2) Project [A#10, B#11, B#35]
    : +- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight
    : :- *(2) Filter isnotnull(A#10)
    : : +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)]
    : : +- InMemoryRelation [A#10, B#11], 
CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 
replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<A:string,B:string>
    ,None)
    : : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<A:string,B:string>
    : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, 
false]))
    : +- *(1) Filter isnotnull(A#34)
    : +- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)]
    : +- InMemoryRelation [A#34, B#35], 
CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 
replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<A:string,B:string>
    ,None)
    : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<A:string,B:string>
    +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, 
false]))
    +- *(1) Filter isnotnull(A#86)
    +- InMemoryTableScan [A#86, B#87], [isnotnull(A#86)]
    +- InMemoryRelation [A#86, B#87], 
CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 
replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<A:string,B:string>
    ,None)
    +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<A:string,B:string>
    ```
    With this patch, it results in:
    ```
    == Physical Plan ==
    InMemoryTableScan [A#10, B#11, B#35, B#87]
       +- InMemoryRelation [A#10, B#11, B#35, B#87], CachedRDDBuilder(true, 
10000, StorageLevel(disk, memory, deserialized, 1 replicas))
             +- *(2) Project [A#10, B#11, B#35, B#87]
                +- *(2) BroadcastHashJoin [A#10], [A#86], Inner, BuildRight
                   :- *(2) Filter isnotnull(A#10)
                   :  +- InMemoryTableScan [A#10, B#11, B#35], [isnotnull(A#10)]
                   :        +- InMemoryRelation [A#10, B#11, B#35], 
CachedRDDBuilder(true, 10000, StorageLevel(disk, memory, deserialized, 1 
replicas))
                   :              +- *(2) Project [A#10, B#11, B#35]
                   :                 +- *(2) BroadcastHashJoin [A#10], [A#34], 
Inner, BuildRight
                   :                    :- *(2) Filter isnotnull(A#10)
                   :                    :  +- InMemoryTableScan [A#10, B#11], 
[isnotnull(A#10)]
                   :                    :        +- InMemoryRelation [A#10, 
B#11], CachedRDDBuilder(true, 10000, StorageLevel(disk, memory, deserialized, 1 
replicas))
                   :                    :              +- *(1) FileScan csv 
[A#10,B#11] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<A:string,B:string>
                   :                    +- BroadcastExchange 
HashedRelationBroadcastMode(List(input[0, string, false]))
                   :                       +- *(1) Filter isnotnull(A#34)
                   :                          +- InMemoryTableScan [A#34, 
B#35], [isnotnull(A#34)]
                   :                                +- InMemoryRelation [A#34, 
B#35], CachedRDDBuilder(true, 10000, StorageLevel(disk, memory, deserialized, 1 
replicas))
                   :                                      +- *(1) FileScan csv 
[A#10,B#11] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<A:string,B:string>
                   +- BroadcastExchange 
HashedRelationBroadcastMode(List(input[0, string, false]))
                      +- *(1) Filter isnotnull(A#86)
                         +- InMemoryTableScan [A#86, B#87], [isnotnull(A#86)]
                               +- InMemoryRelation [A#86, B#87], 
CachedRDDBuilder(true, 10000, StorageLevel(disk, memory, deserialized, 1 
replicas))
                                     +- *(1) FileScan csv [A#10,B#11] Batched: 
false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct<A:string,B:string>
    ```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/onursatici/spark os/inmemoryrelation-str

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21805.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21805
    
----
commit 2a49fe4b91875e4f14d4bbeef8459cde8bf9ac26
Author: Onur Satici <osatici@...>
Date:   2018-07-18T15:43:41Z

    fix str representation of CachedRDDBuilder

----


---

