[jira] [Commented] (SPARK-25474) Size in bytes of the query is coming in EB in case of parquet datasource

2018-09-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622546#comment-16622546
 ] 

Apache Spark commented on SPARK-25474:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/22502

> Size in bytes of the query is coming in EB in case of parquet datasource
> 
>
> Key: SPARK-25474
> URL: https://issues.apache.org/jira/browse/SPARK-25474
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.1
> Environment: Spark 2.3.1, Hadoop 2.7.2
> Reporter: Ayush Anubhava
> Priority: Major
>
> *Description:* For a Parquet datasource table, the query's sizeInBytes is 
> reported in EB. This hurts performance, since join queries on such tables 
> always run as Sort Merge Join.
> *Precondition :* spark.sql.statistics.fallBackToHdfs = true
> Steps:
> {code:java}
> 0: jdbc:hive2://10.xx:23040/default> create table t1110 (a int, b string) 
> using parquet PARTITIONED BY (b) ;
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (2,'b');
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (1,'a');
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.xx.xx:23040/default> select * from t1110;
> +++--+
> | a | b |
> +++--+
> | 1 | a |
> | 2 | b |
> +++--+
> {code}
> *{color:#d04437}Cost of the query shows sizeInBytes in EB{color}*
> {code:java}
>  explain cost select * from t1110;
> | == Optimized Logical Plan ==
> Relation[a#23,b#24] parquet, Statistics(sizeInBytes=8.0 EB, hints=none)
> == Physical Plan ==
> *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: Parquet, 
> Location: 
> CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], 
> PartitionCount: 2, PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct |
> {code}
> *{color:#d04437}This would lead to Sort Merge Join in case of join 
> query{color}*
> {code:java}
> 0: jdbc:hive2://10.xx.xx:23040/default> create table t110 (a int, b string) 
> using parquet PARTITIONED BY (b) ;
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.xx.xx:23040/default> insert into t110 values (1,'a');
> +-+--+
> | Result |
> +-+--+
> +-+--+
>  explain select * from t1110 t1 join t110 t2 on t1.a=t2.a;
> | == Physical Plan ==
> *(5) SortMergeJoin [a#23], [a#55], Inner
> :- *(2) Sort [a#23 ASC NULLS FIRST], false, 0
> : +- Exchange hashpartitioning(a#23, 200)
> : +- *(1) Project [a#23, b#24]
> : +- *(1) Filter isnotnull(a#23)
> : +- *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: 
> Parquet, Location: 
> CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], 
> PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(a)], 
> ReadSchema: struct
> +- *(4) Sort [a#55 ASC NULLS FIRST], false, 0
> +- Exchange hashpartitioning(a#55, 200)
> +- *(3) Project [a#55, b#56]
> +- *(3) Filter isnotnull(a#55)
> +- *(3) FileScan parquet open.t110[a#55,b#56] Batched: true, Format: Parquet, 
> Location: 
> CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t110], 
> PartitionCount: 1, PartitionFilters: [], PushedFilters: [IsNotNull(a)], 
> ReadSchema: struct |
> {code}
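
A note on the 8.0 EB figure above: it matches the JVM's Long.MaxValue (2^63 - 1 bytes), which is also the default of spark.sql.defaultSizeInBytes, so the relation appears to be falling back to the default size estimate rather than a measured one. Any such estimate is far above the default spark.sql.autoBroadcastJoinThreshold of 10 MB, which would explain why the planner picks Sort Merge Join. The Python sketch below only illustrates that arithmetic. The helper names human_readable and can_broadcast are hypothetical and are not Spark APIs, and the two constants are Spark's documented defaults, not values read from the reporter's cluster.

```python
# Assumption: these constants are Spark's documented defaults, not values
# taken from the cluster in this report.
LONG_MAX = 2**63 - 1                          # JVM Long.MaxValue; default spark.sql.defaultSizeInBytes
AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024   # default spark.sql.autoBroadcastJoinThreshold (10 MB)


def human_readable(size_in_bytes: int) -> str:
    """Format a byte count with binary (1024-based) units, one decimal place."""
    units = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]
    value = float(size_in_bytes)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


def can_broadcast(size_in_bytes: int, threshold: int = AUTO_BROADCAST_THRESHOLD) -> bool:
    """Hypothetical stand-in for the planner's size check on one join side."""
    return 0 <= size_in_bytes <= threshold


print(human_readable(LONG_MAX))   # prints "8.0 EB"
print(can_broadcast(LONG_MAX))    # prints "False"
```

With a realistic estimate (a few hundred bytes for the tables above), can_broadcast would return True, so a broadcast join would at least be eligible.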



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org






[jira] [Commented] (SPARK-25474) Size in bytes of the query is coming in EB in case of parquet datasource

2018-09-20 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622065#comment-16622065
 ] 

shahid commented on SPARK-25474:


Thanks [~Ayush007], [~sandeep.katta2007]. I am working on it and will raise a PR :)




[jira] [Commented] (SPARK-25474) Size in bytes of the query is coming in EB in case of parquet datasource

2018-09-19 Thread sandeep katta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621541#comment-16621541
 ] 

sandeep katta commented on SPARK-25474:
---

Okay, I would like to look into this issue and raise a PR.




[jira] [Commented] (SPARK-25474) Size in bytes of the query is coming in EB in case of parquet datasource

2018-09-19 Thread Ayush Anubhava (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621539#comment-16621539
 ] 

Ayush Anubhava commented on SPARK-25474:


It behaves the same even when spark.sql.statistics.fallBackToHdfs is enabled.

 




[jira] [Commented] (SPARK-25474) Size in bytes of the query is coming in EB in case of parquet datasource

2018-09-19 Thread sandeep katta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621535#comment-16621535
 ] 

sandeep katta commented on SPARK-25474:
---

What is the behavior with spark.sql.statistics.fallBackToHdfs = false?
