[jira] [Updated] (SPARK-38565) Support Left Semi join in row level runtime filters

2022-03-16 Thread Abhishek Somani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Somani updated SPARK-38565:

Description: 
Support Left Semi join in the runtime filtering as well.

This is a follow-up to https://issues.apache.org/jira/browse/SPARK-32268, to be 
done once [https://github.com/apache/spark/pull/35789] is merged.

  was:
This is a follow-up to https://issues.apache.org/jira/browse/SPARK-32268, to be 
done once [https://github.com/apache/spark/pull/35789] is merged.

 

Support Left Semi join in the runtime filtering as well.


> Support Left Semi join in row level runtime filters
> ---
>
> Key: SPARK-38565
> URL: https://issues.apache.org/jira/browse/SPARK-38565
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Abhishek Somani
>Priority: Major
>
> Support Left Semi join in the runtime filtering as well.
> This is a follow-up to https://issues.apache.org/jira/browse/SPARK-32268, to 
> be done once [https://github.com/apache/spark/pull/35789] is merged.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38565) Support Left Semi join in row level runtime filters

2022-03-16 Thread Abhishek Somani (Jira)
Abhishek Somani created SPARK-38565:
---

 Summary: Support Left Semi join in row level runtime filters
 Key: SPARK-38565
 URL: https://issues.apache.org/jira/browse/SPARK-38565
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Abhishek Somani


This is a follow-up to https://issues.apache.org/jira/browse/SPARK-32268, to be 
done once [https://github.com/apache/spark/pull/35789] is merged.

 

Support Left Semi join in the runtime filtering as well.






[jira] [Commented] (SPARK-32268) Bloom Filter Join

2022-03-09 Thread Abhishek Somani (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17503876#comment-17503876
 ] 

Abhishek Somani commented on SPARK-32268:
-

We have created a design doc and a PR for this:

Design doc: 
[https://docs.google.com/document/d/16IEuyLeQlubQkH8YuVuXWKo2-grVIoDJqQpHZrE7q04/edit#]

PR: https://github.com/apache/spark/pull/35789

> Bloom Filter Join
> -
>
> Key: SPARK-32268
> URL: https://issues.apache.org/jira/browse/SPARK-32268
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Attachments: q16-bloom-filter.jpg, q16-default.jpg
>
>
> We can improve the performance of some joins by pre-filtering one side of a 
> join using a Bloom filter and IN predicate generated from the values from the 
> other side of the join.
>  For 
> example:[tpcds/q16.sql|https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q16.sql].
>  [Before this 
> optimization|https://issues.apache.org/jira/secure/attachment/13007418/q16-default.jpg].
>  [After this 
> optimization|https://issues.apache.org/jira/secure/attachment/13007416/q16-bloom-filter.jpg].
> *Query Performance Benchmarks: TPC-DS Performance Evaluation*
>  Our setup for running TPC-DS benchmark was as follows: TPC-DS 5T and 
> Partitioned Parquet table
>  
> |Query|Default(Seconds)|Enable Bloom Filter Join(Seconds)|
> |tpcds q16|84|46|
> |tpcds q36|29|21|
> |tpcds q57|39|28|
> |tpcds q94|42|34|
> |tpcds q95|306|288|
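The pre-filtering idea described above can be sketched in plain Scala. This is a toy illustration only, with a hypothetical `ToyBloomFilter` class; Spark ships a production Bloom filter in `org.apache.spark.util.sketch.BloomFilter`, and the actual feature injects the filter via optimizer rules rather than an explicit `filter` call.

```scala
import scala.util.hashing.MurmurHash3

// Toy Bloom filter for illustration only (hypothetical; Spark's real one
// lives in org.apache.spark.util.sketch.BloomFilter).
class ToyBloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new Array[Boolean](numBits)
  private def positions(key: String): Seq[Int] =
    (0 until numHashes).map(seed => math.floorMod(MurmurHash3.stringHash(key, seed), numBits))
  def put(key: String): Unit = positions(key).foreach(i => bits(i) = true)
  def mightContain(key: String): Boolean = positions(key).forall(i => bits(i))
}

// "Build" side of the join: collect its join keys into the filter.
val buildKeys = Seq("store1", "store2", "store3")
val bloom = new ToyBloomFilter(numBits = 1024, numHashes = 3)
buildKeys.foreach(bloom.put)

// "Probe" side: drop rows whose key definitely cannot match before the
// expensive join runs. False positives are possible; false negatives are not.
val probeRows = Seq("store1" -> 10, "store9" -> 20, "store2" -> 30)
val preFiltered = probeRows.filter { case (key, _) => bloom.mightContain(key) }
```

Because the filter never produces false negatives, every row that would have joined still survives the pre-filter; the benchmark gains above come from the rows it discards early.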






[jira] [Updated] (SPARK-37199) Add a deterministic field to QueryPlan

2021-11-02 Thread Abhishek Somani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Somani updated SPARK-37199:

Description: 
We have a _deterministic_ field in 
[Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115]
 to check if an expression is deterministic, but we do not have a similar field 
in 
[QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44]

We have a need for such a check in the QueryPlan sometimes, like in 
[InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56]

This proposal is to add a _deterministic_ field to QueryPlan.

More details [in this 
document|https://docs.google.com/document/d/1eIiaSJf-Co2HhjsaQxFNGwUxobnHID4ZGmJMcVytREc/edit#].

  was:
We have a _deterministic_ field in 
[Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115]
 to check if an expression is deterministic, but we do not have a similar field 
in 
[QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44]

We have a need for such a check in the QueryPlan sometimes, like in 
[InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56]

This proposal is to add a _deterministic_ field to QueryPlan.

More details in this document.


> Add a deterministic field to QueryPlan
> --
>
> Key: SPARK-37199
> URL: https://issues.apache.org/jira/browse/SPARK-37199
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Abhishek Somani
>Priority: Major
>
> We have a _deterministic_ field in 
> [Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115]
>  to check if an expression is deterministic, but we do not have a similar 
> field in 
> [QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44]
> We have a need for such a check in the QueryPlan sometimes, like in 
> [InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56]
> This proposal is to add a _deterministic_ field to QueryPlan.
> More details [in this 
> document|https://docs.google.com/document/d/1eIiaSJf-Co2HhjsaQxFNGwUxobnHID4ZGmJMcVytREc/edit#].
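The check the proposal describes (needed ad hoc in places like InlineCTE) amounts to folding the determinism flags of a plan's expressions and children over the plan tree. A minimal standalone sketch, with hypothetical names that only mirror Catalyst's, not the actual org.apache.spark.sql.catalyst code:

```scala
// Hypothetical sketch of the proposal; not Spark's real classes.
trait Expression { def deterministic: Boolean }
case class Literal(value: Int) extends Expression { val deterministic = true }
case class Rand() extends Expression { val deterministic = false }

trait QueryPlan {
  def expressions: Seq[Expression]
  def children: Seq[QueryPlan]
  // The proposed field: a plan is deterministic only if every expression
  // it holds and every child plan is deterministic.
  lazy val deterministic: Boolean =
    expressions.forall(_.deterministic) && children.forall(_.deterministic)
}
case class PlanNode(expressions: Seq[Expression], children: Seq[QueryPlan]) extends QueryPlan

val leaf = PlanNode(Seq(Literal(1)), Nil)      // deterministic
val nondet = PlanNode(Seq(Rand()), Seq(leaf))  // not deterministic
```

With such a field, a rule like InlineCTE could ask `plan.deterministic` directly instead of re-deriving the property from the expressions each time.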






[jira] [Commented] (SPARK-37199) Add a deterministic field to QueryPlan

2021-11-02 Thread Abhishek Somani (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437514#comment-17437514
 ] 

Abhishek Somani commented on SPARK-37199:
-

Will add a PR soon.

> Add a deterministic field to QueryPlan
> --
>
> Key: SPARK-37199
> URL: https://issues.apache.org/jira/browse/SPARK-37199
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Abhishek Somani
>Priority: Major
>
> We have a _deterministic_ field in 
> [Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115]
>  to check if an expression is deterministic, but we do not have a similar 
> field in 
> [QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44]
> We have a need for such a check in the QueryPlan sometimes, like in 
> [InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56]
> This proposal is to add a _deterministic_ field to QueryPlan.
> More details [in this 
> document|https://docs.google.com/document/d/1eIiaSJf-Co2HhjsaQxFNGwUxobnHID4ZGmJMcVytREc/edit#].






[jira] [Created] (SPARK-37199) Add a deterministic field to QueryPlan

2021-11-02 Thread Abhishek Somani (Jira)
Abhishek Somani created SPARK-37199:
---

 Summary: Add a deterministic field to QueryPlan
 Key: SPARK-37199
 URL: https://issues.apache.org/jira/browse/SPARK-37199
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Abhishek Somani


We have a _deterministic_ field in 
[Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115]
 to check if an expression is deterministic, but we do not have a similar field 
in 
[QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44]

We have a need for such a check in the QueryPlan sometimes, like in 
[InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56]

This proposal is to add a _deterministic_ field to QueryPlan.

More details in this document.






[jira] [Updated] (SPARK-37199) Add a deterministic field to QueryPlan

2021-11-02 Thread Abhishek Somani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Somani updated SPARK-37199:

Description: 
We have a _deterministic_ field in 
[Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115]
 to check if an expression is deterministic, but we do not have a similar field 
in 
[QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44]

We have a need for such a check in the QueryPlan sometimes, like in 
[InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56]

This proposal is to add a _deterministic_ field to QueryPlan.

More details [in this 
document|https://docs.google.com/document/d/1eIiaSJf-Co2HhjsaQxFNGwUxobnHID4ZGmJMcVytREc/edit#].

  was:
We have a _deterministic_ field in 
[Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115]
 to check if an expression is deterministic, but we do not have a similar field 
in 
[QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44]

We have a need for such a check in the QueryPlan sometimes, like in 
[InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56]

This proposal is to add a _deterministic_ field to QueryPlan.

More details [in this 
document|https://docs.google.com/document/d/1eIiaSJf-Co2HhjsaQxFNGwUxobnHID4ZGmJMcVytREc/edit#].


> Add a deterministic field to QueryPlan
> --
>
> Key: SPARK-37199
> URL: https://issues.apache.org/jira/browse/SPARK-37199
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Abhishek Somani
>Priority: Major
>
> We have a _deterministic_ field in 
> [Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115]
>  to check if an expression is deterministic, but we do not have a similar 
> field in 
> [QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44]
> We have a need for such a check in the QueryPlan sometimes, like in 
> [InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56]
> This proposal is to add a _deterministic_ field to QueryPlan.
> More details [in this 
> document|https://docs.google.com/document/d/1eIiaSJf-Co2HhjsaQxFNGwUxobnHID4ZGmJMcVytREc/edit#].






[jira] [Commented] (SPARK-37046) Alter view does not preserve column case

2021-10-18 Thread Abhishek Somani (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430105#comment-17430105
 ] 

Abhishek Somani commented on SPARK-37046:
-

I'll raise a PR soon.

> Alter view does not preserve column case
> 
>
> Key: SPARK-37046
> URL: https://issues.apache.org/jira/browse/SPARK-37046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Abhishek Somani
>Priority: Major
>
> On running an `alter view` command, the column case is not preserved.
> Repro:
>  
> {code:java}
> scala> sql("create view v as select 1 as A, 1 as B")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("describe v").collect.foreach(println)
> [A,int,null]
> [B,int,null]
> scala> sql("alter view v as select 1 as C, 1 as D")
> res4: org.apache.spark.sql.DataFrame = []
> scala> sql("describe v").collect.foreach(println)
> [c,int,null]
> [d,int,null]
>  
> {code}
>  
>  






[jira] [Updated] (SPARK-37046) Alter view does not preserve column case

2021-10-18 Thread Abhishek Somani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Somani updated SPARK-37046:

Shepherd: Wenchen Fan

> Alter view does not preserve column case
> 
>
> Key: SPARK-37046
> URL: https://issues.apache.org/jira/browse/SPARK-37046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Abhishek Somani
>Priority: Major
>
> On running an `alter view` command, the column case is not preserved.
> Repro:
>  
> {code:java}
> scala> sql("create view v as select 1 as A, 1 as B")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("describe v").collect.foreach(println)
> [A,int,null]
> [B,int,null]
> scala> sql("alter view v as select 1 as C, 1 as D")
> res4: org.apache.spark.sql.DataFrame = []
> scala> sql("describe v").collect.foreach(println)
> [c,int,null]
> [d,int,null]
>  
> {code}
>  
>  






[jira] [Created] (SPARK-37046) Alter view does not preserve column case

2021-10-18 Thread Abhishek Somani (Jira)
Abhishek Somani created SPARK-37046:
---

 Summary: Alter view does not preserve column case
 Key: SPARK-37046
 URL: https://issues.apache.org/jira/browse/SPARK-37046
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Abhishek Somani


On running an `alter view` command, the column case is not preserved.

Repro:

 
{code:java}
scala> sql("create view v as select 1 as A, 1 as B")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("describe v").collect.foreach(println)
[A,int,null]
[B,int,null]

scala> sql("alter view v as select 1 as C, 1 as D")
res4: org.apache.spark.sql.DataFrame = []

scala> sql("describe v").collect.foreach(println)
[c,int,null]
[d,int,null]
 
{code}
 

 






[jira] [Comment Edited] (SPARK-16996) Hive ACID delta files not seen

2019-12-18 Thread Abhishek Somani (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-16996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999454#comment-16999454
 ] 

Abhishek Somani edited comment on SPARK-16996 at 12/18/19 7:43 PM:
---

[~SandhyaMora] We are extending the work we did here:

[https://github.com/qubole/spark-acid]

to write Hive ACID tables from Spark. The patch 
[https://github.com/qubole/spark-acid/pull/30] is up for review and will be 
released soon.


was (Author: asomani):
[~SandhyaMora] We are [extending the work we did 
here|https://github.com/qubole/spark-acid] to write Hive ACID tables from Spark. 
The patch [https://github.com/qubole/spark-acid/pull/30] is up for review and 
will be released soon.

> Hive ACID delta files not seen
> --
>
> Key: SPARK-16996
> URL: https://issues.apache.org/jira/browse/SPARK-16996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.3, 2.1.2, 2.2.0
> Environment: Hive 1.2.1, Spark 1.5.2
>Reporter: Benjamin BONNET
>Priority: Major
>
> spark-sql does not seem to see data stored as delta files in an ACID Hive table.
> I encountered the same problem as described here:
> http://stackoverflow.com/questions/35955666/spark-sql-is-not-returning-records-for-hive-transactional-tables-on-hdp
> For example, create an ACID table with HiveCLI and insert a row :
> {code}
> set hive.support.concurrency=true;
> set hive.enforce.bucketing=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> set hive.compactor.initiator.on=true;
> set hive.compactor.worker.threads=1;
>  CREATE TABLE deltas(cle string,valeur string) CLUSTERED BY (cle) INTO 1 
> BUCKETS
> ROW FORMAT SERDE  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> STORED AS 
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> TBLPROPERTIES ('transactional'='true');
> INSERT INTO deltas VALUES("a","a");
> {code}
> Then make a query with spark-sql CLI :
> {code}
> SELECT * FROM deltas;
> {code}
> That query gets no result and there are no errors in logs.
> If you go to HDFS to inspect table files, you find only deltas
> {code}
> ~>hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 1 items
> drwxr-x---   - me hdfs  0 2016-08-10 14:03 
> /apps/hive/warehouse/deltas/delta_0020943_0020943
> {code}
> Then if you run compaction on that table (in HiveCLI) :
> {code}
> ALTER TABLE deltas COMPACT 'MAJOR';
> {code}
> As a result, the delta is compacted into a base file:
> {code}
> ~>hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 1 items
> drwxrwxrwx   - me hdfs  0 2016-08-10 15:25 
> /apps/hive/warehouse/deltas/base_0020943
> {code}
> Go back to spark-sql and the same query gets a result :
> {code}
> SELECT * FROM deltas;
> a   a
> Time taken: 0.477 seconds, Fetched 1 row(s)
> {code}
> But next time you make an insert into Hive table : 
> {code}
> INSERT INTO deltas VALUES("b","b");
> {code}
> spark-sql will immediately see changes : 
> {code}
> SELECT * FROM deltas;
> a   a
> b   b
> Time taken: 0.122 seconds, Fetched 2 row(s)
> {code}
> Yet there was no other compaction, but spark-sql "sees" the base AND the 
> delta file :
> {code}
> ~> hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 2 items
> drwxrwxrwx   - valdata hdfs  0 2016-08-10 15:25 
> /apps/hive/warehouse/deltas/base_0020943
> drwxr-x---   - valdata hdfs  0 2016-08-10 15:31 
> /apps/hive/warehouse/deltas/delta_0020956_0020956
> {code}






[jira] [Comment Edited] (SPARK-16996) Hive ACID delta files not seen

2019-12-18 Thread Abhishek Somani (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-16996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999454#comment-16999454
 ] 

Abhishek Somani edited comment on SPARK-16996 at 12/18/19 7:42 PM:
---

[~SandhyaMora] We are [extending the work we did 
here|https://github.com/qubole/spark-acid] to write Hive ACID tables from Spark. 
The patch [https://github.com/qubole/spark-acid/pull/30] is up for review and 
will be released soon.


was (Author: asomani):
[~SandhyaMora] We are [extending the work we did 
here|https://github.com/qubole/spark-acid] to write Hive ACID tables from 
Spark. The patch [https://github.com/qubole/spark-acid/pull/30] is up for 
review and will be released soon.

> Hive ACID delta files not seen
> --
>
> Key: SPARK-16996
> URL: https://issues.apache.org/jira/browse/SPARK-16996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.3, 2.1.2, 2.2.0
> Environment: Hive 1.2.1, Spark 1.5.2
>Reporter: Benjamin BONNET
>Priority: Major
>
> spark-sql does not seem to see data stored as delta files in an ACID Hive table.
> I encountered the same problem as described here:
> http://stackoverflow.com/questions/35955666/spark-sql-is-not-returning-records-for-hive-transactional-tables-on-hdp
> For example, create an ACID table with HiveCLI and insert a row :
> {code}
> set hive.support.concurrency=true;
> set hive.enforce.bucketing=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> set hive.compactor.initiator.on=true;
> set hive.compactor.worker.threads=1;
>  CREATE TABLE deltas(cle string,valeur string) CLUSTERED BY (cle) INTO 1 
> BUCKETS
> ROW FORMAT SERDE  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> STORED AS 
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> TBLPROPERTIES ('transactional'='true');
> INSERT INTO deltas VALUES("a","a");
> {code}
> Then make a query with spark-sql CLI :
> {code}
> SELECT * FROM deltas;
> {code}
> That query gets no result and there are no errors in logs.
> If you go to HDFS to inspect table files, you find only deltas
> {code}
> ~>hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 1 items
> drwxr-x---   - me hdfs  0 2016-08-10 14:03 
> /apps/hive/warehouse/deltas/delta_0020943_0020943
> {code}
> Then if you run compaction on that table (in HiveCLI) :
> {code}
> ALTER TABLE deltas COMPACT 'MAJOR';
> {code}
> As a result, the delta is compacted into a base file:
> {code}
> ~>hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 1 items
> drwxrwxrwx   - me hdfs  0 2016-08-10 15:25 
> /apps/hive/warehouse/deltas/base_0020943
> {code}
> Go back to spark-sql and the same query gets a result :
> {code}
> SELECT * FROM deltas;
> a   a
> Time taken: 0.477 seconds, Fetched 1 row(s)
> {code}
> But next time you make an insert into Hive table : 
> {code}
> INSERT INTO deltas VALUES("b","b");
> {code}
> spark-sql will immediately see changes : 
> {code}
> SELECT * FROM deltas;
> a   a
> b   b
> Time taken: 0.122 seconds, Fetched 2 row(s)
> {code}
> Yet there was no other compaction, but spark-sql "sees" the base AND the 
> delta file :
> {code}
> ~> hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 2 items
> drwxrwxrwx   - valdata hdfs  0 2016-08-10 15:25 
> /apps/hive/warehouse/deltas/base_0020943
> drwxr-x---   - valdata hdfs  0 2016-08-10 15:31 
> /apps/hive/warehouse/deltas/delta_0020956_0020956
> {code}






[jira] [Commented] (SPARK-16996) Hive ACID delta files not seen

2019-12-18 Thread Abhishek Somani (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-16996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999454#comment-16999454
 ] 

Abhishek Somani commented on SPARK-16996:
-

[~SandhyaMora] We are [extending the work we did 
here|https://github.com/qubole/spark-acid] to write Hive ACID tables from 
Spark. The patch [https://github.com/qubole/spark-acid/pull/30] is up for 
review and will be released soon.

> Hive ACID delta files not seen
> --
>
> Key: SPARK-16996
> URL: https://issues.apache.org/jira/browse/SPARK-16996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.3, 2.1.2, 2.2.0
> Environment: Hive 1.2.1, Spark 1.5.2
>Reporter: Benjamin BONNET
>Priority: Major
>
> spark-sql does not seem to see data stored as delta files in an ACID Hive table.
> I encountered the same problem as described here:
> http://stackoverflow.com/questions/35955666/spark-sql-is-not-returning-records-for-hive-transactional-tables-on-hdp
> For example, create an ACID table with HiveCLI and insert a row :
> {code}
> set hive.support.concurrency=true;
> set hive.enforce.bucketing=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> set hive.compactor.initiator.on=true;
> set hive.compactor.worker.threads=1;
>  CREATE TABLE deltas(cle string,valeur string) CLUSTERED BY (cle) INTO 1 
> BUCKETS
> ROW FORMAT SERDE  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> STORED AS 
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> TBLPROPERTIES ('transactional'='true');
> INSERT INTO deltas VALUES("a","a");
> {code}
> Then make a query with spark-sql CLI :
> {code}
> SELECT * FROM deltas;
> {code}
> That query gets no result and there are no errors in logs.
> If you go to HDFS to inspect table files, you find only deltas
> {code}
> ~>hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 1 items
> drwxr-x---   - me hdfs  0 2016-08-10 14:03 
> /apps/hive/warehouse/deltas/delta_0020943_0020943
> {code}
> Then if you run compaction on that table (in HiveCLI) :
> {code}
> ALTER TABLE deltas COMPACT 'MAJOR';
> {code}
> As a result, the delta is compacted into a base file:
> {code}
> ~>hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 1 items
> drwxrwxrwx   - me hdfs  0 2016-08-10 15:25 
> /apps/hive/warehouse/deltas/base_0020943
> {code}
> Go back to spark-sql and the same query gets a result :
> {code}
> SELECT * FROM deltas;
> a   a
> Time taken: 0.477 seconds, Fetched 1 row(s)
> {code}
> But next time you make an insert into Hive table : 
> {code}
> INSERT INTO deltas VALUES("b","b");
> {code}
> spark-sql will immediately see changes : 
> {code}
> SELECT * FROM deltas;
> a   a
> b   b
> Time taken: 0.122 seconds, Fetched 2 row(s)
> {code}
> Yet there was no other compaction, but spark-sql "sees" the base AND the 
> delta file :
> {code}
> ~> hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 2 items
> drwxrwxrwx   - valdata hdfs  0 2016-08-10 15:25 
> /apps/hive/warehouse/deltas/base_0020943
> drwxr-x---   - valdata hdfs  0 2016-08-10 15:31 
> /apps/hive/warehouse/deltas/delta_0020956_0020956
> {code}






[jira] [Commented] (SPARK-15348) Hive ACID

2019-10-23 Thread Abhishek Somani (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957904#comment-16957904
 ] 

Abhishek Somani commented on SPARK-15348:
-

[~Kelvin.FE] This seems to be happening because you might have 
"hive.strict.managed.tables" set to true on the Hive metastore server. You can 
either try setting it to false or run the above query as "create external 
table test.cars ..." instead of "create table".

If you still face an issue or have more questions, please feel free to open an 
issue at [https://github.com/qubole/spark-acid/issues]

> Hive ACID
> -
>
> Key: SPARK-15348
> URL: https://issues.apache.org/jira/browse/SPARK-15348
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0
>Reporter: Ran Haim
>Priority: Major
>
> Spark does not support any feature of Hive's transactional tables:
> you cannot use Spark to delete/update a table, and it also has problems 
> reading the aggregated data when no compaction was done.
> Also, it seems that compaction is not supported - alter table ... partition 
>  COMPACT 'major'






[jira] [Commented] (SPARK-16996) Hive ACID delta files not seen

2019-07-26 Thread Abhishek Somani (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893807#comment-16893807
 ] 

Abhishek Somani commented on SPARK-16996:
-

We have worked on and open-sourced a data source that enables users to work 
with their Hive ACID transactional tables using Spark. 
 
GitHub: [https://github.com/qubole/spark-acid]
 
It is available as a Spark package, and instructions to use it are on the GitHub 
page. Currently the data source supports reading from Hive ACID tables only, and 
we are working on adding the ability to write to these tables from Spark as 
well.
 
Feedback and suggestions are welcome!

> Hive ACID delta files not seen
> --
>
> Key: SPARK-16996
> URL: https://issues.apache.org/jira/browse/SPARK-16996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.3, 2.1.2, 2.2.0
> Environment: Hive 1.2.1, Spark 1.5.2
>Reporter: Benjamin BONNET
>Priority: Major
>
> spark-sql does not seem to see data stored as delta files in an ACID Hive table.
> I encountered the same problem as described here:
> http://stackoverflow.com/questions/35955666/spark-sql-is-not-returning-records-for-hive-transactional-tables-on-hdp
> For example, create an ACID table with HiveCLI and insert a row :
> {code}
> set hive.support.concurrency=true;
> set hive.enforce.bucketing=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> set hive.compactor.initiator.on=true;
> set hive.compactor.worker.threads=1;
>  CREATE TABLE deltas(cle string,valeur string) CLUSTERED BY (cle) INTO 1 
> BUCKETS
> ROW FORMAT SERDE  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> STORED AS 
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> TBLPROPERTIES ('transactional'='true');
> INSERT INTO deltas VALUES("a","a");
> {code}
> Then make a query with spark-sql CLI :
> {code}
> SELECT * FROM deltas;
> {code}
> That query gets no result and there are no errors in logs.
> If you go to HDFS to inspect table files, you find only deltas
> {code}
> ~>hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 1 items
> drwxr-x---   - me hdfs  0 2016-08-10 14:03 
> /apps/hive/warehouse/deltas/delta_0020943_0020943
> {code}
> Then if you run compaction on that table (in HiveCLI) :
> {code}
> ALTER TABLE deltas COMPACT 'MAJOR';
> {code}
> As a result, the delta is compacted into a base file:
> {code}
> ~>hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 1 items
> drwxrwxrwx   - me hdfs  0 2016-08-10 15:25 
> /apps/hive/warehouse/deltas/base_0020943
> {code}
> Go back to spark-sql and the same query gets a result :
> {code}
> SELECT * FROM deltas;
> a   a
> Time taken: 0.477 seconds, Fetched 1 row(s)
> {code}
> But next time you make an insert into Hive table : 
> {code}
> INSERT INTO deltas VALUES("b","b");
> {code}
> spark-sql will immediately see changes : 
> {code}
> SELECT * FROM deltas;
> a   a
> b   b
> Time taken: 0.122 seconds, Fetched 2 row(s)
> {code}
> Yet there was no other compaction, but spark-sql "sees" the base AND the 
> delta file :
> {code}
> ~> hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 2 items
> drwxrwxrwx   - valdata hdfs  0 2016-08-10 15:25 
> /apps/hive/warehouse/deltas/base_0020943
> drwxr-x---   - valdata hdfs  0 2016-08-10 15:31 
> /apps/hive/warehouse/deltas/delta_0020956_0020956
> {code}






[jira] [Commented] (SPARK-15348) Hive ACID

2019-07-26 Thread Abhishek Somani (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893806#comment-16893806
 ] 

Abhishek Somani commented on SPARK-15348:
-

We have worked on and open-sourced a data source that enables users to work 
with their Hive ACID transactional tables using Spark. 
 
GitHub: [https://github.com/qubole/spark-acid]
 
It is available as a Spark package, and instructions to use it are on the GitHub 
page. Currently the data source supports reading from Hive ACID tables only, and 
we are working on adding the ability to write to these tables from Spark as 
well.
 
Feedback and suggestions are welcome!

> Hive ACID
> -
>
> Key: SPARK-15348
> URL: https://issues.apache.org/jira/browse/SPARK-15348
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0
>Reporter: Ran Haim
>Priority: Major
>
> Spark does not support any feature of Hive's transactional tables:
> you cannot use Spark to delete/update a table, and it also has problems 
> reading the aggregated data when no compaction was done.
> Also, it seems that compaction is not supported - alter table ... partition 
>  COMPACT 'major'


