[jira] [Updated] (SPARK-43947) Incorrect SparkException when missing config in resources in Stage-Level Scheduling
[ https://issues.apache.org/jira/browse/SPARK-43947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-43947: Summary: Incorrect SparkException when missing config in resources in Stage-Level Scheduling (was: Incorrect SparkException when missing amount in resources in Stage-Level Scheduling) > Incorrect SparkException when missing config in resources in Stage-Level > Scheduling > --- > > Key: SPARK-43947 > URL: https://issues.apache.org/jira/browse/SPARK-43947 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 3.4.0 >Reporter: Jacek Laskowski >Priority: Minor > > [ResourceUtils.listResourceIds|https://github.com/apache/spark/blob/807abf9c53ee8c1c7ef69646ebd8a266f60d5580/core/src/main/scala/org/apache/spark/resource/ResourceUtils.scala#L152-L155] > can throw an exception for any missing config, not just `amount`. > {code:scala} > val index = key.indexOf('.') > if (index < 0) { > throw new SparkException(s"You must specify an amount config for > resource: $key " + > s"config: $componentName.$RESOURCE_PREFIX.$key") > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
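A minimal sketch may help illustrate why the message is misleading. The snippet below only mirrors the code quoted above (componentName and RESOURCE_PREFIX are placeholder values taken from that quote, not the full Spark implementation): any resource config key without a '.' suffix, for example spark.executor.resource.gpu=1 with no .amount, lands in this branch, yet the error always talks about an "amount" config.

{code:scala}
import org.apache.spark.SparkException

// Illustrative placeholders mirroring the quoted snippet (not the real ResourceUtils).
val componentName = "spark.executor"
val RESOURCE_PREFIX = "resource"

def resourceName(key: String): String = {
  val index = key.indexOf('.')
  if (index < 0) {
    // Reached for *any* key that lacks a sub-config, not only a missing "amount".
    throw new SparkException(s"You must specify an amount config for resource: $key " +
      s"config: $componentName.$RESOURCE_PREFIX.$key")
  }
  key.substring(0, index)
}
{code}

A more accurate message would name the actual missing sub-config (or say the key is malformed) instead of always pointing at "amount".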
[jira] [Created] (SPARK-43947) Incorrect SparkException when missing amount in resources in Stage-Level Scheduling
Jacek Laskowski created SPARK-43947: --- Summary: Incorrect SparkException when missing amount in resources in Stage-Level Scheduling Key: SPARK-43947 URL: https://issues.apache.org/jira/browse/SPARK-43947 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 3.4.0 Reporter: Jacek Laskowski [ResourceUtils.listResourceIds|https://github.com/apache/spark/blob/807abf9c53ee8c1c7ef69646ebd8a266f60d5580/core/src/main/scala/org/apache/spark/resource/ResourceUtils.scala#L152-L155] can throw an exception for any missing config, not just `amount`. {code:scala} val index = key.indexOf('.') if (index < 0) { throw new SparkException(s"You must specify an amount config for resource: $key " + s"config: $componentName.$RESOURCE_PREFIX.$key") } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43912) Incorrect SparkException for Stage-Level Scheduling in local mode
Jacek Laskowski created SPARK-43912: --- Summary: Incorrect SparkException for Stage-Level Scheduling in local mode Key: SPARK-43912 URL: https://issues.apache.org/jira/browse/SPARK-43912 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 3.4.0 Environment: ```text scala> println(spark.version) 3.4.0 scala> println(sc.master) local[*] ``` Reporter: Jacek Laskowski While in `local[*]` mode, the following `SparkException` is thrown: ```text org.apache.spark.SparkException: TaskResourceProfiles are only supported for Standalone cluster for now when dynamic allocation is disabled. at org.apache.spark.resource.ResourceProfileManager.isSupported(ResourceProfileManager.scala:71) at org.apache.spark.resource.ResourceProfileManager.addResourceProfile(ResourceProfileManager.scala:126) at org.apache.spark.rdd.RDD.withResources(RDD.scala:1802) ... 42 elided ``` This happens for the following snippet: ```scala val rdd = sc.range(0, 9) import org.apache.spark.resource.ResourceProfileBuilder val rpb = new ResourceProfileBuilder val rp1 = rpb.build() rdd.withResources(rp1) ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
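For reference, a hedged sketch of the setup the error message asks for: the same kind of profile, but with explicit task requests and submitted to a standalone master with dynamic allocation disabled. The master URL and resource amounts below are assumptions for illustration, not taken from the issue.

{code:scala}
// Run against e.g. --master spark://<host>:7077 with spark.dynamicAllocation.enabled=false.
import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}

val treqs = new TaskResourceRequests().cpus(2)               // illustrative amount
val rp = new ResourceProfileBuilder().require(treqs).build()
val rdd = sc.range(0, 9).withResources(rp)
rdd.count()
{code}

In local[*] mode the same withResources call fails in ResourceProfileManager.isSupported, as shown in the stack trace above.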
[jira] [Updated] (SPARK-43152) User-defined output metadata path (_spark_metadata)
[ https://issues.apache.org/jira/browse/SPARK-43152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-43152: Summary: User-defined output metadata path (_spark_metadata) (was: Parametrisable output metadata path (_spark_metadata)) > User-defined output metadata path (_spark_metadata) > --- > > Key: SPARK-43152 > URL: https://issues.apache.org/jira/browse/SPARK-43152 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Wojciech Indyk >Priority: Major > > Currently the path of the output metadata is hardcoded: the metadata is > saved in the output path in the _spark_metadata folder. It's a constraint on > the structure of paths that could easily be relaxed by making the output > metadata path configurable. It would help with issues like [changing output directory of > spark streaming > job|https://kb.databricks.com/en_US/streaming/file-sink-streaming], [two jobs > writing to the same output > path|https://issues.apache.org/jira/browse/SPARK-30542] or [partition > discovery|https://stackoverflow.com/questions/61904732/is-it-possible-to-change-location-of-spark-metadata-folder-in-spark-structured/61905158]. > It would also help separate metadata from data in the path structure. > The main target of the change is the getMetadataLogPath method in FileStreamSink. It > has access to sqlConf, so this method can override the default > _spark_metadata path if one is defined in config. Introducing a configurable > metadata path also requires reconsidering the meaning of the hasMetadata method in > FileStreamSink. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
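A minimal sketch of the proposed override point, assuming a new SQL conf key. The key name spark.sql.streaming.fileSink.metadataDir below is made up for illustration and is not an existing Spark option, and the method here only approximates FileStreamSink.getMetadataLogPath.

{code:scala}
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.internal.SQLConf

// Sketch (not actual Spark code): use a user-defined location when configured,
// otherwise keep today's hardcoded <output>/_spark_metadata default.
def getMetadataLogPath(outputPath: Path, sqlConf: SQLConf): Path = {
  val userDefined = sqlConf.getConfString("spark.sql.streaming.fileSink.metadataDir", "")
  if (userDefined.nonEmpty) new Path(userDefined)
  else new Path(outputPath, "_spark_metadata")
}
{code}

As the description notes, hasMetadata would need a matching change, since it could no longer assume the log lives under the output path.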
[jira] [Commented] (SPARK-42977) spark sql Disable vectorized faild
[ https://issues.apache.org/jira/browse/SPARK-42977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707256#comment-17707256 ] Jacek Laskowski commented on SPARK-42977: - Unless you can reproduce it without Iceberg, it's probably an Iceberg issue and should be reported in https://github.com/apache/iceberg/issues. > spark sql Disable vectorized faild > --- > > Key: SPARK-42977 > URL: https://issues.apache.org/jira/browse/SPARK-42977 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 3.3.2 >Reporter: liu >Priority: Major > Fix For: 3.3.2 > > > spark-sql config > {code:java} > ./spark-sql --packages > org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.2.0\ > --conf > spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions > \ > --conf > spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \ > --conf spark.sql.catalog.spark_catalog.type=hive \ > --conf spark.sql.iceberg.handle-timestamp-without-timezone=true \ > --conf spark.sql.parquet.binaryAsString=true \ > --conf spark.sql.parquet.enableVectorizedReader=false \ > --conf spark.sql.parquet.enableNestedColumnVectorizedReader=true \ > --conf spark.sql.parquet.recordLevelFilter=true {code} > > I have configured spark.sql.parquet.enableVectorizedReader=false, but when I query an Iceberg Parquet table the > following error occurs: > > > {code:java} > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:498) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:286) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) at > org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) at > org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: > java.lang.UnsupportedOperationException: Cannot support vectorized reads for > column [hzxm] optional binary hzxm = 8 with encoding DELTA_BYTE_ARRAY.
> Disable vectorized reads to read this table/file at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator.initDataReader(VectorizedPageIterator.java:100) > at > org.apache.iceberg.parquet.BasePageIterator.initFromPage(BasePageIterator.java:140) > at > org.apache.iceberg.parquet.BasePageIterator$1.visit(BasePageIterator.java:105) > at > org.apache.iceberg.parquet.BasePageIterator$1.visit(BasePageIterator.java:96) > at > org.apache.iceberg.shaded.org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:192) > at > org.apache.iceberg.parquet.BasePageIterator.setPage(BasePageIterator.java:95) > at > org.apache.iceberg.parquet.BaseColumnIterator.advance(BaseColumnIterator.java:61) > at > org.apache.iceberg.parquet.BaseColumnIterator.setPageSource(BaseColumnIterator.java:50) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator.setRowGroupInfo(Vec > {code} > > > *{color:#FF}Caused by: java.lang.UnsupportedOperationException: Cannot > support vectorized reads for column [hzxm] optional binary hzxm = 8 with > encoding DELTA_BYTE_ARRAY. Disable vectorized reads to read this > table/file{color}* > > > It seems this parameter has not taken effect. How can I turn off vectorized reads so that I can > successfully query the table? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
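If the table is an Iceberg table, Iceberg's Parquet vectorization is controlled by Iceberg's own table property rather than by spark.sql.parquet.enableVectorizedReader. A hedged sketch follows; the property name is from the Iceberg documentation as I recall it and the catalog/table names are illustrative, so please verify against the Iceberg version in use.

{code:scala}
// Disable Iceberg's vectorized Parquet reads for one table.
spark.sql(
  """ALTER TABLE spark_catalog.db.my_table
    |SET TBLPROPERTIES ('read.parquet.vectorization.enabled' = 'false')""".stripMargin)
{code}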
[jira] [Updated] (SPARK-42496) Introducting Spark Connect at main page
[ https://issues.apache.org/jira/browse/SPARK-42496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-42496: Summary: Introducting Spark Connect at main page (was: Introduction Spark Connect at main page) > Introducting Spark Connect at main page > --- > > Key: SPARK-42496 > URL: https://issues.apache.org/jira/browse/SPARK-42496 > Project: Spark > Issue Type: Sub-task > Components: Connect, Documentation >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > We should document the introduction of Spark Connect at PySpark main > documentation page to give a summary to users. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40821) Fix late record filtering to support chaining of stateful operators
[ https://issues.apache.org/jira/browse/SPARK-40821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-40821: Summary: Fix late record filtering to support chaining of stateful operators (was: Fix late record filtering to support chaining of steteful operators) > Fix late record filtering to support chaining of stateful operators > --- > > Key: SPARK-40821 > URL: https://issues.apache.org/jira/browse/SPARK-40821 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Alex Balikov >Assignee: Alex Balikov >Priority: Major > Fix For: 3.4.0 > > > Currently chaining of stateful operators in Spark Structured Streaming is not > supported for various reasons and is blocked by the unsupported operations > check (spark.sql.streaming.unsupportedOperationCheck flag). We propose to fix > this as chaining of stateful operators is a common streaming scenario - e.g. > stream-stream join -> windowed aggregation > window aggregation -> window aggregation > etc > What is broken: > # every stateful operator performs late record filtering against the global > watermark. When chaining stateful operators (e.g. window aggregations) the > output produced by the first stateful operator is effectively late against > the watermark and thus filtered out by the next operator's late record > filtering (technically the next operator should not do late record filtering > but it can be changed to assert for correctness detection, etc) > # when chaining window aggregations, the first window aggregating operator > produces records with schema \{ window: { start: Timestamp, end: Timestamp }, > agg: Long } - there is no explicit event time in the schema to be used by > the next stateful operator (the correct event time should be window.end - 1 ) > # stream-stream time-interval join can produce late records by semantics, > e.g. if the join condition is: > left.eventTime BETWEEN right.eventTime - INTERVAL 1 HOUR AND right.eventTime + > INTERVAL 1 HOUR > the produced records can be delayed by 1 hr relative to the > watermark. > Proposed fixes: > 1. Issue 1 can be fixed by performing late record filtering against the previous > microbatch watermark instead of the current microbatch watermark. > 2. Issue 2 can be fixed by allowing the window and session_window functions to work > on the window column directly and compute the correct event time > transparently to the user. Also, introduce a window_time SQL function to > compute the correct event time from the window column. > 3. Issue 3 can be fixed by adding support for per-operator watermarks instead of a > single global watermark. In the example of a stream-stream time interval join > followed by a stateful operator, the join operator will 'delay' the > downstream operator watermarks by a correct value to handle the delayed > records. Only stream-stream time-interval joins will delay the > watermark; other operators will not delay downstream watermarks. > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
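For illustration, a hedged sketch of the chained windowed aggregation this work enables, using the window_time function mentioned in proposed fix 2 (the source, column names and window sizes are illustrative, not taken from the issue):

{code:scala}
import org.apache.spark.sql.functions.{col, count, window, window_time}

// First windowed aggregation; its output carries a window struct but no event-time column.
val events = spark.readStream.format("rate").load()
  .withWatermark("timestamp", "10 seconds")

val perMinute = events
  .groupBy(window(col("timestamp"), "1 minute"))
  .agg(count("*").as("cnt"))

// window_time (Spark 3.4) derives an event time (window.end - 1) from the window column,
// so a second windowed aggregation can be chained on top of the first one.
val perHour = perMinute
  .withColumn("eventTime", window_time(col("window")))
  .groupBy(window(col("eventTime"), "1 hour"))
  .agg(count("*").as("windows"))
{code}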
[jira] [Updated] (SPARK-40807) "RocksDB: commit - pause bg time total" metric always 0
[ https://issues.apache.org/jira/browse/SPARK-40807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-40807: Summary: "RocksDB: commit - pause bg time total" metric always 0 (was: RocksDBStateStore always 0 for "RocksDB: commit - pause bg time total" metric) > "RocksDB: commit - pause bg time total" metric always 0 > --- > > Key: SPARK-40807 > URL: https://issues.apache.org/jira/browse/SPARK-40807 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Jacek Laskowski >Priority: Minor > Attachments: spark-streams-commit-pause-bg-time.png > > > {{RocksDBStateStore}} uses > [pauseBg|https://github.com/apache/spark/blob/v3.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreProvider.scala#L131] > key to report "RocksDB: commit - pause bg time" metric while {{RocksDB}} > uses > [pause|https://github.com/apache/spark/blob/v3.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala#L308] > to provide a value every commit. That leads to a name mismatch and 0 > reported (as the default value for no metrics). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
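A simplified illustration of the mismatch (not the actual Spark classes): the commit timings are published under the key "pause", but the metric is read back under "pauseBg", so the lookup always falls through to the default of 0.

{code:scala}
// Timings as produced on commit (key "pause"); the value in ms is illustrative.
val commitLatencyMs: Map[String, Long] = Map("pause" -> 12L)

// The metric is looked up under a different key, so it always reports the default 0.
val pauseBgTimeTotal = commitLatencyMs.getOrElse("pauseBg", 0L)
{code}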
[jira] [Updated] (SPARK-40807) RocksDBStateStore always 0 for "RocksDB: commit - pause bg time total" metric
[ https://issues.apache.org/jira/browse/SPARK-40807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-40807: Summary: RocksDBStateStore always 0 for "RocksDB: commit - pause bg time total" metric (was: RocksDBStateStore always 0 for pause bg time total metric) > RocksDBStateStore always 0 for "RocksDB: commit - pause bg time total" metric > - > > Key: SPARK-40807 > URL: https://issues.apache.org/jira/browse/SPARK-40807 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Jacek Laskowski >Priority: Minor > Attachments: spark-streams-commit-pause-bg-time.png > > > {{RocksDBStateStore}} uses > [pauseBg|https://github.com/apache/spark/blob/v3.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreProvider.scala#L131] > key to report "RocksDB: commit - pause bg time" metric while {{RocksDB}} > uses > [pause|https://github.com/apache/spark/blob/v3.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala#L308] > to provide a value every commit. That leads to a name mismatch and 0 > reported (as the default value for no metrics). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40807) RocksDBStateStore always 0 for "RocksDB: commit - pause bg time total" metric
[ https://issues.apache.org/jira/browse/SPARK-40807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-40807: Attachment: spark-streams-commit-pause-bg-time.png > RocksDBStateStore always 0 for "RocksDB: commit - pause bg time total" metric > - > > Key: SPARK-40807 > URL: https://issues.apache.org/jira/browse/SPARK-40807 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Jacek Laskowski >Priority: Minor > Attachments: spark-streams-commit-pause-bg-time.png > > > {{RocksDBStateStore}} uses > [pauseBg|https://github.com/apache/spark/blob/v3.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreProvider.scala#L131] > key to report "RocksDB: commit - pause bg time" metric while {{RocksDB}} > uses > [pause|https://github.com/apache/spark/blob/v3.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala#L308] > to provide a value every commit. That leads to a name mismatch and 0 > reported (as the default value for no metrics). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40807) RocksDBStateStore always 0 for pause bg time total metric
Jacek Laskowski created SPARK-40807: --- Summary: RocksDBStateStore always 0 for pause bg time total metric Key: SPARK-40807 URL: https://issues.apache.org/jira/browse/SPARK-40807 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.3.0 Reporter: Jacek Laskowski {{RocksDBStateStore}} uses [pauseBg|https://github.com/apache/spark/blob/v3.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreProvider.scala#L131] key to report "RocksDB: commit - pause bg time" metric while {{RocksDB}} uses [pause|https://github.com/apache/spark/blob/v3.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala#L308] to provide a value every commit. That leads to a name mismatch and 0 reported (as the default value for no metrics). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins
[ https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533465#comment-17533465 ] Jacek Laskowski commented on SPARK-17556: - Given: # "I'm running a large query with over 100,000 tasks." # "Total size of serialized results ... is bigger than spark.driver.maxResultSize". I think the issue is not a broadcast join but the size of the result (as computed by these 100k tasks). They have to report back to the driver and I can't think of a reason why a broadcast join would make it any worse. I must be missing something obvious (and chimed in to learn a bit about Spark SQL from you today :)) > Executor side broadcast for broadcast joins > --- > > Key: SPARK-17556 > URL: https://issues.apache.org/jira/browse/SPARK-17556 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Reporter: Reynold Xin >Priority: Major > Attachments: executor broadcast.pdf, executor-side-broadcast.pdf > > > Currently in Spark SQL, in order to perform a broadcast join, the driver must > collect the result of an RDD and then broadcast it. This introduces some > extra latency. It might be possible to broadcast directly from executors. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
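For context, a very simplified sketch of the driver-side path the proposal wants to avoid. The DataFrame and sizes are illustrative, and this is only the shape of the data flow being discussed, not how Spark's broadcast exchange is actually implemented.

{code:scala}
// The build side of a broadcast join is pulled to the driver and re-shipped from there.
// This driver round-trip is what spark.driver.maxResultSize guards against.
val smallDf = spark.range(0, 100).toDF("id")
val buildSideRows = smallDf.collect()                          // rows collected on the driver
val broadcasted = spark.sparkContext.broadcast(buildSideRows)  // re-shipped to executors
{code}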
[jira] [Resolved] (SPARK-36904) The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH
[ https://issues.apache.org/jira/browse/SPARK-36904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski resolved SPARK-36904. - Resolution: Invalid I finally managed to find the root cause of the issue which is {{conf/hive-site.xml}} in {{HIVE_HOME}} with the driver configured (!) Sorry for a false alarm. > The specified datastore driver ("org.postgresql.Driver") was not found in the > CLASSPATH > --- > > Key: SPARK-36904 > URL: https://issues.apache.org/jira/browse/SPARK-36904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 > Environment: Spark 3.2.0 (RC6) > {code:java} > $ ./bin/spark-shell --version > > > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.2.0 > /_/ > Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.12 > Branch heads/v3.2.0-rc6 > Compiled by user jacek on 2021-09-30T10:44:35Z > Revision dde73e2e1c7e55c8e740cb159872e081ddfa7ed6 > Url https://github.com/apache/spark.git > Type --help for more information. > {code} > Built from [https://github.com/apache/spark/commits/v3.2.0-rc6] using the > following command: > {code:java} > $ ./build/mvn \ > -Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \ > -DskipTests \ > clean install > {code} > {code:java} > $ java -version > openjdk version "11.0.12" 2021-07-20 > OpenJDK Runtime Environment Temurin-11.0.12+7 (build 11.0.12+7) > OpenJDK 64-Bit Server VM Temurin-11.0.12+7 (build 11.0.12+7, mixed mode) > {code} >Reporter: Jacek Laskowski >Priority: Critical > Attachments: exception.txt > > > It looks similar to [hivethriftserver built into spark3.0.0. is throwing > error "org.postgresql.Driver" was not found in the > CLASSPATH|https://stackoverflow.com/q/62534653/1305344], but reporting here > for future reference. > After I built the 3.2.0 (RC6) I ran `spark-shell` to execute `sql("describe > table covid_19")`. That gave me the exception (a full version is attached): > {code} > Caused by: java.lang.reflect.InvocationTargetException: > org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" > plugin to create a ConnectionPool gave an error : The specified datastore > driver ("org.postgresql.Driver") was not found in the CLASSPATH. Please check > your CLASSPATH specification, and the name of the driver. 
> at jdk.internal.reflect.GeneratedConstructorAccessor64.newInstance(Unknown > Source) > at > java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:606) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:330) > at > org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:203) > at > org.datanucleus.store.AbstractStoreManager.(AbstractStoreManager.java:162) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:285) > at jdk.internal.reflect.GeneratedConstructorAccessor63.newInstance(Unknown > Source) > at > java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:606) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) > at > org.datanucleus.NucleusContextHelper.createStoreManagerForProperties(NucleusContextHelper.java:133) > at > org.datanucleus.PersistenceNucleusContextImpl.initialise(PersistenceNucleusContextImpl.java:422) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:817) > ... 171 more > Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the > "BONECP" plugin to create a ConnectionPool gave an error : The specified > datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH. > Please check your CLASSPATH specification, and the name of the driver. > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:232) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:117) > at > org.d
[jira] [Commented] (SPARK-36904) The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH
[ https://issues.apache.org/jira/browse/SPARK-36904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17422784#comment-17422784 ] Jacek Laskowski commented on SPARK-36904: - The table does not exist. I simply tried to execute a random SQL command. I'll give the binary tarball a go instead but there could be a difference in how it was built (what profiles were used). I'll investigate further. Thanks. > The specified datastore driver ("org.postgresql.Driver") was not found in the > CLASSPATH > --- > > Key: SPARK-36904 > URL: https://issues.apache.org/jira/browse/SPARK-36904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 > Environment: Spark 3.2.0 (RC6) > {code:java} > $ ./bin/spark-shell --version > > > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.2.0 > /_/ > Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.12 > Branch heads/v3.2.0-rc6 > Compiled by user jacek on 2021-09-30T10:44:35Z > Revision dde73e2e1c7e55c8e740cb159872e081ddfa7ed6 > Url https://github.com/apache/spark.git > Type --help for more information. > {code} > Built from [https://github.com/apache/spark/commits/v3.2.0-rc6] using the > following command: > {code:java} > $ ./build/mvn \ > -Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \ > -DskipTests \ > clean install > {code} > {code:java} > $ java -version > openjdk version "11.0.12" 2021-07-20 > OpenJDK Runtime Environment Temurin-11.0.12+7 (build 11.0.12+7) > OpenJDK 64-Bit Server VM Temurin-11.0.12+7 (build 11.0.12+7, mixed mode) > {code} >Reporter: Jacek Laskowski >Priority: Critical > Attachments: exception.txt > > > It looks similar to [hivethriftserver built into spark3.0.0. is throwing > error "org.postgresql.Driver" was not found in the > CLASSPATH|https://stackoverflow.com/q/62534653/1305344], but reporting here > for future reference. > After I built the 3.2.0 (RC6) I ran `spark-shell` to execute `sql("describe > table covid_19")`. That gave me the exception (a full version is attached): > {code} > Caused by: java.lang.reflect.InvocationTargetException: > org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" > plugin to create a ConnectionPool gave an error : The specified datastore > driver ("org.postgresql.Driver") was not found in the CLASSPATH. Please check > your CLASSPATH specification, and the name of the driver. 
> at jdk.internal.reflect.GeneratedConstructorAccessor64.newInstance(Unknown > Source) > at > java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:606) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:330) > at > org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:203) > at > org.datanucleus.store.AbstractStoreManager.(AbstractStoreManager.java:162) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:285) > at jdk.internal.reflect.GeneratedConstructorAccessor63.newInstance(Unknown > Source) > at > java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:606) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) > at > org.datanucleus.NucleusContextHelper.createStoreManagerForProperties(NucleusContextHelper.java:133) > at > org.datanucleus.PersistenceNucleusContextImpl.initialise(PersistenceNucleusContextImpl.java:422) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:817) > ... 171 more > Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the > "BONECP" plugin to create a ConnectionPool gave an error : The specified > datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH. > Please check your CLASSPATH specification, and the name of the driver. > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:232) > at > org.datanucleus.st
[jira] [Commented] (SPARK-36904) The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH
[ https://issues.apache.org/jira/browse/SPARK-36904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17422727#comment-17422727 ] Jacek Laskowski commented on SPARK-36904: - The solution from [this answer on SO|https://stackoverflow.com/a/62588338/1305344] didn't work for me and am still facing the exception. {code} ./build/mvn \ -Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \ -DskipTests \ dependency:purge-local-repository \ clean install {code} {code} Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" plugin to create a ConnectionPool gave an error : The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver. at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:232) at org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:117) at org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:82) ... 190 more Caused by: org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException: The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver. at org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:58) at org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54) at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:213) ... 192 more {code} > The specified datastore driver ("org.postgresql.Driver") was not found in the > CLASSPATH > --- > > Key: SPARK-36904 > URL: https://issues.apache.org/jira/browse/SPARK-36904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 > Environment: Spark 3.2.0 (RC6) > {code:java} > $ ./bin/spark-shell --version > > > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.2.0 > /_/ > Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.12 > Branch heads/v3.2.0-rc6 > Compiled by user jacek on 2021-09-30T10:44:35Z > Revision dde73e2e1c7e55c8e740cb159872e081ddfa7ed6 > Url https://github.com/apache/spark.git > Type --help for more information. > {code} > Built from [https://github.com/apache/spark/commits/v3.2.0-rc6] using the > following command: > {code:java} > $ ./build/mvn \ > -Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \ > -DskipTests \ > clean install > {code} > {code:java} > $ java -version > openjdk version "11.0.12" 2021-07-20 > OpenJDK Runtime Environment Temurin-11.0.12+7 (build 11.0.12+7) > OpenJDK 64-Bit Server VM Temurin-11.0.12+7 (build 11.0.12+7, mixed mode) > {code} >Reporter: Jacek Laskowski >Priority: Critical > Attachments: exception.txt > > > It looks similar to [hivethriftserver built into spark3.0.0. is throwing > error "org.postgresql.Driver" was not found in the > CLASSPATH|https://stackoverflow.com/q/62534653/1305344], but reporting here > for future reference. > After I built the 3.2.0 (RC6) I ran `spark-shell` to execute `sql("describe > table covid_19")`. 
That gave me the exception (a full version is attached): > {code} > Caused by: java.lang.reflect.InvocationTargetException: > org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" > plugin to create a ConnectionPool gave an error : The specified datastore > driver ("org.postgresql.Driver") was not found in the CLASSPATH. Please check > your CLASSPATH specification, and the name of the driver. > at jdk.internal.reflect.GeneratedConstructorAccessor64.newInstance(Unknown > Source) > at > java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:606) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:330) > at > org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:203) > at > org.datanucleus.store.AbstractStoreManager.(AbstractStoreManager.java:162) > at > org.data
[jira] [Updated] (SPARK-36904) The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH
[ https://issues.apache.org/jira/browse/SPARK-36904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-36904: Attachment: exception.txt > The specified datastore driver ("org.postgresql.Driver") was not found in the > CLASSPATH > --- > > Key: SPARK-36904 > URL: https://issues.apache.org/jira/browse/SPARK-36904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 > Environment: Spark 3.2.0 (RC6) > {code:java} > $ ./bin/spark-shell --version > > > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.2.0 > /_/ > Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.12 > Branch heads/v3.2.0-rc6 > Compiled by user jacek on 2021-09-30T10:44:35Z > Revision dde73e2e1c7e55c8e740cb159872e081ddfa7ed6 > Url https://github.com/apache/spark.git > Type --help for more information. > {code} > Built from [https://github.com/apache/spark/commits/v3.2.0-rc6] using the > following command: > {code:java} > $ ./build/mvn \ > -Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \ > -DskipTests \ > clean install > {code} > {code:java} > $ java -version > openjdk version "11.0.12" 2021-07-20 > OpenJDK Runtime Environment Temurin-11.0.12+7 (build 11.0.12+7) > OpenJDK 64-Bit Server VM Temurin-11.0.12+7 (build 11.0.12+7, mixed mode) > {code} >Reporter: Jacek Laskowski >Priority: Critical > Attachments: exception.txt > > > It looks similar to [hivethriftserver built into spark3.0.0. is throwing > error "org.postgresql.Driver" was not found in the > CLASSPATH|https://stackoverflow.com/q/62534653/1305344], but reporting here > for future reference. > After I built the 3.2.0 (RC6) I ran `spark-shell` to execute `sql("describe > table covid_19")`. That gave me the exception (a full version is attached): > {code} > Caused by: java.lang.reflect.InvocationTargetException: > org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" > plugin to create a ConnectionPool gave an error : The specified datastore > driver ("org.postgresql.Driver") was not found in the CLASSPATH. Please check > your CLASSPATH specification, and the name of the driver. 
> at jdk.internal.reflect.GeneratedConstructorAccessor64.newInstance(Unknown > Source) > at > java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:606) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:330) > at > org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:203) > at > org.datanucleus.store.AbstractStoreManager.(AbstractStoreManager.java:162) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:285) > at jdk.internal.reflect.GeneratedConstructorAccessor63.newInstance(Unknown > Source) > at > java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:606) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) > at > org.datanucleus.NucleusContextHelper.createStoreManagerForProperties(NucleusContextHelper.java:133) > at > org.datanucleus.PersistenceNucleusContextImpl.initialise(PersistenceNucleusContextImpl.java:422) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:817) > ... 171 more > Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the > "BONECP" plugin to create a ConnectionPool gave an error : The specified > datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH. > Please check your CLASSPATH specification, and the name of the driver. > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:232) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:117) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:82) > ... 187 more > Caused by: > org.datanucleus.store.rdbms.connectionpool.Datas
[jira] [Updated] (SPARK-36904) The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH
[ https://issues.apache.org/jira/browse/SPARK-36904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-36904: Environment: Spark 3.2.0 (RC6) {code:java} $ ./bin/spark-shell --version Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.2.0 /_/ Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.12 Branch heads/v3.2.0-rc6 Compiled by user jacek on 2021-09-30T10:44:35Z Revision dde73e2e1c7e55c8e740cb159872e081ddfa7ed6 Url https://github.com/apache/spark.git Type --help for more information. {code} Built from [https://github.com/apache/spark/commits/v3.2.0-rc6] using the following command: {code:java} $ ./build/mvn \ -Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \ -DskipTests \ clean install {code} {code:java} $ java -version openjdk version "11.0.12" 2021-07-20 OpenJDK Runtime Environment Temurin-11.0.12+7 (build 11.0.12+7) OpenJDK 64-Bit Server VM Temurin-11.0.12+7 (build 11.0.12+7, mixed mode) {code} was: {code:java} $ ./bin/spark-shell --version Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.2.0 /_/ Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.12 Branch heads/v3.2.0-rc6 Compiled by user jacek on 2021-09-30T10:44:35Z Revision dde73e2e1c7e55c8e740cb159872e081ddfa7ed6 Url https://github.com/apache/spark.git Type --help for more information. {code} Built from [https://github.com/apache/spark/commits/v3.2.0-rc6] using the following command: {code:java} $ ./build/mvn \ -Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \ -DskipTests \ clean install {code} {code:java} $ java -version openjdk version "11.0.12" 2021-07-20 OpenJDK Runtime Environment Temurin-11.0.12+7 (build 11.0.12+7) OpenJDK 64-Bit Server VM Temurin-11.0.12+7 (build 11.0.12+7, mixed mode) {code} > The specified datastore driver ("org.postgresql.Driver") was not found in the > CLASSPATH > --- > > Key: SPARK-36904 > URL: https://issues.apache.org/jira/browse/SPARK-36904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 > Environment: Spark 3.2.0 (RC6) > {code:java} > $ ./bin/spark-shell --version > > > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.2.0 > /_/ > Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.12 > Branch heads/v3.2.0-rc6 > Compiled by user jacek on 2021-09-30T10:44:35Z > Revision dde73e2e1c7e55c8e740cb159872e081ddfa7ed6 > Url https://github.com/apache/spark.git > Type --help for more information. > {code} > Built from [https://github.com/apache/spark/commits/v3.2.0-rc6] using the > following command: > {code:java} > $ ./build/mvn \ > -Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \ > -DskipTests \ > clean install > {code} > {code:java} > $ java -version > openjdk version "11.0.12" 2021-07-20 > OpenJDK Runtime Environment Temurin-11.0.12+7 (build 11.0.12+7) > OpenJDK 64-Bit Server VM Temurin-11.0.12+7 (build 11.0.12+7, mixed mode) > {code} >Reporter: Jacek Laskowski >Priority: Critical > > It looks similar to [hivethriftserver built into spark3.0.0. is throwing > error "org.postgresql.Driver" was not found in the > CLASSPATH|https://stackoverflow.com/q/62534653/1305344], but reporting here > for future reference. > After I built the 3.2.0 (RC6) I ran `spark-shell` to execute `sql("describe > table covid_19")`. 
That gave me the exception (a full version is attached): > {code} > Caused by: java.lang.reflect.InvocationTargetException: > org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" > plugin to create a ConnectionPool gave an error : The specified datastore > driver ("org.postgresql.Driver") was not found in the CLASSPATH. Please check > your CLASSPATH specification, and the name of the driver. > at jdk.internal.reflect.GeneratedConstructorAccessor64.newInstance(Unknown > Source) > at > java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExte
[jira] [Created] (SPARK-36904) The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH
Jacek Laskowski created SPARK-36904: --- Summary: The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH Key: SPARK-36904 URL: https://issues.apache.org/jira/browse/SPARK-36904 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Environment: {code:java} $ ./bin/spark-shell --version Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.2.0 /_/ Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.12 Branch heads/v3.2.0-rc6 Compiled by user jacek on 2021-09-30T10:44:35Z Revision dde73e2e1c7e55c8e740cb159872e081ddfa7ed6 Url https://github.com/apache/spark.git Type --help for more information. {code} Built from [https://github.com/apache/spark/commits/v3.2.0-rc6] using the following command: {code:java} $ ./build/mvn \ -Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \ -DskipTests \ clean install {code} {code:java} $ java -version openjdk version "11.0.12" 2021-07-20 OpenJDK Runtime Environment Temurin-11.0.12+7 (build 11.0.12+7) OpenJDK 64-Bit Server VM Temurin-11.0.12+7 (build 11.0.12+7, mixed mode) {code} Reporter: Jacek Laskowski It looks similar to [hivethriftserver built into spark3.0.0. is throwing error "org.postgresql.Driver" was not found in the CLASSPATH|https://stackoverflow.com/q/62534653/1305344], but reporting here for future reference. After I built the 3.2.0 (RC6) I ran `spark-shell` to execute `sql("describe table covid_19")`. That gave me the exception (a full version is attached): {code} Caused by: java.lang.reflect.InvocationTargetException: org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" plugin to create a ConnectionPool gave an error : The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver. at jdk.internal.reflect.GeneratedConstructorAccessor64.newInstance(Unknown Source) at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490) at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:606) at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:330) at org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:203) at org.datanucleus.store.AbstractStoreManager.(AbstractStoreManager.java:162) at org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:285) at jdk.internal.reflect.GeneratedConstructorAccessor63.newInstance(Unknown Source) at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490) at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:606) at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) at org.datanucleus.NucleusContextHelper.createStoreManagerForProperties(NucleusContextHelper.java:133) at org.datanucleus.PersistenceNucleusContextImpl.initialise(PersistenceNucleusContextImpl.java:422) at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:817) ... 
171 more Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" plugin to create a ConnectionPool gave an error : The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver. at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:232) at org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:117) at org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:82) ... 187 more Caused by: org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException: The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver. at org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:58) at org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54) at org.datanucleus.store.rdbms.ConnectionF
[jira] [Resolved] (SPARK-34351) Running into "Py4JJavaError" while counting to text file or list using Pyspark, Jupyter notebook
[ https://issues.apache.org/jira/browse/SPARK-34351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski resolved SPARK-34351. - Resolution: Invalid Please use StackOverflow or the user@spark.a.o mailing list to ask this question (as described in [http://spark.apache.org/community.html]). See you there! > Running into "Py4JJavaError" while counting to text file or list using > Pyspark, Jupyter notebook > > > Key: SPARK-34351 > URL: https://issues.apache.org/jira/browse/SPARK-34351 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 > Environment: PS> python --version > *Python 3.6.8* > PS> jupyter --version > *jupyter core : 4.7.0* > *jupyter-notebook : 6.2.0* > qtconsole : 5.0.2 > ipython : 7.16.1 > ipykernel : 5.4.3 > jupyter client : 6.1.11 > jupyter lab : not installed > nbconvert : 6.0.7 > ipywidgets : 7.6.3 > nbformat : 5.1.2 > traitlets : 4.3.3 > PS > java -version > *java version "1.8.0_271"* > Java(TM) SE Runtime Environment (build 1.8.0_271-b09) > Java HotSpot(TM) 64-Bit Server VM (build 25.271-b09, mixed mode) > > Spark version > *spark-2.3.1-bin-hadoop2.7* >Reporter: Huseyin Elci >Priority: Major > > I run into the following error; any help resolving it is greatly appreciated. > *My Code 1:* > {code:python} > import findspark > findspark.init("C:\Spark") > from pyspark.sql import SparkSession > from pyspark.conf import SparkConf > spark = SparkSession.builder\ > .master("local[4]")\ > .appName("WordCount_RDD")\ > .getOrCreate() > sc = spark.sparkContext > data = "D:\\05 Spark\\data\\MyArticle.txt" > story_rdd = sc.textFile(data) > story_rdd.count() > {code} > *My Code 2:* > {code:python} > import findspark > findspark.init("C:\Spark") > from pyspark import SparkContext > sc = SparkContext() > mylist = [1,2,2,3,5,48,98,62,14,55] > mylist_rdd = sc.parallelize(mylist) > mylist_rdd.map(lambda x: x*x) > mylist_rdd.map(lambda x: x*x).collect() > {code} > *ERROR:* > I get the same error for both code snippets.
> {code:python} > --- > Py4JJavaError Traceback (most recent call last) > in > > 1 story_rdd.count() > C:\Spark\python\pyspark\rdd.py in count(self) > 1071 3 > 1072 """ > -> 1073 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() > 1074 > 1075 def stats(self): > C:\Spark\python\pyspark\rdd.py in sum(self) > 1062 6.0 > 1063 """ > -> 1064 return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add) > 1065 > 1066 def count(self): > C:\Spark\python\pyspark\rdd.py in fold(self, zeroValue, op) > 933 # zeroValue provided to each partition is unique from the one provided > 934 # to the final reduce call > --> 935 vals = self.mapPartitions(func).collect() > 936 return reduce(op, vals, zeroValue) > 937 > C:\Spark\python\pyspark\rdd.py in collect(self) > 832 """ > 833 with SCCallSiteSync(self.context) as css: > --> 834 sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) > 835 return list(_load_from_socket(sock_info, self._jrdd_deserializer)) > 836 > C:\Spark\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py in > __call__(self, *args) > 1255 answer = self.gateway_client.send_command(command) > 1256 return_value = get_return_value( > -> 1257 answer, self.gateway_client, self.target_id, self.name) > 1258 > 1259 for temp_arg in temp_args: > C:\Spark\python\pyspark\sql\utils.py in deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > C:\Spark\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 326 raise Py4JJavaError( > 327 "An error occurred while calling > {0} \{1} \{2} > .\n". > --> 328 format(target_id, ".", name), value) > 329 else: > 330 raise Py4JError( > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.collectAndServe. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 > in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 > (TID 1, localhost, executor driver): org.apache.spark.SparkException: Python > worker failed to connect back. > at > org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:148) > at > org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:76) > at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117) > at > org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:86) > at org.apache.spark.api.python.PythonRDD.comput
[jira] [Created] (SPARK-34264) Prevent incomplete master URLs for Spark on Kubernetes early
Jacek Laskowski created SPARK-34264: --- Summary: Prevent incomplete master URLs for Spark on Kubernetes early Key: SPARK-34264 URL: https://issues.apache.org/jira/browse/SPARK-34264 Project: Spark Issue Type: Improvement Components: Kubernetes, Spark Submit Affects Versions: 3.0.1, 3.1.1 Reporter: Jacek Laskowski It turns out that {{--master k8s://}} is accepted and, although it eventually leads to termination, the stacktraces displayed don't really tell what the real cause is. This may happen when the Kubernetes API server(s) are described by an environment variable that's not initialized in the current terminal. {code} $ ./bin/spark-shell --master k8s:// --verbose ... Spark config: (spark.jars,) (spark.app.name,Spark shell) (spark.submit.pyFiles,) (spark.ui.showConsoleProgress,true) (spark.submit.deployMode,client) (spark.master,k8s://https://) ... 21/01/27 14:29:44 ERROR Main: Failed to initialize Spark session. io.fabric8.kubernetes.client.KubernetesClientException: Failed to start websocket at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onFailure(WatchConnectionManager.java:208) ... Caused by: java.net.UnknownHostException: api: nodename nor servname provided, or not known at java.base/java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) at java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929) at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1519) at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848) at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1509) at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1368) at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1302) at okhttp3.Dns$1.lookup(Dns.java:40) at okhttp3.internal.connection.RouteSelector.resetNextInetSocketAddress(RouteSelector.java:185) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
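A hedged sketch of the kind of early check this issue asks for (not actual Spark code): fail fast when the k8s:// master URL carries no API server host, e.g. because $K8S_SERVER was unset in the shell.

{code:scala}
def validateK8sMaster(master: String): Unit = {
  require(master.startsWith("k8s://"), s"Not a Kubernetes master URL: $master")
  val host = master.stripPrefix("k8s://").stripPrefix("https://").stripPrefix("http://")
  require(host.nonEmpty,
    s"Kubernetes master URL '$master' does not specify an API server host")
}

// validateK8sMaster("k8s://") would now fail with a clear message
// instead of a java.net.UnknownHostException much later.
{code}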
[jira] [Commented] (SPARK-32333) Drop references to Master
[ https://issues.apache.org/jira/browse/SPARK-32333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270058#comment-17270058 ] Jacek Laskowski commented on SPARK-32333: - Just today when I was reading about a new ASF project - Apache Gobblin I found (highlighting mine): "Runs as a standalone cluster with *primary* and worker nodes." This "primary node" makes a lot of sense. > Drop references to Master > - > > Key: SPARK-32333 > URL: https://issues.apache.org/jira/browse/SPARK-32333 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > We have a lot of references to "master" in the code base. It will be > beneficial to remove references to problematic language that can alienate > potential community members. > SPARK-32004 removed references to slave > > Here is a IETF draft to fix up some of the most egregious examples > (master/slave, whitelist/backlist) with proposed alternatives. > https://tools.ietf.org/id/draft-knodel-terminology-00.html#rfc.section.1.1.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34158) Incorrect url of the only developer Matei in pom.xml
Jacek Laskowski created SPARK-34158: --- Summary: Incorrect url of the only developer Matei in pom.xml Key: SPARK-34158 URL: https://issues.apache.org/jira/browse/SPARK-34158 Project: Spark Issue Type: Improvement Components: Build, Spark Core Affects Versions: 3.1.1 Reporter: Jacek Laskowski {{[http://www.cs.berkeley.edu/~matei]}} in [pom.xml|https://github.com/apache/spark/blob/53fe365edb948d0e05a5ccb62f349cd9fcb4bb5d/pom.xml#L51] gives {quote}Resource not found The server has encountered a problem because the resource was not found. Your request was : [https://people.eecs.berkeley.edu/~matei] {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34131) NPE when driver.podTemplateFile defines no containers
Jacek Laskowski created SPARK-34131: --- Summary: NPE when driver.podTemplateFile defines no containers Key: SPARK-34131 URL: https://issues.apache.org/jira/browse/SPARK-34131 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.0.1 Reporter: Jacek Laskowski An empty pod template leads to the following NPE: {code} 21/01/15 18:44:32 ERROR KubernetesUtils: Encountered exception while attempting to load initial pod spec from file java.lang.NullPointerException at org.apache.spark.deploy.k8s.KubernetesUtils$.selectSparkContainer(KubernetesUtils.scala:108) at org.apache.spark.deploy.k8s.KubernetesUtils$.loadPodFromTemplate(KubernetesUtils.scala:88) at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$1(KubernetesDriverBuilder.scala:36) at scala.Option.map(Option.scala:230) at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:32) at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:98) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4(KubernetesClientApplication.scala:221) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4$adapted(KubernetesClientApplication.scala:215) at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2539) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:215) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:188) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} {code:java} $> cat empty-template.yml spec: {code} {code} $> ./bin/run-example \ --master k8s://$K8S_SERVER \ --deploy-mode cluster \ --conf spark.kubernetes.driver.podTemplateFile=empty-template.yml \ --name $POD_NAME \ --jars local:///opt/spark/examples/jars/spark-examples_2.12-3.0.1.jar \ --conf spark.kubernetes.container.image=spark:v3.0.1 \ --conf spark.kubernetes.driver.pod.name=$POD_NAME \ --conf spark.kubernetes.namespace=spark-demo \ --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ --verbose \ SparkPi 10 {code} It appears that the implicit requirement is that there's at least one well-defined container of any name (not necessarily {{spark.kubernetes.driver.podTemplateContainerName}}). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
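A minimal sketch of the guard the report suggests, using simplified stand-in types (the real code goes through the fabric8 client API, so the names and shapes here are assumptions):
{code:scala}
// Simplified stand-ins; the actual implementation uses io.fabric8 Pod/Container objects.
case class Container(name: String)
case class PodTemplate(containers: Seq[Container])

// Fail with a descriptive error instead of an NPE when the template defines no containers.
def selectSparkContainer(template: PodTemplate, wantedName: Option[String]): Container =
  template.containers match {
    case Seq() =>
      throw new IllegalArgumentException(
        "The pod template must define at least one container (spark.kubernetes.driver.podTemplateFile)")
    case containers =>
      wantedName.flatMap(n => containers.find(_.name == n)).getOrElse(containers.head)
  }
{code}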
[jira] [Resolved] (SPARK-34024) datasourceV1 VS dataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-34024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski resolved SPARK-34024. - Resolution: Invalid Please post questions to the u...@spark.apache.org mailing list or on StackOverflow at https://stackoverflow.com/questions/tagged/apache-spark > datasourceV1 VS dataSourceV2 > -- > > Key: SPARK-34024 > URL: https://issues.apache.org/jira/browse/SPARK-34024 > Project: Spark > Issue Type: Question > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: Zhenglin luo >Priority: Critical > > I found that DataSourceV2 has been through many versions .So why hasn't > datasourceV2 been used by default until now in the latest version.I want to > know if it’s because there is a big difference in execution efficiency > between v1 and v2. Or there are other reasons. > Thanks a lot -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27708) Add documentation for v2 data sources
[ https://issues.apache.org/jira/browse/SPARK-27708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16860013#comment-16860013 ] Jacek Laskowski commented on SPARK-27708: - [~rdblue] Mind if I asked you to update the requirements (= answer my questions)? Thanks. > Add documentation for v2 data sources > - > > Key: SPARK-27708 > URL: https://issues.apache.org/jira/browse/SPARK-27708 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: documentation > > Before the 3.0 release, the new v2 data sources should be documented. This > includes: > * How to plug in catalog implementations > * Catalog plugin configuration > * Multi-part identifier behavior > * Partition transforms > * Table properties that are used to pass table info (e.g. "provider") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27977) MicroBatchWriter should use StreamWriter for human-friendly textual representation (toString)
Jacek Laskowski created SPARK-27977: --- Summary: MicroBatchWriter should use StreamWriter for human-friendly textual representation (toString) Key: SPARK-27977 URL: https://issues.apache.org/jira/browse/SPARK-27977 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.3 Reporter: Jacek Laskowski The following is an extended explain for a streaming query: {code} == Parsed Logical Plan == WriteToDataSourceV2 org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4737caef +- Project [value#39 AS value#0] +- Streaming RelationV2 socket[value#39] (Options: [host=localhost,port=]) == Analyzed Logical Plan == WriteToDataSourceV2 org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4737caef +- Project [value#39 AS value#0] +- Streaming RelationV2 socket[value#39] (Options: [host=localhost,port=]) == Optimized Logical Plan == WriteToDataSourceV2 org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4737caef +- Streaming RelationV2 socket[value#39] (Options: [host=localhost,port=]) == Physical Plan == WriteToDataSourceV2 org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4737caef +- *(1) Project [value#39] +- *(1) ScanV2 socket[value#39] (Options: [host=localhost,port=]) {code} As you may have noticed, {{WriteToDataSourceV2}} is followed by the internal representation of {{MicroBatchWriter}}, which is a mere adapter for {{StreamWriter}}, e.g. {{ConsoleWriter}}. It'd be more debugging-friendly if the plans included whatever {{StreamWriter.toString}} would return (which in the case of {{ConsoleWriter}} would be {{ConsoleWriter[numRows=..., truncate=...]}} and gives more context). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
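A minimal sketch of the proposed change, assuming {{MicroBatchWriter}} simply wraps a {{StreamWriter}} (the class shapes are simplified stand-ins, not the actual Spark source):
{code:scala}
// Simplified stand-ins for the adapter described in the issue.
trait StreamWriter
case class ConsoleWriter(numRows: Int, truncate: Boolean) extends StreamWriter {
  override def toString: String = s"ConsoleWriter[numRows=$numRows, truncate=$truncate]"
}

class MicroBatchWriter(batchId: Long, writer: StreamWriter) {
  // Delegate to the wrapped writer so explain output shows the user-facing sink,
  // e.g. ConsoleWriter[numRows=20, truncate=true], instead of MicroBatchWriter@4737caef.
  override def toString: String = writer.toString
}
{code}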
[jira] [Created] (SPARK-27975) ConsoleSink should display alias and options for streaming progress
Jacek Laskowski created SPARK-27975: --- Summary: ConsoleSink should display alias and options for streaming progress Key: SPARK-27975 URL: https://issues.apache.org/jira/browse/SPARK-27975 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.3 Reporter: Jacek Laskowski {{console}} sink shows itself in progress with this internal instance representation as follows: {code:json} "sink" : { "description" : "org.apache.spark.sql.execution.streaming.ConsoleSinkProvider@12fa674a" } {code} That is not very user-friendly and would be much better for debugging if it included the alias and options as {{socket}} does: {code} "sources" : [ { "description" : "TextSocketV2[host: localhost, port: ]", ... } ], {code} The entire sample progress looks as follows: {code} 19/06/07 11:47:18 INFO MicroBatchExecution: Streaming query made progress: { "id" : "26bedc9f-076f-4b15-8e17-f09609aaecac", "runId" : "8c365e74-7ac9-4fad-bf1b-397eb086661e", "name" : "socket-console", "timestamp" : "2019-06-07T09:47:18.969Z", "batchId" : 2, "numInputRows" : 0, "inputRowsPerSecond" : 0.0, "durationMs" : { "getEndOffset" : 0, "setOffsetRange" : 0, "triggerExecution" : 0 }, "stateOperators" : [ ], "sources" : [ { "description" : "TextSocketV2[host: localhost, port: ]", "startOffset" : 0, "endOffset" : 0, "numInputRows" : 0, "inputRowsPerSecond" : 0.0 } ], "sink" : { "description" : "org.apache.spark.sql.execution.streaming.ConsoleSinkProvider@12fa674a" } } {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
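A sketch of the friendlier description the issue asks for, assuming the sink knows its alias and options (a simplified stand-in, not the actual {{ConsoleSinkProvider}} code):
{code:scala}
// Report the alias and options, as the socket source already does in its description.
class ConsoleSink(options: Map[String, String]) {
  override def toString: String = {
    val opts = options.map { case (k, v) => s"$k=$v" }.mkString(", ")
    s"console[$opts]"  // e.g. console[numRows=20, truncate=true]
  }
}
{code}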
[jira] [Updated] (SPARK-27933) Extracting common purge "behaviour" to the parent StreamExecution
[ https://issues.apache.org/jira/browse/SPARK-27933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-27933: Summary: Extracting common purge "behaviour" to the parent StreamExecution (was: Introduce StreamExecution.purge for removing entries from metadata logs) > Extracting common purge "behaviour" to the parent StreamExecution > - > > Key: SPARK-27933 > URL: https://issues.apache.org/jira/browse/SPARK-27933 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.3 >Reporter: Jacek Laskowski >Priority: Minor > > Extracting the common {{purge}} "behaviour" to the parent {{StreamExecution}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27933) Introduce StreamExecution.purge for removing entries from metadata logs
Jacek Laskowski created SPARK-27933: --- Summary: Introduce StreamExecution.purge for removing entries from metadata logs Key: SPARK-27933 URL: https://issues.apache.org/jira/browse/SPARK-27933 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.3 Reporter: Jacek Laskowski Extracting the common {{purge}} "behaviour" to the parent {{StreamExecution}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27708) Add documentation for v2 data sources
[ https://issues.apache.org/jira/browse/SPARK-27708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839846#comment-16839846 ] Jacek Laskowski commented on SPARK-27708: - What's really needed? I've been reviewing the code of the new v2 data sources, but don't understand "How to plug in catalog implementations" and the others. I'd like to help with the docs and would like to start with "How to plug in catalog implementations". Please guide. > Add documentation for v2 data sources > - > > Key: SPARK-27708 > URL: https://issues.apache.org/jira/browse/SPARK-27708 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Ryan Blue >Priority: Major > Labels: documentation > > Before the 3.0 release, the new v2 data sources should be documented. This > includes: > * How to plug in catalog implementations > * Catalog plugin configuration > * Multi-part identifier behavior > * Partition transforms > * Table properties that are used to pass table info (e.g. "provider") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27708) Add documentation for v2 data sources
[ https://issues.apache.org/jira/browse/SPARK-27708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-27708: Labels: documentation (was: docuentation) > Add documentation for v2 data sources > - > > Key: SPARK-27708 > URL: https://issues.apache.org/jira/browse/SPARK-27708 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Ryan Blue >Priority: Major > Labels: documentation > > Before the 3.0 release, the new v2 data sources should be documented. This > includes: > * How to plug in catalog implementations > * Catalog plugin configuration > * Multi-part identifier behavior > * Partition transforms > * Table properties that are used to pass table info (e.g. "provider") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20597) KafkaSourceProvider falls back on path as synonym for topic
[ https://issues.apache.org/jira/browse/SPARK-20597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765971#comment-16765971 ] Jacek Laskowski commented on SPARK-20597: - [~nimfadora] sure. go ahead. > KafkaSourceProvider falls back on path as synonym for topic > --- > > Key: SPARK-20597 > URL: https://issues.apache.org/jira/browse/SPARK-20597 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Trivial > Labels: starter > > # {{KafkaSourceProvider}} supports {{topic}} option that sets the Kafka topic > to save a DataFrame's rows to > # {{KafkaSourceProvider}} can use {{topic}} column to assign rows to Kafka > topics for writing > What seems a quite interesting option is to support {{start(path: String)}} > as the least precedence option in which {{path}} would designate the default > topic when no other options are used. > {code} > df.writeStream.format("kafka").start("topic") > {code} > See > http://apache-spark-developers-list.1001551.n3.nabble.com/KafkaSourceProvider-Why-topic-option-and-column-without-reverting-to-path-as-the-least-priority-td21458.html > for discussion -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26063) CatalystDataToAvro gives "UnresolvedException: Invalid call to dataType on unresolved object" when requested for numberedTreeString
Jacek Laskowski created SPARK-26063: --- Summary: CatalystDataToAvro gives "UnresolvedException: Invalid call to dataType on unresolved object" when requested for numberedTreeString Key: SPARK-26063 URL: https://issues.apache.org/jira/browse/SPARK-26063 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Jacek Laskowski The following gives {{org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'id}}: {code:java} // ./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.0 scala> spark.version res0: String = 2.4.0 import org.apache.spark.sql.avro._ val q = spark.range(1).withColumn("to_avro_id", to_avro('id)) val logicalPlan = q.queryExecution.logical scala> logicalPlan.expressions.drop(1).head.numberedTreeString org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'id at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105) at org.apache.spark.sql.avro.CatalystDataToAvro.simpleString(CatalystDataToAvro.scala:56) at org.apache.spark.sql.catalyst.expressions.Expression.verboseString(Expression.scala:233) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:548) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:569) at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:472) at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:469) at org.apache.spark.sql.catalyst.trees.TreeNode.numberedTreeString(TreeNode.scala:483) ... 51 elided{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
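As a possible workaround (an untested sketch, not a confirmed fix), inspecting the analyzed plan instead of the parsed logical plan avoids the unresolved {{'id}} attribute, so {{numberedTreeString}} can render {{CatalystDataToAvro}}:
{code:scala}
// ./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.0
import org.apache.spark.sql.avro._

val q = spark.range(1).withColumn("to_avro_id", to_avro('id))

// In the analyzed plan 'id is resolved to an AttributeReference, so dataType is available
// and the expression tree prints without the UnresolvedException.
val analyzed = q.queryExecution.analyzed
println(analyzed.expressions.drop(1).head.numberedTreeString)
{code}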
[jira] [Created] (SPARK-26062) Rename spark-avro external module to spark-sql-avro (to match spark-sql-kafka)
Jacek Laskowski created SPARK-26062: --- Summary: Rename spark-avro external module to spark-sql-avro (to match spark-sql-kafka) Key: SPARK-26062 URL: https://issues.apache.org/jira/browse/SPARK-26062 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Jacek Laskowski Given the name of {{spark-sql-kafka}} external module it seems appropriate (and consistent) to rename {{spark-avro}} external module to {{spark-sql-avro}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25278) Number of output rows metric of union of views is multiplied by their occurrences
[ https://issues.apache.org/jira/browse/SPARK-25278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-25278: Description: When you use a view in a union multiple times (self-union), the {{number of output rows}} metric seems to be the correct {{number of output rows}} multiplied by the occurrences of the view, e.g. {code:java} scala> spark.version res0: String = 2.3.1 val name = "demo_view" sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2") assert(spark.catalog.tableExists(name)) val view = spark.table(name) assert(view.count == 2) view.union(view).show // gives 4 for every view (as a LocalTableScan), but should be 2 view.union(view).union(view).show // gives 6{code} I think it's because {{View}} logical operator is a {{MultiInstanceRelation}} (and think other {{MultiInstanceRelations}} may also be affected). was: When you use a view in a union multiple times (self-union), the {{number of output rows}} metric seems to be the correct {{number of output rows}} multiplied by the occurrences of the view, e.g. {code:java} scala> spark.version res0: String = 2.3.1 val name = "demo_view" sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2") assert(spark.catalog.tableExists(name)) val view = spark.table(name) assert(view.count == 2) view.union(view).show // gives 4 for every view, but should be 2 view.union(view).union(view).show // gives 6{code} I think it's because {{View}} logical operator is a {{MultiInstanceRelation}} (and think other {{MultiInstanceRelations}} may also be affected). > Number of output rows metric of union of views is multiplied by their > occurrences > - > > Key: SPARK-25278 > URL: https://issues.apache.org/jira/browse/SPARK-25278 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Jacek Laskowski >Priority: Major > Attachments: union-2-views.png, union-3-views.png > > > When you use a view in a union multiple times (self-union), the {{number of > output rows}} metric seems to be the correct {{number of output rows}} > multiplied by the occurrences of the view, e.g. > {code:java} > scala> spark.version > res0: String = 2.3.1 > val name = "demo_view" > sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2") > assert(spark.catalog.tableExists(name)) > val view = spark.table(name) > assert(view.count == 2) > view.union(view).show // gives 4 for every view (as a LocalTableScan), but > should be 2 > view.union(view).union(view).show // gives 6{code} > I think it's because {{View}} logical operator is a {{MultiInstanceRelation}} > (and think other {{MultiInstanceRelations}} may also be affected). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25278) Number of output rows metric of union of views is multiplied by their occurrences
[ https://issues.apache.org/jira/browse/SPARK-25278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-25278: Description: When you use a view in a union multiple times (self-union), the {{number of output rows}} metric seems to be the correct {{number of output rows}} multiplied by the occurrences of the view, e.g. {code:java} scala> spark.version res0: String = 2.3.1 val name = "demo_view" sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2") assert(spark.catalog.tableExists(name)) val view = spark.table(name) assert(view.count == 2) view.union(view).show // gives 4 for every view, but should be 2 view.union(view).union(view).show // gives 6{code} I think it's because {{View}} logical operator is a {{MultiInstanceRelation}} (and think other {{MultiInstanceRelation}}s may also be affected). was: When you use a view in a union multiple times (self-union), the {{number of output rows}} metric seems to be the correct {{number of output rows}} multiplied by the occurrences of the view, e.g. {code:java} scala> spark.version res0: String = 2.3.1 val name = "demo_view" sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2") assert(spark.catalog.tableExists(name)) val view = spark.table(name) assert(view.count == 2) view.union(view).show // gives 4 for every view, but should be 2 view.union(view).union(view).show // gives 6{code} I think it's because {{View}} logical operator is a {{MultiInstanceRelation}} (and think other {{MultiInstanceRelation}}s may also be affected). > Number of output rows metric of union of views is multiplied by their > occurrences > - > > Key: SPARK-25278 > URL: https://issues.apache.org/jira/browse/SPARK-25278 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Jacek Laskowski >Priority: Major > Attachments: union-2-views.png, union-3-views.png > > > When you use a view in a union multiple times (self-union), the {{number of > output rows}} metric seems to be the correct {{number of output rows}} > multiplied by the occurrences of the view, e.g. > {code:java} > scala> spark.version > res0: String = 2.3.1 > val name = "demo_view" > sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2") > assert(spark.catalog.tableExists(name)) > val view = spark.table(name) > assert(view.count == 2) > view.union(view).show // gives 4 for every view, but should be 2 > view.union(view).union(view).show // gives 6{code} > I think it's because {{View}} logical operator is a {{MultiInstanceRelation}} > (and think other {{MultiInstanceRelation}}s may also be affected). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25278) Number of output rows metric of union of views is multiplied by their occurrences
[ https://issues.apache.org/jira/browse/SPARK-25278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-25278: Description: When you use a view in a union multiple times (self-union), the {{number of output rows}} metric seems to be the correct {{number of output rows}} multiplied by the occurrences of the view, e.g. {code:java} scala> spark.version res0: String = 2.3.1 val name = "demo_view" sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2") assert(spark.catalog.tableExists(name)) val view = spark.table(name) assert(view.count == 2) view.union(view).show // gives 4 for every view, but should be 2 view.union(view).union(view).show // gives 6{code} I think it's because {{View}} logical operator is a {{MultiInstanceRelation}} (and think other {{MultiInstanceRelations}} may also be affected). was: When you use a view in a union multiple times (self-union), the {{number of output rows}} metric seems to be the correct {{number of output rows}} multiplied by the occurrences of the view, e.g. {code:java} scala> spark.version res0: String = 2.3.1 val name = "demo_view" sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2") assert(spark.catalog.tableExists(name)) val view = spark.table(name) assert(view.count == 2) view.union(view).show // gives 4 for every view, but should be 2 view.union(view).union(view).show // gives 6{code} I think it's because {{View}} logical operator is a {{MultiInstanceRelation}} (and think other {{MultiInstanceRelation}}s may also be affected). > Number of output rows metric of union of views is multiplied by their > occurrences > - > > Key: SPARK-25278 > URL: https://issues.apache.org/jira/browse/SPARK-25278 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Jacek Laskowski >Priority: Major > Attachments: union-2-views.png, union-3-views.png > > > When you use a view in a union multiple times (self-union), the {{number of > output rows}} metric seems to be the correct {{number of output rows}} > multiplied by the occurrences of the view, e.g. > {code:java} > scala> spark.version > res0: String = 2.3.1 > val name = "demo_view" > sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2") > assert(spark.catalog.tableExists(name)) > val view = spark.table(name) > assert(view.count == 2) > view.union(view).show // gives 4 for every view, but should be 2 > view.union(view).union(view).show // gives 6{code} > I think it's because {{View}} logical operator is a {{MultiInstanceRelation}} > (and think other {{MultiInstanceRelations}} may also be affected). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25278) Number of output rows metric of union of views is multiplied by their occurrences
[ https://issues.apache.org/jira/browse/SPARK-25278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-25278: Attachment: union-3-views.png > Number of output rows metric of union of views is multiplied by their > occurrences > - > > Key: SPARK-25278 > URL: https://issues.apache.org/jira/browse/SPARK-25278 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Jacek Laskowski >Priority: Major > Attachments: union-2-views.png, union-3-views.png > > > When you use a view in a union multiple times (self-union), the {{number of > output rows}} metric seems to be the correct {{number of output rows}} > multiplied by the occurrences of the view, e.g. > {code:java} > scala> spark.version > res0: String = 2.3.1 > val name = "demo_view" > sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2") > assert(spark.catalog.tableExists(name)) > val view = spark.table(name) > assert(view.count == 2) > view.union(view).show // gives 4 for every view, but should be 2 > view.union(view).union(view).show // gives 6{code} > I think it's because {{View}} logical operator is a {{MultiInstanceRelation}} > (and think other {{MultiInstanceRelation}}s may also be affected). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25278) Number of output rows metric of union of views is multiplied by their occurrences
Jacek Laskowski created SPARK-25278: --- Summary: Number of output rows metric of union of views is multiplied by their occurrences Key: SPARK-25278 URL: https://issues.apache.org/jira/browse/SPARK-25278 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: Jacek Laskowski Attachments: union-2-views.png, union-3-views.png When you use a view in a union multiple times (self-union), the {{number of output rows}} metric seems to be the correct {{number of output rows}} multiplied by the occurrences of the view, e.g. {code:java} scala> spark.version res0: String = 2.3.1 val name = "demo_view" sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2") assert(spark.catalog.tableExists(name)) val view = spark.table(name) assert(view.count == 2) view.union(view).show // gives 4 for every view, but should be 2 view.union(view).union(view).show // gives 6{code} I think it's because {{View}} logical operator is a {{MultiInstanceRelation}} (and think other {{MultiInstanceRelation}}s may also be affected). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25278) Number of output rows metric of union of views is multiplied by their occurrences
[ https://issues.apache.org/jira/browse/SPARK-25278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-25278: Attachment: union-2-views.png > Number of output rows metric of union of views is multiplied by their > occurrences > - > > Key: SPARK-25278 > URL: https://issues.apache.org/jira/browse/SPARK-25278 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Jacek Laskowski >Priority: Major > Attachments: union-2-views.png, union-3-views.png > > > When you use a view in a union multiple times (self-union), the {{number of > output rows}} metric seems to be the correct {{number of output rows}} > multiplied by the occurrences of the view, e.g. > {code:java} > scala> spark.version > res0: String = 2.3.1 > val name = "demo_view" > sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2") > assert(spark.catalog.tableExists(name)) > val view = spark.table(name) > assert(view.count == 2) > view.union(view).show // gives 4 for every view, but should be 2 > view.union(view).union(view).show // gives 6{code} > I think it's because {{View}} logical operator is a {{MultiInstanceRelation}} > (and think other {{MultiInstanceRelation}}s may also be affected). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20597) KafkaSourceProvider falls back on path as synonym for topic
[ https://issues.apache.org/jira/browse/SPARK-20597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559412#comment-16559412 ] Jacek Laskowski commented on SPARK-20597: - Sorry [~Satyajit] for not responding earlier. I'd like to get back to it and am wondering if you'd still like to work on this ticket? I'd appreciate if you used a pull request to discuss issues if there are any. Thanks. > KafkaSourceProvider falls back on path as synonym for topic > --- > > Key: SPARK-20597 > URL: https://issues.apache.org/jira/browse/SPARK-20597 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Trivial > Labels: starter > > # {{KafkaSourceProvider}} supports {{topic}} option that sets the Kafka topic > to save a DataFrame's rows to > # {{KafkaSourceProvider}} can use {{topic}} column to assign rows to Kafka > topics for writing > What seems a quite interesting option is to support {{start(path: String)}} > as the least precedence option in which {{path}} would designate the default > topic when no other options are used. > {code} > df.writeStream.format("kafka").start("topic") > {code} > See > http://apache-spark-developers-list.1001551.n3.nabble.com/KafkaSourceProvider-Why-topic-option-and-column-without-reverting-to-path-as-the-least-priority-td21458.html > for discussion -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24899) Add example of monotonically_increasing_id standard function to scaladoc
Jacek Laskowski created SPARK-24899: --- Summary: Add example of monotonically_increasing_id standard function to scaladoc Key: SPARK-24899 URL: https://issues.apache.org/jira/browse/SPARK-24899 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 2.3.1 Reporter: Jacek Laskowski I think an example of {{monotonically_increasing_id}} standard function in scaladoc would help people understand why the function is monotonically increasing and unique but not consecutive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
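For reference, a sketch of the kind of example that could go into the scaladoc; the values shown are illustrative of the scheme (partition id in the upper 31 bits, per-partition record number in the lower 33 bits), which is why the IDs are increasing and unique but not consecutive:
{code:scala}
import org.apache.spark.sql.functions.monotonically_increasing_id

// Two partitions to make the jump between partitions visible.
val df = spark.range(start = 0, end = 4, step = 1, numPartitions = 2)
  .withColumn("id_gen", monotonically_increasing_id())

df.show()
// +---+----------+
// | id|    id_gen|
// +---+----------+
// |  0|         0|
// |  1|         1|
// |  2|8589934592|   <- partition 1 starts at 1L << 33
// |  3|8589934593|
// +---+----------+
{code}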
[jira] [Updated] (SPARK-24408) Move abs function to math_funcs group
[ https://issues.apache.org/jira/browse/SPARK-24408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-24408: Summary: Move abs function to math_funcs group (was: Move abs, bitwiseNOT, isnan, nanvl functions to math_funcs group) > Move abs function to math_funcs group > - > > Key: SPARK-24408 > URL: https://issues.apache.org/jira/browse/SPARK-24408 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > A few math functions ( {{abs}} , {{bitwiseNOT}}, {{isnan}}, {{nanvl}}) are > not in {{math_funcs}} group. They should really be. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24490) Use WebUI.addStaticHandler in web UIs
Jacek Laskowski created SPARK-24490: --- Summary: Use WebUI.addStaticHandler in web UIs Key: SPARK-24490 URL: https://issues.apache.org/jira/browse/SPARK-24490 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.3.0 Reporter: Jacek Laskowski {{WebUI}} defines {{addStaticHandler}} that web UIs don't use (and simply introduce duplication). Let's clean them up and remove duplications. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24408) Move abs, bitwiseNOT, isnan, nanvl functions to math_funcs group
Jacek Laskowski created SPARK-24408: --- Summary: Move abs, bitwiseNOT, isnan, nanvl functions to math_funcs group Key: SPARK-24408 URL: https://issues.apache.org/jira/browse/SPARK-24408 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 2.3.0 Reporter: Jacek Laskowski A few math functions ( {{abs}} , {{bitwiseNOT}}, {{isnan}}, {{nanvl}}) are not in {{math_funcs}} group. They should really be. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24025) Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side
[ https://issues.apache.org/jira/browse/SPARK-24025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16445520#comment-16445520 ] Jacek Laskowski commented on SPARK-24025: - it seems related or duplicated > Join of bucketed and non-bucketed tables can give two exchanges and sorts for > non-bucketed side > --- > > Key: SPARK-24025 > URL: https://issues.apache.org/jira/browse/SPARK-24025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 > Environment: {code:java} > ./bin/spark-shell --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.3.0 > /_/ > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_162 > Branch master > Compiled by user sameera on 2018-02-22T19:24:29Z > Revision a0d7949896e70f427e7f3942ff340c9484ff0aab > Url g...@github.com:sameeragarwal/spark.git > Type --help for more information.{code} >Reporter: Jacek Laskowski >Priority: Major > Attachments: join-jira.png > > > While exploring bucketing I found the following join query of non-bucketed > and bucketed tables that ends up with two exchanges and two sorts in the > physical plan for the non-bucketed join side. > {code} > // Make sure that you don't end up with a BroadcastHashJoin and a > BroadcastExchange > // Disable auto broadcasting > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) > val bucketedTableName = "bucketed_4_id" > val large = spark.range(100) > large.write > .bucketBy(4, "id") > .sortBy("id") > .mode("overwrite") > .saveAsTable(bucketedTableName) > // Describe the table and include bucketing spec only > val descSQL = sql(s"DESC FORMATTED > $bucketedTableName").filter($"col_name".contains("Bucket") || $"col_name" === > "Sort Columns") > scala> descSQL.show(truncate = false) > +--+-+---+ > |col_name |data_type|comment| > +--+-+---+ > |Num Buckets |4| | > |Bucket Columns|[`id`] | | > |Sort Columns |[`id`] | | > +--+-+---+ > val bucketedTable = spark.table(bucketedTableName) > val t1 = spark.range(4) > .repartition(2, $"id") // Use just 2 partitions > .sortWithinPartitions("id") // sort partitions > val q = t1.join(bucketedTable, "id") > // Note two exchanges and sorts > scala> q.explain > == Physical Plan == > *(5) Project [id#79L] > +- *(5) SortMergeJoin [id#79L], [id#77L], Inner >:- *(3) Sort [id#79L ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#79L, 4) >: +- *(2) Sort [id#79L ASC NULLS FIRST], false, 0 >:+- Exchange hashpartitioning(id#79L, 2) >: +- *(1) Range (0, 4, step=1, splits=8) >+- *(4) Sort [id#77L ASC NULLS FIRST], false, 0 > +- *(4) Project [id#77L] > +- *(4) Filter isnotnull(id#77L) > +- *(4) FileScan parquet default.bucketed_4_id[id#77L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id], > PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: > struct > q.foreach(_ => ()) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24025) Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side
[ https://issues.apache.org/jira/browse/SPARK-24025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16445512#comment-16445512 ] Jacek Laskowski commented on SPARK-24025: - I was about to have closed this as a duplicate, but I'm not so sure anymore. Reading the following in SPARK-17570: {quote}If the number of buckets in the output table is a factor of the buckets in the input table, we should be able to avoid `Sort` and `Exchange` and directly join those. {quote} I'm no longer so sure that the two issues are perfectly identical, but they're close "relatives". I think the next step would be to write a test case to reproduce this issue and the fairly simple fix would be to have an physical optimization (rule) that would eliminate one extra Exchange and Sort ops. Who'd help me here? > Join of bucketed and non-bucketed tables can give two exchanges and sorts for > non-bucketed side > --- > > Key: SPARK-24025 > URL: https://issues.apache.org/jira/browse/SPARK-24025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 > Environment: {code:java} > ./bin/spark-shell --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.3.0 > /_/ > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_162 > Branch master > Compiled by user sameera on 2018-02-22T19:24:29Z > Revision a0d7949896e70f427e7f3942ff340c9484ff0aab > Url g...@github.com:sameeragarwal/spark.git > Type --help for more information.{code} >Reporter: Jacek Laskowski >Priority: Major > Attachments: join-jira.png > > > While exploring bucketing I found the following join query of non-bucketed > and bucketed tables that ends up with two exchanges and two sorts in the > physical plan for the non-bucketed join side. 
> {code} > // Make sure that you don't end up with a BroadcastHashJoin and a > BroadcastExchange > // Disable auto broadcasting > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) > val bucketedTableName = "bucketed_4_id" > val large = spark.range(100) > large.write > .bucketBy(4, "id") > .sortBy("id") > .mode("overwrite") > .saveAsTable(bucketedTableName) > // Describe the table and include bucketing spec only > val descSQL = sql(s"DESC FORMATTED > $bucketedTableName").filter($"col_name".contains("Bucket") || $"col_name" === > "Sort Columns") > scala> descSQL.show(truncate = false) > +--+-+---+ > |col_name |data_type|comment| > +--+-+---+ > |Num Buckets |4| | > |Bucket Columns|[`id`] | | > |Sort Columns |[`id`] | | > +--+-+---+ > val bucketedTable = spark.table(bucketedTableName) > val t1 = spark.range(4) > .repartition(2, $"id") // Use just 2 partitions > .sortWithinPartitions("id") // sort partitions > val q = t1.join(bucketedTable, "id") > // Note two exchanges and sorts > scala> q.explain > == Physical Plan == > *(5) Project [id#79L] > +- *(5) SortMergeJoin [id#79L], [id#77L], Inner >:- *(3) Sort [id#79L ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#79L, 4) >: +- *(2) Sort [id#79L ASC NULLS FIRST], false, 0 >:+- Exchange hashpartitioning(id#79L, 2) >: +- *(1) Range (0, 4, step=1, splits=8) >+- *(4) Sort [id#77L ASC NULLS FIRST], false, 0 > +- *(4) Project [id#77L] > +- *(4) Filter isnotnull(id#77L) > +- *(4) FileScan parquet default.bucketed_4_id[id#77L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id], > PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: > struct > q.foreach(_ => ()) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24025) Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side
[ https://issues.apache.org/jira/browse/SPARK-24025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444998#comment-16444998 ] Jacek Laskowski commented on SPARK-24025: - The other issue seems similar. > Join of bucketed and non-bucketed tables can give two exchanges and sorts for > non-bucketed side > --- > > Key: SPARK-24025 > URL: https://issues.apache.org/jira/browse/SPARK-24025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 > Environment: {code:java} > ./bin/spark-shell --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.3.0 > /_/ > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_162 > Branch master > Compiled by user sameera on 2018-02-22T19:24:29Z > Revision a0d7949896e70f427e7f3942ff340c9484ff0aab > Url g...@github.com:sameeragarwal/spark.git > Type --help for more information.{code} >Reporter: Jacek Laskowski >Priority: Major > Attachments: join-jira.png > > > While exploring bucketing I found the following join query of non-bucketed > and bucketed tables that ends up with two exchanges and two sorts in the > physical plan for the non-bucketed join side. > {code} > // Make sure that you don't end up with a BroadcastHashJoin and a > BroadcastExchange > // Disable auto broadcasting > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) > val bucketedTableName = "bucketed_4_id" > val large = spark.range(100) > large.write > .bucketBy(4, "id") > .sortBy("id") > .mode("overwrite") > .saveAsTable(bucketedTableName) > // Describe the table and include bucketing spec only > val descSQL = sql(s"DESC FORMATTED > $bucketedTableName").filter($"col_name".contains("Bucket") || $"col_name" === > "Sort Columns") > scala> descSQL.show(truncate = false) > +--+-+---+ > |col_name |data_type|comment| > +--+-+---+ > |Num Buckets |4| | > |Bucket Columns|[`id`] | | > |Sort Columns |[`id`] | | > +--+-+---+ > val bucketedTable = spark.table(bucketedTableName) > val t1 = spark.range(4) > .repartition(2, $"id") // Use just 2 partitions > .sortWithinPartitions("id") // sort partitions > val q = t1.join(bucketedTable, "id") > // Note two exchanges and sorts > scala> q.explain > == Physical Plan == > *(5) Project [id#79L] > +- *(5) SortMergeJoin [id#79L], [id#77L], Inner >:- *(3) Sort [id#79L ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#79L, 4) >: +- *(2) Sort [id#79L ASC NULLS FIRST], false, 0 >:+- Exchange hashpartitioning(id#79L, 2) >: +- *(1) Range (0, 4, step=1, splits=8) >+- *(4) Sort [id#77L ASC NULLS FIRST], false, 0 > +- *(4) Project [id#77L] > +- *(4) Filter isnotnull(id#77L) > +- *(4) FileScan parquet default.bucketed_4_id[id#77L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id], > PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: > struct > q.foreach(_ => ()) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24025) Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side
[ https://issues.apache.org/jira/browse/SPARK-24025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-24025: Attachment: join-jira.png > Join of bucketed and non-bucketed tables can give two exchanges and sorts for > non-bucketed side > --- > > Key: SPARK-24025 > URL: https://issues.apache.org/jira/browse/SPARK-24025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 > Environment: {code:java} > ./bin/spark-shell --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.3.0 > /_/ > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_162 > Branch master > Compiled by user sameera on 2018-02-22T19:24:29Z > Revision a0d7949896e70f427e7f3942ff340c9484ff0aab > Url g...@github.com:sameeragarwal/spark.git > Type --help for more information.{code} >Reporter: Jacek Laskowski >Priority: Major > Attachments: join-jira.png > > > While exploring bucketing I found the following join query of non-bucketed > and bucketed tables that ends up with two exchanges and two sorts in the > physical plan for the non-bucketed join side. > {code} > // Make sure that you don't end up with a BroadcastHashJoin and a > BroadcastExchange > // Disable auto broadcasting > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) > val bucketedTableName = "bucketed_4_id" > val large = spark.range(100) > large.write > .bucketBy(4, "id") > .sortBy("id") > .mode("overwrite") > .saveAsTable(bucketedTableName) > // Describe the table and include bucketing spec only > val descSQL = sql(s"DESC FORMATTED > $bucketedTableName").filter($"col_name".contains("Bucket") || $"col_name" === > "Sort Columns") > scala> descSQL.show(truncate = false) > +--+-+---+ > |col_name |data_type|comment| > +--+-+---+ > |Num Buckets |4| | > |Bucket Columns|[`id`] | | > |Sort Columns |[`id`] | | > +--+-+---+ > val bucketedTable = spark.table(bucketedTableName) > val t1 = spark.range(4) > .repartition(2, $"id") // Use just 2 partitions > .sortWithinPartitions("id") // sort partitions > val q = t1.join(bucketedTable, "id") > // Note two exchanges and sorts > scala> q.explain > == Physical Plan == > *(5) Project [id#79L] > +- *(5) SortMergeJoin [id#79L], [id#77L], Inner >:- *(3) Sort [id#79L ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#79L, 4) >: +- *(2) Sort [id#79L ASC NULLS FIRST], false, 0 >:+- Exchange hashpartitioning(id#79L, 2) >: +- *(1) Range (0, 4, step=1, splits=8) >+- *(4) Sort [id#77L ASC NULLS FIRST], false, 0 > +- *(4) Project [id#77L] > +- *(4) Filter isnotnull(id#77L) > +- *(4) FileScan parquet default.bucketed_4_id[id#77L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id], > PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: > struct > q.foreach(_ => ()) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24025) Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side
Jacek Laskowski created SPARK-24025: --- Summary: Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side Key: SPARK-24025 URL: https://issues.apache.org/jira/browse/SPARK-24025 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0, 2.3.1 Environment: {code:java} ./bin/spark-shell --version Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.3.0 /_/ Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_162 Branch master Compiled by user sameera on 2018-02-22T19:24:29Z Revision a0d7949896e70f427e7f3942ff340c9484ff0aab Url g...@github.com:sameeragarwal/spark.git Type --help for more information.{code} Reporter: Jacek Laskowski While exploring bucketing I found the following join query of non-bucketed and bucketed tables that ends up with two exchanges and two sorts in the physical plan for the non-bucketed join side. {code} // Make sure that you don't end up with a BroadcastHashJoin and a BroadcastExchange // Disable auto broadcasting spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) val bucketedTableName = "bucketed_4_id" val large = spark.range(100) large.write .bucketBy(4, "id") .sortBy("id") .mode("overwrite") .saveAsTable(bucketedTableName) // Describe the table and include bucketing spec only val descSQL = sql(s"DESC FORMATTED $bucketedTableName").filter($"col_name".contains("Bucket") || $"col_name" === "Sort Columns") scala> descSQL.show(truncate = false) +--+-+---+ |col_name |data_type|comment| +--+-+---+ |Num Buckets |4| | |Bucket Columns|[`id`] | | |Sort Columns |[`id`] | | +--+-+---+ val bucketedTable = spark.table(bucketedTableName) val t1 = spark.range(4) .repartition(2, $"id") // Use just 2 partitions .sortWithinPartitions("id") // sort partitions val q = t1.join(bucketedTable, "id") // Note two exchanges and sorts scala> q.explain == Physical Plan == *(5) Project [id#79L] +- *(5) SortMergeJoin [id#79L], [id#77L], Inner :- *(3) Sort [id#79L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(id#79L, 4) : +- *(2) Sort [id#79L ASC NULLS FIRST], false, 0 :+- Exchange hashpartitioning(id#79L, 2) : +- *(1) Range (0, 4, step=1, splits=8) +- *(4) Sort [id#77L ASC NULLS FIRST], false, 0 +- *(4) Project [id#77L] +- *(4) Filter isnotnull(id#77L) +- *(4) FileScan parquet default.bucketed_4_id[id#77L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct q.foreach(_ => ()) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23830) Spark on YARN in cluster deploy mode fail with NullPointerException when a Spark application is a Scala class not object
[ https://issues.apache.org/jira/browse/SPARK-23830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16442488#comment-16442488 ] Jacek Laskowski commented on SPARK-23830: - It's about how easy it is to find out that the issue is `class` vs `object`. If that's just a single change that could be reported to end users to help them I think it's worth it. > Spark on YARN in cluster deploy mode fail with NullPointerException when a > Spark application is a Scala class not object > > > Key: SPARK-23830 > URL: https://issues.apache.org/jira/browse/SPARK-23830 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > As reported on StackOverflow in [Why does Spark on YARN fail with “Exception > in thread ”Driver“ > java.lang.NullPointerException”?|https://stackoverflow.com/q/49564334/1305344] > the following Spark application fails with {{Exception in thread "Driver" > java.lang.NullPointerException}} with Spark on YARN in cluster deploy mode: > {code} > class MyClass { > def main(args: Array[String]): Unit = { > val c = new MyClass() > c.process() > } > def process(): Unit = { > val sparkConf = new SparkConf().setAppName("my-test") > val sparkSession: SparkSession = > SparkSession.builder().config(sparkConf).getOrCreate() > import sparkSession.implicits._ > > } > ... > } > {code} > The exception is as follows: > {code} > 18/03/29 20:07:52 INFO ApplicationMaster: Starting the user application in a > separate Thread > 18/03/29 20:07:52 INFO ApplicationMaster: Waiting for spark context > initialization... > Exception in thread "Driver" java.lang.NullPointerException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637) > {code} > I think the reason for the exception {{Exception in thread "Driver" > java.lang.NullPointerException}} is due to [the following > code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L700-L701]: > {code} > val mainMethod = userClassLoader.loadClass(args.userClass) > .getMethod("main", classOf[Array[String]]) > {code} > So when {{mainMethod}} is used in [the following > code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L706] > it simply gives NPE. > {code} > mainMethod.invoke(null, userArgs.toArray) > {code} > That could be easily avoided with an extra check if the {{mainMethod}} is > initialized and give a user a message what may have been a reason. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23830) Spark on YARN in cluster deploy mode fail with NullPointerException when a Spark application is a Scala class not object
Jacek Laskowski created SPARK-23830: --- Summary: Spark on YARN in cluster deploy mode fail with NullPointerException when a Spark application is a Scala class not object Key: SPARK-23830 URL: https://issues.apache.org/jira/browse/SPARK-23830 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 2.3.0 Reporter: Jacek Laskowski As reported on StackOverflow in [Why does Spark on YARN fail with “Exception in thread ”Driver“ java.lang.NullPointerException”?|https://stackoverflow.com/q/49564334/1305344] the following Spark application fails with {{Exception in thread "Driver" java.lang.NullPointerException}} with Spark on YARN in cluster deploy mode: {code} class MyClass { def main(args: Array[String]): Unit = { val c = new MyClass() c.process() } def process(): Unit = { val sparkConf = new SparkConf().setAppName("my-test") val sparkSession: SparkSession = SparkSession.builder().config(sparkConf).getOrCreate() import sparkSession.implicits._ } ... } {code} The exception is as follows: {code} 18/03/29 20:07:52 INFO ApplicationMaster: Starting the user application in a separate Thread 18/03/29 20:07:52 INFO ApplicationMaster: Waiting for spark context initialization... Exception in thread "Driver" java.lang.NullPointerException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637) {code} I think the reason for the exception {{Exception in thread "Driver" java.lang.NullPointerException}} is due to [the following code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L700-L701]: {code} val mainMethod = userClassLoader.loadClass(args.userClass) .getMethod("main", classOf[Array[String]]) {code} So when {{mainMethod}} is used in [the following code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L706] it simply gives NPE. {code} mainMethod.invoke(null, userArgs.toArray) {code} That could be easily avoided with an extra check whether {{mainMethod}} is initialized, and a message to the user about what may have been the reason. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
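For completeness, a minimal sketch of the working form (the name {{MyApp}} is illustrative): since {{ApplicationMaster}} looks up {{main}} reflectively and invokes it with a {{null}} receiver, the entry point has to be a Scala {{object}} (which exposes {{main}} as a static method), not a {{class}}:
{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Defining the entry point as an object makes `main` a static method on the JVM,
// so mainMethod.invoke(null, ...) in ApplicationMaster succeeds.
object MyApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("my-test")
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()
    import spark.implicits._
    // ... application logic ...
    spark.stop()
  }
}
{code}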
[jira] [Created] (SPARK-23731) FileSourceScanExec throws NullPointerException in subexpression elimination
Jacek Laskowski created SPARK-23731: --- Summary: FileSourceScanExec throws NullPointerException in subexpression elimination Key: SPARK-23731 URL: https://issues.apache.org/jira/browse/SPARK-23731 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0, 2.2.1, 2.3.1 Reporter: Jacek Laskowski While working with a SQL with many {{CASE WHEN}} and {{ScalarSubqueries}} I faced the following exception (in Spark 2.3.0): {code:java} Caused by: java.lang.NullPointerException at org.apache.spark.sql.execution.FileSourceScanExec.(DataSourceScanExec.scala:167) at org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:502) at org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:158) at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210) at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224) at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210) at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224) at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210) at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224) at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210) at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209) at org.apache.spark.sql.catalyst.plans.QueryPlan.sameResult(QueryPlan.scala:257) at org.apache.spark.sql.execution.ScalarSubquery.semanticEquals(subquery.scala:58) at 
org.apache.spark.sql.catalyst.expressions.EquivalentExpressions$Expr.equals(EquivalentExpressions.scala:36) at scala.collection.mutable.HashTable$class.elemEquals(HashTable.scala:358) at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:40) at scala.collection.mutable.HashTable$class.scala$collection$mutable$HashTable$$findEntry0(HashTable.scala:136) at scala.collection.mutable.HashTable$class.findEntry(HashTable.scala:132) at scala.collection.mutable.HashMap.findEntry(HashMap.scala:40) at scala.collection.mutable.HashMap.get(HashMap.scala:70) at org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExpr(EquivalentExpressions.scala:54) at org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExprTree(EquivalentExpressions.scala:95) at org.apache.spark.sql.catalyst.expressions.EquivalentExpressions$$anonfun$addExprTree$1.apply(EquivalentExpressions.scala:96) at org.apache.spark.sql.catalyst.expressions.EquivalentExpressions$$anonfun$addExprTree$1.apply(EquivalentExpressions.scala:96) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExprTree(EquivalentExpressions.scala:96) at org.apache.spark.sql.cataly
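The description above cuts off mid stack trace. Purely as an illustration of the shape of query being described (many {{CASE WHEN}} branches mixed with scalar subqueries over a file-based relation), here is a minimal, hypothetical example; it is not claimed to reproduce the NPE:
{code:scala}
// Hypothetical repro-style query: CASE WHEN conditions containing scalar subqueries
// over a Parquet (file-based) table, the combination described in the report.
spark.range(5).write.mode("overwrite").parquet("/tmp/t1")
spark.read.parquet("/tmp/t1").createOrReplaceTempView("t1")

val q = spark.sql("""
  SELECT id,
         CASE WHEN id > (SELECT max(id) / 2 FROM t1) THEN 'high'
              WHEN id > (SELECT avg(id) FROM t1)     THEN 'mid'
              ELSE 'low'
         END AS bucket
  FROM t1
""")
q.explain(true)
q.show()
{code}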
[jira] [Commented] (SPARK-20536) Extend ColumnName to create StructFields with explicit nullable
[ https://issues.apache.org/jira/browse/SPARK-20536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398899#comment-16398899 ] Jacek Laskowski commented on SPARK-20536: - I'm not sure how meaningful it still is, but given it's still open, it could be fixed. > Extend ColumnName to create StructFields with explicit nullable > --- > > Key: SPARK-20536 > URL: https://issues.apache.org/jira/browse/SPARK-20536 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Trivial > > {{ColumnName}} defines methods to create {{StructFields}}. > It'd be very user-friendly if there were methods to create {{StructFields}} > with explicit {{nullable}} property (currently implicitly {{true}}). > That could look as follows: > {code} > // E.g. def int: StructField = StructField(name, IntegerType) > def int(nullable: Boolean): StructField = StructField(name, IntegerType, > nullable) > // or (untested) > def int(nullable: Boolean): StructField = int.copy(nullable = nullable) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
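To make the proposal above concrete, a small usage sketch; {{int(nullable: Boolean)}} and similar overloads are the proposed API, not something that exists today:
{code:scala}
import spark.implicits._            // provides the $"..." (ColumnName) syntax
import org.apache.spark.sql.types._

// today: nullable is implicitly true
val id: StructField = $"id".int

// proposed: explicit nullable (hypothetical overload)
val idNotNull: StructField = $"id".int(nullable = false)

val schema = StructType(Seq(idNotNull, $"name".string))
{code}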
[jira] [Created] (SPARK-23229) Dataset.hint should use planWithBarrier logical plan
Jacek Laskowski created SPARK-23229: --- Summary: Dataset.hint should use planWithBarrier logical plan Key: SPARK-23229 URL: https://issues.apache.org/jira/browse/SPARK-23229 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.1 Reporter: Jacek Laskowski Every time {{Dataset.hint}} is used, it triggers execution of logical commands, their unions, and hint resolution (among other things the analyzer does). {{hint}} should use {{planWithBarrier}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22457) Tables are supposed to be MANAGED only taking into account whether a path is provided
[ https://issues.apache.org/jira/browse/SPARK-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16326959#comment-16326959 ] Jacek Laskowski commented on SPARK-22457: - That should be fairly easy to fix _iff_ we want to restrict the formats to {{FileFormat}} (that the mentioned formats are subtypes of). Care to submit a pull request with the places where {{path}} is used to limit their scope to {{FileFormats}} only? (that would help draw more attention to the issue). > Tables are supposed to be MANAGED only taking into account whether a path is > provided > - > > Key: SPARK-22457 > URL: https://issues.apache.org/jira/browse/SPARK-22457 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: David Arroyo >Priority: Major > > As far as I know, since Spark 2.2, tables are supposed to be MANAGED only > taking into account whether a path is provided: > {code:java} > val tableType = if (storage.locationUri.isDefined) { > CatalogTableType.EXTERNAL > } else { > CatalogTableType.MANAGED > } > {code} > This solution seems to be right for filesystem based data sources. On the > other hand, when working with other data sources such as elasticsearch, that > solution is leading to a weird behaviour described below: > 1) InMemoryCatalog's doCreateTable() adds a locationURI if > CatalogTableType.MANAGED && tableDefinition.storage.locationUri.isEmpty. > 2) Before loading the data source table FindDataSourceTable's > readDataSourceTable() adds a path option if locationURI exists: > {code:java} > val pathOption = table.storage.locationUri.map("path" -> > CatalogUtils.URIToString(_)) > {code} > 3) That causes an error when reading from elasticsearch because 'path' is an > option already supported by elasticsearch (locationUri is set to > file:/home/user/spark-rv/elasticsearch/shop/clients) > org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot find > mapping for file:/home/user/spark-rv/elasticsearch/shop/clients - one is > required before using Spark SQL > Would be possible only to mark tables as MANAGED for a subset of data sources > (TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE) or think about any other solution? > P.S. InMemoryCatalog' doDropTable() deletes the directory of the table which > from my point of view should only be required for filesystem based data > sources: > {code:java} >if (tableMeta.tableType == CatalogTableType.MANAGED) >... >// Delete the data/directory of the table > val dir = new Path(tableMeta.location) > try { > val fs = dir.getFileSystem(hadoopConfig) > fs.delete(dir, true) > } catch { > case e: IOException => > throw new SparkException(s"Unable to drop table $table as failed > " + > s"to delete its directory $dir", e) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
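A rough, untested sketch of the idea from the comment above, i.e. deciding the table type based on whether the data source is file-based at all; {{providingClass}} is assumed to be the resolved data source class (as {{DataSource}} keeps it), so treat this as an illustration rather than the actual change:
{code:scala}
import org.apache.spark.sql.catalyst.catalog.CatalogTableType
import org.apache.spark.sql.execution.datasources.FileFormat

// Only file-based sources without an explicit location become MANAGED; everything else
// (e.g. elasticsearch) stays EXTERNAL, so no locationUri/path gets injected for it.
def tableTypeFor(providingClass: Class[_], hasLocation: Boolean): CatalogTableType = {
  val isFileBased = classOf[FileFormat].isAssignableFrom(providingClass)
  if (hasLocation || !isFileBased) CatalogTableType.EXTERNAL else CatalogTableType.MANAGED
}
{code}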
[jira] [Created] (SPARK-22954) ANALYZE TABLE fails with NoSuchTableException for temporary tables (but should have reported "not supported on views")
Jacek Laskowski created SPARK-22954: --- Summary: ANALYZE TABLE fails with NoSuchTableException for temporary tables (but should have reported "not supported on views") Key: SPARK-22954 URL: https://issues.apache.org/jira/browse/SPARK-22954 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Environment: {code} $ ./bin/spark-shell --version Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT /_/ Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_152 Branch master Compiled by user jacek on 2018-01-04T05:44:05Z Revision 7d045c5f00e2c7c67011830e2169a4e130c3ace8 {code} Reporter: Jacek Laskowski Priority: Minor {{ANALYZE TABLE}} fails with {{NoSuchTableException: Table or view 'names' not found in database 'default';}} for temporary tables (views) while the reason is that it can only work with permanent tables (which [it can report|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala#L38] if it had a chance). {code} scala> names.createOrReplaceTempView("names") scala> sql("ANALYZE TABLE names COMPUTE STATISTICS") org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'names' not found in database 'default'; at org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:181) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:398) at org.apache.spark.sql.execution.command.AnalyzeTableCommand.run(AnalyzeTableCommand.scala:36) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:187) at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:187) at org.apache.spark.sql.Dataset$$anonfun$51.apply(Dataset.scala:3244) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3243) at org.apache.spark.sql.Dataset.(Dataset.scala:187) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:72) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638) ... 50 elided {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
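A rough sketch (not the actual change) of how {{AnalyzeTableCommand}} could report the clearer error first, with {{tableIdent}} being the identifier given to ANALYZE TABLE; it assumes the session catalog's {{isTemporaryTable}} check is available at that point:
{code:scala}
import org.apache.spark.sql.AnalysisException

// Check for a temporary view before looking the table up in the current database,
// so the user gets "not supported on views" instead of NoSuchTableException.
if (sessionState.catalog.isTemporaryTable(tableIdent)) {
  throw new AnalysisException("ANALYZE TABLE is not supported on views.")
}
val tableMeta = sessionState.catalog.getTableMetadata(tableIdent)
{code}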
[jira] [Commented] (SPARK-22935) Dataset with Java Beans for java.sql.Date throws CompileException
[ https://issues.apache.org/jira/browse/SPARK-22935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308363#comment-16308363 ] Jacek Laskowski commented on SPARK-22935: - It does not seem to be the case as described in https://stackoverflow.com/q/48026060/1305344 where the OP wanted to {{inferSchema}} with no values that would be of {{Date}} or {{Timestamp}} and hence Spark SQL infers strings. But...I think all's fine though as I said at SO: {quote} TL;DR Define the schema explicitly since the input dataset does not have values to infer types from (for java.sql.Date fields). {quote} I think you can close the issue as {{Invalid}}. > Dataset with Java Beans for java.sql.Date throws CompileException > - > > Key: SPARK-22935 > URL: https://issues.apache.org/jira/browse/SPARK-22935 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Kazuaki Ishizaki > > The following code can throw an exception with or without whole-stage codegen. > {code} > public void SPARK22935() { > Dataset cdr = spark > .read() > .format("csv") > .option("header", "true") > .option("inferSchema", "true") > .option("delimiter", ";") > .csv("CDR_SAMPLE.csv") > .as(Encoders.bean(CDR.class)); > Dataset ds = cdr.filter((FilterFunction) x -> (x.timestamp != > null)); > long c = ds.count(); > cdr.show(2); > ds.show(2); > System.out.println("cnt=" + c); > } > // CDR.java > public class CDR implements java.io.Serializable { > public java.sql.Date timestamp; > public java.sql.Date getTimestamp() { return this.timestamp; } > public void setTimestamp(java.sql.Date timestamp) { this.timestamp = > timestamp; } > } > // CDR_SAMPLE.csv > timestamp > 2017-10-29T02:37:07.815Z > 2017-10-29T02:38:07.815Z > {code} > result > {code} > 12:17:10.352 ERROR > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 61, Column 70: No applicable constructor/method found > for actual parameters "long"; candidates are: "public static java.sql.Date > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)" > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 61, Column 70: No applicable constructor/method found for actual parameters > "long"; candidates are: "public static java.sql.Date > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)" > at > org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821) > ... > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
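A small sketch of the workaround from the SO answer quoted above: define the schema explicitly instead of relying on {{inferSchema}} (column and file names follow the report; the date parsing options may still need adjusting to the actual data):
{code:scala}
import org.apache.spark.sql.types._

// Explicit schema: the timestamp column is declared as DateType (matching the
// java.sql.Date bean field) instead of being inferred as a string.
val schema = StructType(Seq(StructField("timestamp", DateType)))

val cdr = spark.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", ";")
  .schema(schema)            // instead of .option("inferSchema", "true")
  .load("CDR_SAMPLE.csv")
{code}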
[jira] [Commented] (SPARK-22929) Short name for "kafka" doesn't work in pyspark with packages
[ https://issues.apache.org/jira/browse/SPARK-22929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307149#comment-16307149 ] Jacek Laskowski commented on SPARK-22929: - When I saw the issue I was very surprised, as that's perhaps one of the most often used data sources on...StackOverflow :) But then I'm not using pyspark (so it may be something about how `spark-submit` handles Python that got hosed). > Short name for "kafka" doesn't work in pyspark with packages > > > Key: SPARK-22929 > URL: https://issues.apache.org/jira/browse/SPARK-22929 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Michael Armbrust >Priority: Critical > > When I start pyspark using the following command: > {code} > bin/pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 > {code} > The following throws an error: > {code} > spark.read.format("kakfa")... > py4j.protocol.Py4JJavaError: An error occurred while calling o35.load. > : java.lang.ClassNotFoundException: Failed to find data source: kakfa. Please > find packages at http://spark.apache.org/third-party-projects.html > {code} > The following does work: > {code} > spark.read.format("org.apache.spark.sql.kafka010.KafkaSourceProvider")... > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22048) Show id, runId, batch in Description column in SQL tab for streaming queries (as in Jobs)
[ https://issues.apache.org/jira/browse/SPARK-22048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-22048: Attachment: webui-jobs-description.png webui-sql-description.png > Show id, runId, batch in Description column in SQL tab for streaming queries > (as in Jobs) > - > > Key: SPARK-22048 > URL: https://issues.apache.org/jira/browse/SPARK-22048 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Minor > Attachments: webui-jobs-description.png, webui-sql-description.png > > > web UI's Jobs tab shows {{id}}, {{runId}} and {{batch}} of every streaming > batch (which is very handy), but think SQL tab would benefit from it too > (perhaps even more given that's the tab where SQL queries are displayed). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22048) Show id, runId, batch in Description column in SQL tab for streaming queries (as in Jobs)
Jacek Laskowski created SPARK-22048: --- Summary: Show id, runId, batch in Description column in SQL tab for streaming queries (as in Jobs) Key: SPARK-22048 URL: https://issues.apache.org/jira/browse/SPARK-22048 Project: Spark Issue Type: Improvement Components: SQL, Web UI Affects Versions: 2.2.0 Reporter: Jacek Laskowski Priority: Minor The web UI's Jobs tab shows {{id}}, {{runId}} and {{batch}} of every streaming batch (which is very handy), but I think the SQL tab would benefit from it too (perhaps even more, given that's the tab where SQL queries are displayed). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22044) explain function with codegen and cost parameters
Jacek Laskowski created SPARK-22044: --- Summary: explain function with codegen and cost parameters Key: SPARK-22044 URL: https://issues.apache.org/jira/browse/SPARK-22044 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Jacek Laskowski Priority: Minor {{explain}} operator creates {{ExplainCommand}} runnable command that accepts (among other things) {{codegen}} and {{cost}} arguments. There's no version of {{explain}} to allow for this. That's however possible using SQL which is kind of surprising (given how much focus is devoted to the Dataset API). This is to have another {{explain}} with {{codegen}} and {{cost}} arguments, i.e. {code} def explain(codegen: Boolean = false, cost: Boolean = false): Unit {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
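For reference, the SQL route that already works today, plus a rough sketch of what the proposed {{Dataset}} overload could look like (the overload is hypothetical; it only assumes {{ExplainCommand}} already takes the {{codegen}} and {{cost}} flags, as stated above):
{code:scala}
// Works today through SQL:
spark.sql("EXPLAIN CODEGEN SELECT * FROM range(5)").show(false)
spark.sql("EXPLAIN COST SELECT * FROM range(5)").show(false)

// Proposed Dataset API (sketch only):
// def explain(codegen: Boolean = false, cost: Boolean = false): Unit = {
//   val cmd = ExplainCommand(queryExecution.logical,
//     extended = false, codegen = codegen, cost = cost)
//   sparkSession.sessionState.executePlan(cmd).executedPlan
//     .executeCollect().foreach(row => println(row.getString(0)))
// }
{code}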
[jira] [Commented] (SPARK-22040) current_date function with timezone id
[ https://issues.apache.org/jira/browse/SPARK-22040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169215#comment-16169215 ] Jacek Laskowski commented on SPARK-22040: - That'd be awesome! It's yours, [~mgaido] > current_date function with timezone id > -- > > Key: SPARK-22040 > URL: https://issues.apache.org/jira/browse/SPARK-22040 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Minor > > {{current_date}} function creates {{CurrentDate}} expression that accepts > optional timezone id, but there's no function to allow for this. > This is to have another {{current_date}} with the timezone id, i.e. > {code} > def current_date(timeZoneId: String): Column > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22040) current_date function with timezone id
Jacek Laskowski created SPARK-22040: --- Summary: current_date function with timezone id Key: SPARK-22040 URL: https://issues.apache.org/jira/browse/SPARK-22040 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Jacek Laskowski Priority: Minor {{current_date}} function creates {{CurrentDate}} expression that accepts optional timezone id, but there's no function to allow for this. This is to have another {{current_date}} with the timezone id, i.e. {code} def current_date(timeZoneId: String): Column {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
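A rough sketch of the proposed overload (it does not exist today); it only relies on {{CurrentDate}} taking an optional timezone id, as described above, and on the private {{withExpr}} helper that the other functions in {{functions.scala}} use:
{code:scala}
// Sketch of the proposed function (inside org.apache.spark.sql.functions):
def current_date(timeZoneId: String): Column = withExpr {
  CurrentDate(Some(timeZoneId))
}

// Intended usage (hypothetical until the overload exists):
// df.select(current_date("Europe/Warsaw"))
{code}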
[jira] [Created] (SPARK-21901) Define toString for StateOperatorProgress
Jacek Laskowski created SPARK-21901: --- Summary: Define toString for StateOperatorProgress Key: SPARK-21901 URL: https://issues.apache.org/jira/browse/SPARK-21901 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.2.0 Reporter: Jacek Laskowski Priority: Trivial {{StateOperatorProgress}} should define its own {{toString}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
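One possible shape for it, as a sketch only; the field names are assumed from the metrics {{StateOperatorProgress}} already exposes:
{code:scala}
// Sketch: render the state metrics instead of the default Object.toString.
override def toString: String =
  s"StateOperatorProgress(numRowsTotal=$numRowsTotal, numRowsUpdated=$numRowsUpdated)"
{code}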
[jira] [Created] (SPARK-21886) Use SparkSession.internalCreateDataFrame to create Dataset with LogicalRDD logical operator
Jacek Laskowski created SPARK-21886: --- Summary: Use SparkSession.internalCreateDataFrame to create Dataset with LogicalRDD logical operator Key: SPARK-21886 URL: https://issues.apache.org/jira/browse/SPARK-21886 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Jacek Laskowski Priority: Minor While exploring where {{LogicalRDD}} is created I noticed that there are a few places that beg for {{SparkSession.internalCreateDataFrame}}. The task is to simply re-use the method wherever possible. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
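An illustrative before/after of the re-use being proposed (a sketch only; the concrete call sites are not listed here):
{code:scala}
// before: building the Dataset around LogicalRDD by hand
// val df = Dataset.ofRows(sparkSession,
//   LogicalRDD(schema.toAttributes, catalystRows)(sparkSession))

// after: re-using the existing helper
// val df = sparkSession.internalCreateDataFrame(catalystRows, schema)
{code}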
[jira] [Updated] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-21728: Attachment: logging.patch sparksubmit.patch > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > Attachments: logging.patch, sparksubmit.patch > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147957#comment-16147957 ] Jacek Laskowski commented on SPARK-21728: - After I changed your change, I could see the logs again. No idea if the changes made sense or not, but see the logs and that counts :) I'm attaching my changes if these could help you somehow. > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147840#comment-16147840 ] Jacek Laskowski commented on SPARK-21728: - The idea behind the custom {{conf/log4j.properties}} is to disable all the logging and enable only {{org.apache.spark.sql.execution.streaming}} currently. {code} $ cat conf/log4j.properties # Set everything to be logged to the console log4j.rootCategory=OFF, console log4j.appender.console=org.apache.log4j.ConsoleAppender log4j.appender.console.target=System.err log4j.appender.console.layout=org.apache.log4j.PatternLayout log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n # Set the default spark-shell log level to WARN. When running the spark-shell, the # log level for this class is used to overwrite the root logger's log level, so that # the user can have different defaults for the shell and regular Spark apps. log4j.logger.org.apache.spark.repl.Main=WARN # Settings to quiet third party logs that are too verbose log4j.logger.org.spark_project.jetty=WARN log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO log4j.logger.org.apache.parquet=ERROR log4j.logger.parquet=ERROR # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR #log4j.logger.org.apache.spark=OFF log4j.logger.org.apache.spark.metrics.MetricsSystem=WARN # Structured Streaming log4j.logger.org.apache.spark.sql.execution.streaming.StreamExecution=DEBUG log4j.logger.org.apache.spark.sql.execution.streaming.ProgressReporter=INFO log4j.logger.org.apache.spark.sql.execution.streaming.RateStreamSource=DEBUG log4j.logger.org.apache.spark.sql.kafka010.KafkaSource=DEBUG log4j.logger.org.apache.spark.sql.kafka010.KafkaOffsetReader=DEBUG log4j.logger.org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec=INFO {code} > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. 
For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147643#comment-16147643 ] Jacek Laskowski commented on SPARK-21728: - Thanks [~vanzin] for the prompt response! I'm stuck with the change as {{conf/log4j.properties}} has no effect on logging and given the change touched it I think it's the root cause (I might be mistaken, but looking for help to find it). The following worked fine two days ago (not sure about yesterday's build). Is {{conf/log4j.properties}} still the file for logging? {code} # Structured Streaming log4j.logger.org.apache.spark.sql.execution.streaming.StreamExecution=DEBUG log4j.logger.org.apache.spark.sql.execution.streaming.ProgressReporter=INFO log4j.logger.org.apache.spark.sql.execution.streaming.RateStreamSource=DEBUG log4j.logger.org.apache.spark.sql.kafka010.KafkaSource=DEBUG log4j.logger.org.apache.spark.sql.kafka010.KafkaOffsetReader=DEBUG {code} > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147086#comment-16147086 ] Jacek Laskowski commented on SPARK-21728: - Thanks [~sowen]. I'll label it as such when I know if it merits one (looks so, but waiting for a response from [~vanzin] or others who'd know). Sent out an email to the Spark user mailing list today. > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146780#comment-16146780 ] Jacek Laskowski commented on SPARK-21728: - I think the change is user-visible and therefore deserves to be included in the release notes for 2.3 (I remember a component or label to mark changes like that in a special way) /cc [~sowen] [~hyukjin.kwon] > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21765) Ensure all leaf nodes that are derived from streaming sources have isStreaming=true
[ https://issues.apache.org/jira/browse/SPARK-21765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16142921#comment-16142921 ] Jacek Laskowski commented on SPARK-21765: - BTW, *Assignee* field of the JIRA is empty, but should be [~joseph.torres] if I'm not mistaken. > Ensure all leaf nodes that are derived from streaming sources have > isStreaming=true > --- > > Key: SPARK-21765 > URL: https://issues.apache.org/jira/browse/SPARK-21765 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jose Torres > Fix For: 3.0.0 > > > LogicalPlan has an isStreaming bit, but it's incompletely implemented. Some > streaming sources don't set the bit, and the bit can sometimes be lost in > rewriting. Setting the bit for all plans that are logically streaming will > help us simplify the logic around checking query plan validity. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21765) Ensure all leaf nodes that are derived from streaming sources have isStreaming=true
[ https://issues.apache.org/jira/browse/SPARK-21765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16142919#comment-16142919 ] Jacek Laskowski commented on SPARK-21765: - I think {{TextSocketSource}} was missed in the change and does not enable {{isStreaming}} flag leading to the following error: {code} val counts = spark. readStream. format("rate"). load. groupBy(window($"timestamp", "5 seconds") as "group"). agg(count("value") as "value_count") import scala.concurrent.duration._ import org.apache.spark.sql.streaming.{OutputMode, Trigger} val sq = counts. writeStream. format("console"). option("truncate", false). trigger(Trigger.ProcessingTime(1.hour)). outputMode(OutputMode.Complete). start 17/08/26 21:16:20 ERROR StreamExecution: Query [id = 980bfeba-5433-49db-9d20-055c965b25f3, runId = b711948f-4def-4205-ac43-ec1e5b852fdb] terminated with error java.lang.AssertionError: assertion failed: DataFrame returned by getBatch from TextSocketSource[host: localhost, port: ] did not have isStreaming=true Project [_1#66 AS value#71] +- Project [_1#66] +- LocalRelation [_1#66, _2#67] at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$10.apply(StreamExecution.scala:641) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$10.apply(StreamExecution.scala:636) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2.apply(StreamExecution.scala:636) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2.apply(StreamExecution.scala:636) at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:274) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:60) at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:635) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(StreamExecution.scala:313) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:301) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:301) at 
org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:274) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:60) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:301) at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56) at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:297) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:213) {code} The change seems simple and {{TextSocketSource.getBatch}} should be as follows: {code} val rdd = sqlContext.sparkContext.parallelize(rawList).map(v => InternalRow(v)) val rawBatch = sqlContext.internalCreateDataFrame(rdd, schema, isStreaming = true) {code} > Ensure all leaf nodes that are derived from streaming sources have > isStreaming=true > --- > >
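One nit on the sketch above: {{InternalRow}} stores string columns as {{UTF8String}}, so the raw value likely needs converting first; a slightly expanded, still untested variant:
{code:scala}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.unsafe.types.UTF8String

val rdd = sqlContext.sparkContext
  .parallelize(rawList)
  .map(v => InternalRow(UTF8String.fromString(v)))
val rawBatch = sqlContext.internalCreateDataFrame(rdd, schema, isStreaming = true)
{code}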
[jira] [Commented] (SPARK-21667) ConsoleSink should not fail streaming query with checkpointLocation option
[ https://issues.apache.org/jira/browse/SPARK-21667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120576#comment-16120576 ] Jacek Laskowski commented on SPARK-21667: - Oh, what an offer! Couldn't have thought of a better one today :) Let me see how far my hope to fix it leads me. I'm on it. > ConsoleSink should not fail streaming query with checkpointLocation option > -- > > Key: SPARK-21667 > URL: https://issues.apache.org/jira/browse/SPARK-21667 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Minor > > As agreed on the Spark users mailing list in the thread "\[SS] Console sink > not supporting recovering from checkpoint location? Why?" in which > [~marmbrus] said: > {quote} > I think there is really no good reason for this limitation. > {quote} > Using {{ConsoleSink}} should therefore not fail a streaming query when used > with {{checkpointLocation}} option. > {code} > // today's build from the master > scala> spark.version > res8: String = 2.3.0-SNAPSHOT > scala> val q = records. > | writeStream. > | format("console"). > | option("truncate", false). > | option("checkpointLocation", "/tmp/checkpoint"). // <-- > checkpoint directory > | trigger(Trigger.ProcessingTime(10.seconds)). > | outputMode(OutputMode.Update). > | start > org.apache.spark.sql.AnalysisException: This query does not support > recovering from checkpoint location. Delete /tmp/checkpoint/offsets to start > over.; > at > org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:222) > at > org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278) > at > org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:284) > ... 61 elided > {code} > The "trigger" is SPARK-16116 and [this > line|https://github.com/apache/spark/pull/13817/files#diff-d35e8fce09686073f81de598ed657de7R277] > in particular. > This also relates to SPARK-19768 that was resolved as not a bug. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21667) ConsoleSink should not fail streaming query with checkpointLocation option
Jacek Laskowski created SPARK-21667: --- Summary: ConsoleSink should not fail streaming query with checkpointLocation option Key: SPARK-21667 URL: https://issues.apache.org/jira/browse/SPARK-21667 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Jacek Laskowski Priority: Minor As agreed on the Spark users mailing list in the thread "\[SS] Console sink not supporting recovering from checkpoint location? Why?" in which [~marmbrus] said: {quote} I think there is really no good reason for this limitation. {quote} Using {{ConsoleSink}} should therefore not fail a streaming query when used with {{checkpointLocation}} option. {code} // today's build from the master scala> spark.version res8: String = 2.3.0-SNAPSHOT scala> val q = records. | writeStream. | format("console"). | option("truncate", false). | option("checkpointLocation", "/tmp/checkpoint"). // <-- checkpoint directory | trigger(Trigger.ProcessingTime(10.seconds)). | outputMode(OutputMode.Update). | start org.apache.spark.sql.AnalysisException: This query does not support recovering from checkpoint location. Delete /tmp/checkpoint/offsets to start over.; at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:222) at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278) at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:284) ... 61 elided {code} The "trigger" is SPARK-16116 and [this line|https://github.com/apache/spark/pull/13817/files#diff-d35e8fce09686073f81de598ed657de7R277] in particular. This also relates to SPARK-19768 that was resolved as not a bug. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21546) dropDuplicates with watermark yields RuntimeException due to binding failure
[ https://issues.apache.org/jira/browse/SPARK-21546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-21546: Description: With today's master... The following streaming query with watermark and {{dropDuplicates}} yields {{RuntimeException}} due to failure in binding. {code} val topic1 = spark. readStream. format("kafka"). option("subscribe", "topic1"). option("kafka.bootstrap.servers", "localhost:9092"). option("startingoffsets", "earliest"). load val records = topic1. withColumn("eventtime", 'timestamp). // <-- just to put the right name given the purpose withWatermark(eventTime = "eventtime", delayThreshold = "30 seconds"). // <-- use the renamed eventtime column dropDuplicates("value"). // dropDuplicates will use watermark // only when eventTime column exists // include the watermark column => internal design leak? select('key cast "string", 'value cast "string", 'eventtime). as[(String, String, java.sql.Timestamp)] scala> records.explain == Physical Plan == *Project [cast(key#0 as string) AS key#169, cast(value#1 as string) AS value#170, eventtime#157-T3ms] +- StreamingDeduplicate [value#1], StatefulOperatorStateInfo(,93c3de98-3f85-41a4-8aef-d09caf8ea693,0,0), 0 +- Exchange hashpartitioning(value#1, 200) +- EventTimeWatermark eventtime#157: timestamp, interval 30 seconds +- *Project [key#0, value#1, timestamp#5 AS eventtime#157] +- StreamingRelation kafka, [key#0, value#1, topic#2, partition#3, offset#4L, timestamp#5, timestampType#6] import org.apache.spark.sql.streaming.{OutputMode, Trigger} val sq = records. writeStream. format("console"). option("truncate", false). trigger(Trigger.ProcessingTime("10 seconds")). queryName("from-kafka-topic1-to-console"). outputMode(OutputMode.Update). start {code} {code} --- Batch: 0 --- 17/07/27 10:28:58 ERROR Executor: Exception in task 3.0 in stage 13.0 (TID 438) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: eventtime#157-T3ms at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45) at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:977) at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:370) at org.apache.spark.sql.execution.streaming.StreamingDeduplicateExec.org$apache$spark$sql$execution$streaming$WatermarkSupport$$super$newPredicate(statefulOperators.scala:350) at org.apache.spark.sql.execution.streaming.WatermarkSupport$$anonfun$watermarkPredicateForKeys$1.apply(statefulOperators.scala:160) at org.apache.spark.sql.execution.streaming.WatermarkSupport$$anonfun$watermarkPredicateForKeys$1.apply(statefulOperators.scala:160) at scala.Option.map(Option.scala:146) at org.apache.spark.sql.execution.streaming.WatermarkSupport$class.watermarkPredicateForKeys(statefulOperators.scala:160) at org.apache.spark.sql.execution.streaming.StreamingDeduplicateExec.watermarkPredicateForKeys$lzycompute(statefulOperators.scala:350) at org.apache.spark.sql.exec
[jira] [Updated] (SPARK-21546) dropDuplicates with watermark yields RuntimeException due to binding failure
[ https://issues.apache.org/jira/browse/SPARK-21546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-21546: Summary: dropDuplicates with watermark yields RuntimeException due to binding failure (was: dropDuplicates followed by select yields RuntimeException due to binding failure) > dropDuplicates with watermark yields RuntimeException due to binding failure > > > Key: SPARK-21546 > URL: https://issues.apache.org/jira/browse/SPARK-21546 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski > > With today's master... > The following streaming query yields {{RuntimeException}} due to failure in > binding (most likely due to {{select}} operator). > {code} > val topic1 = spark. > readStream. > format("kafka"). > option("subscribe", "topic1"). > option("kafka.bootstrap.servers", "localhost:9092"). > option("startingoffsets", "earliest"). > load > val records = topic1. > withColumn("eventtime", 'timestamp). // <-- just to put the right name > given the purpose > withWatermark(eventTime = "eventtime", delayThreshold = "30 seconds"). // > <-- use the renamed eventtime column > dropDuplicates("value"). // dropDuplicates will use watermark > // only when eventTime column exists > // include the watermark column => internal design leak? > select('key cast "string", 'value cast "string", 'eventtime). > as[(String, String, java.sql.Timestamp)] > scala> records.explain > == Physical Plan == > *Project [cast(key#0 as string) AS key#169, cast(value#1 as string) AS > value#170, eventtime#157-T3ms] > +- StreamingDeduplicate [value#1], > StatefulOperatorStateInfo(,93c3de98-3f85-41a4-8aef-d09caf8ea693,0,0), > 0 >+- Exchange hashpartitioning(value#1, 200) > +- EventTimeWatermark eventtime#157: timestamp, interval 30 seconds > +- *Project [key#0, value#1, timestamp#5 AS eventtime#157] > +- StreamingRelation kafka, [key#0, value#1, topic#2, > partition#3, offset#4L, timestamp#5, timestampType#6] > import org.apache.spark.sql.streaming.{OutputMode, Trigger} > val sq = records. > writeStream. > format("console"). > option("truncate", false). > trigger(Trigger.ProcessingTime("10 seconds")). > queryName("from-kafka-topic1-to-console"). > outputMode(OutputMode.Update). 
> start > {code} > {code} > --- > Batch: 0 > --- > 17/07/27 10:28:58 ERROR Executor: Exception in task 3.0 in stage 13.0 (TID > 438) > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute, tree: eventtime#157-T3ms > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:977) > at > org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:370) > at > org.apache.spark.sql.execution.streaming.StreamingDeduplicateExec.org$ap
[jira] [Created] (SPARK-21546) dropDuplicates followed by select yields RuntimeException due to binding failure
Jacek Laskowski created SPARK-21546: --- Summary: dropDuplicates followed by select yields RuntimeException due to binding failure Key: SPARK-21546 URL: https://issues.apache.org/jira/browse/SPARK-21546 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Jacek Laskowski With today's master... The following streaming query yields {{RuntimeException}} due to failure in binding (most likely due to {{select}} operator). {code} val topic1 = spark. readStream. format("kafka"). option("subscribe", "topic1"). option("kafka.bootstrap.servers", "localhost:9092"). option("startingoffsets", "earliest"). load val records = topic1. withColumn("eventtime", 'timestamp). // <-- just to put the right name given the purpose withWatermark(eventTime = "eventtime", delayThreshold = "30 seconds"). // <-- use the renamed eventtime column dropDuplicates("value"). // dropDuplicates will use watermark // only when eventTime column exists // include the watermark column => internal design leak? select('key cast "string", 'value cast "string", 'eventtime). as[(String, String, java.sql.Timestamp)] scala> records.explain == Physical Plan == *Project [cast(key#0 as string) AS key#169, cast(value#1 as string) AS value#170, eventtime#157-T3ms] +- StreamingDeduplicate [value#1], StatefulOperatorStateInfo(,93c3de98-3f85-41a4-8aef-d09caf8ea693,0,0), 0 +- Exchange hashpartitioning(value#1, 200) +- EventTimeWatermark eventtime#157: timestamp, interval 30 seconds +- *Project [key#0, value#1, timestamp#5 AS eventtime#157] +- StreamingRelation kafka, [key#0, value#1, topic#2, partition#3, offset#4L, timestamp#5, timestampType#6] import org.apache.spark.sql.streaming.{OutputMode, Trigger} val sq = records. writeStream. format("console"). option("truncate", false). trigger(Trigger.ProcessingTime("10 seconds")). queryName("from-kafka-topic1-to-console"). outputMode(OutputMode.Update). 
start {code} {code} --- Batch: 0 --- 17/07/27 10:28:58 ERROR Executor: Exception in task 3.0 in stage 13.0 (TID 438) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: eventtime#157-T3ms at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:977) at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:370) at org.apache.spark.sql.execution.streaming.StreamingDeduplicateExec.org$apache$spark$sql$execution$streaming$WatermarkSupport$$super$newPredicate(statefulOperators.scala:350) at org.apache.spark.sql.execution.streaming.WatermarkSupport$$anonfun$watermarkPredicateForKeys$1.apply(statefulOperators.scala:160) at org.apache.spark.sql.execution.streaming.WatermarkSupport$$anonfun$watermarkPredicateForKeys$1.apply(statefulOperators.scala:160) at scala.Option.map(Option.scala:146) at org.apache.spark.sql.execution.streaming.WatermarkSupport$class.w
[jira] [Created] (SPARK-21429) show on structured Dataset is equivalent to writeStream to console once
Jacek Laskowski created SPARK-21429: --- Summary: show on structured Dataset is equivalent to writeStream to console once Key: SPARK-21429 URL: https://issues.apache.org/jira/browse/SPARK-21429 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.2.0 Reporter: Jacek Laskowski Priority: Minor While working with Datasets it's often helpful to do {{show}}. It does not work for streaming Datasets (and leads to {{AnalysisException}} - see below), but think it could just be the following under the covers and very helpful (would cut plenty of keystrokes for sure). {code} val sq = ... scala> sq.isStreaming res0: Boolean = true import org.apache.spark.sql.streaming.Trigger scala> sq.writeStream.format("console").trigger(Trigger.Once).start {code} Since {{show}} returns {{Unit}} that could just work. Currently {{show}} reports {{AnalysisException}}. {code} scala> sq.show org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();; rate at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297) at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36) at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126) at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:34) at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:63) at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:74) at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78) at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84) at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80) at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89) at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3027) at org.apache.spark.sql.Dataset.head(Dataset.scala:2340) at org.apache.spark.sql.Dataset.take(Dataset.scala:2553) at org.apache.spark.sql.Dataset.showString(Dataset.scala:241) at org.apache.spark.sql.Dataset.show(Dataset.scala:671) at org.apache.spark.sql.Dataset.show(Dataset.scala:630) at 
org.apache.spark.sql.Dataset.show(Dataset.scala:639) ... 50 elided {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
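For illustration, a minimal sketch of how the proposed behaviour could be layered on top of the existing API as a user-side extension method (the name {{showStream}} is hypothetical, not part of Spark):

{code:scala}
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.streaming.Trigger

// Hypothetical helper, not Spark's API: run a single Trigger.Once batch
// to the console sink and wait for it, mimicking show for streaming Datasets.
implicit class StreamingShowOps[T](ds: Dataset[T]) {
  def showStream(truncate: Boolean = true): Unit = {
    val query = ds.writeStream
      .format("console")
      .option("truncate", truncate.toString)
      .trigger(Trigger.Once)
      .start()
    query.awaitTermination()
  }
}
{code}

With something like this in scope, {{sq.showStream()}} would do what the issue asks {{sq.show}} to do under the covers.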
[jira] [Created] (SPARK-21427) Describe mapGroupsWithState and flatMapGroupsWithState for stateful aggregation in Structured Streaming
Jacek Laskowski created SPARK-21427: --- Summary: Describe mapGroupsWithState and flatMapGroupsWithState for stateful aggregation in Structured Streaming Key: SPARK-21427 URL: https://issues.apache.org/jira/browse/SPARK-21427 Project: Spark Issue Type: Improvement Components: Documentation, Structured Streaming Affects Versions: 2.2.1 Reporter: Jacek Laskowski # Rename "Arbitrary Stateful Operations" to "Stateful Aggregations" (see https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/structured-streaming-programming-guide.html#arbitrary-stateful-operations for the latest version) # Describe two operators {{mapGroupsWithState}} and {{flatMapGroupsWithState}} with examples. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
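For reference, a minimal example of the kind of snippet the guide could include for {{mapGroupsWithState}}; the {{Event}} type and the rate-source pipeline below are made up purely for illustration:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

val spark = SparkSession.builder.appName("stateful-demo").getOrCreate()
import spark.implicits._

// Illustrative event type: one record per key.
case class Event(key: String)

// Running count per key, kept in GroupState across micro-batches.
def countEvents(key: String, events: Iterator[Event], state: GroupState[Long]): (String, Long) = {
  val newCount = state.getOption.getOrElse(0L) + events.size
  state.update(newCount)
  (key, newCount)
}

val events = spark.readStream
  .format("rate")
  .load()
  .select(('value % 10).cast("string").as("key"))
  .as[Event]

val counts = events
  .groupByKey(_.key)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout)(countEvents)

counts.writeStream
  .format("console")
  .outputMode(OutputMode.Update())
  .start()
  .awaitTermination()
{code}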
[jira] [Created] (SPARK-21329) Make EventTimeWatermarkExec explicitly UnaryExecNode
Jacek Laskowski created SPARK-21329: --- Summary: Make EventTimeWatermarkExec explicitly UnaryExecNode Key: SPARK-21329 URL: https://issues.apache.org/jira/browse/SPARK-21329 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.1.1 Reporter: Jacek Laskowski Priority: Trivial Make {{EventTimeWatermarkExec}} explicitly {{UnaryExecNode}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21313) ConsoleSink's string representation
Jacek Laskowski created SPARK-21313: --- Summary: ConsoleSink's string representation Key: SPARK-21313 URL: https://issues.apache.org/jira/browse/SPARK-21313 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.1.1 Reporter: Jacek Laskowski Priority: Trivial Add {{toString}} with options for {{ConsoleSink}} so it shows nicely in query progress. *BEFORE* {code} "sink" : { "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@4b340441" } {code} *AFTER* {code} "sink" : { "description" : "ConsoleSink[numRows=10, truncate=false]" } {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
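A self-contained sketch of the idea (this is not Spark's actual {{ConsoleSink}}; it only shows deriving the two values from the sink options and exposing them via {{toString}}):

{code:scala}
// Toy stand-in for ConsoleSink, just to illustrate the proposed toString.
class ConsoleSinkLike(options: Map[String, String]) {
  private val numRows = options.getOrElse("numRows", "20").toInt
  private val truncate = options.getOrElse("truncate", "true").toBoolean

  override def toString: String = s"ConsoleSink[numRows=$numRows, truncate=$truncate]"
}

// new ConsoleSinkLike(Map("numRows" -> "10", "truncate" -> "false")).toString
// res: ConsoleSink[numRows=10, truncate=false]
{code}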
[jira] [Created] (SPARK-21285) VectorAssembler should report the column name when data type used is not supported
Jacek Laskowski created SPARK-21285: --- Summary: VectorAssembler should report the column name when data type used is not supported Key: SPARK-21285 URL: https://issues.apache.org/jira/browse/SPARK-21285 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 2.1.1 Reporter: Jacek Laskowski Priority: Minor Found while answering [Why does LogisticRegression fail with “IllegalArgumentException: org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7”?|https://stackoverflow.com/q/44844793/1305344] on StackOverflow. When {{VectorAssembler}} is configured to use columns of an unsupported type, only the type is printed out without the column name(s). The column name(s) should be included too. {code} // label is of StringType type val va = new VectorAssembler().setInputCols(Array("bc", "pmi", "label")) scala> va.transform(training) java.lang.IllegalArgumentException: Data type StringType is not supported. at org.apache.spark.ml.feature.VectorAssembler$$anonfun$transformSchema$1.apply(VectorAssembler.scala:121) at org.apache.spark.ml.feature.VectorAssembler$$anonfun$transformSchema$1.apply(VectorAssembler.scala:117) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:117) at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74) at org.apache.spark.ml.feature.VectorAssembler.transform(VectorAssembler.scala:54) ... 48 elided {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
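A hedged sketch (not the actual {{VectorAssembler}} code) of a schema check that reports the offending column name together with its type:

{code:scala}
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.{BooleanType, NumericType, StructType}

// Sketch only: validate assembler inputs and name the offending column.
def validateInputCols(schema: StructType, inputCols: Array[String]): Unit =
  inputCols.foreach { c =>
    schema(c).dataType match {
      case _: NumericType => // supported
      case BooleanType => // supported
      case dt if dt == SQLDataTypes.VectorType => // supported
      case other =>
        throw new IllegalArgumentException(
          s"Data type $other of column $c is not supported.")
    }
  }
{code}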
[jira] [Commented] (SPARK-20597) KafkaSourceProvider falls back on path as synonym for topic
[ https://issues.apache.org/jira/browse/SPARK-20597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070865#comment-16070865 ] Jacek Laskowski commented on SPARK-20597: - Go for it, [~Satyajit]! > KafkaSourceProvider falls back on path as synonym for topic > --- > > Key: SPARK-20597 > URL: https://issues.apache.org/jira/browse/SPARK-20597 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Trivial > Labels: starter > > # {{KafkaSourceProvider}} supports {{topic}} option that sets the Kafka topic > to save a DataFrame's rows to > # {{KafkaSourceProvider}} can use {{topic}} column to assign rows to Kafka > topics for writing > What seems a quite interesting option is to support {{start(path: String)}} > as the least precedence option in which {{path}} would designate the default > topic when no other options are used. > {code} > df.writeStream.format("kafka").start("topic") > {code} > See > http://apache-spark-developers-list.1001551.n3.nabble.com/KafkaSourceProvider-Why-topic-option-and-column-without-reverting-to-path-as-the-least-priority-td21458.html > for discussion -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
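A rough sketch of the proposed precedence (names are illustrative, not {{KafkaSourceProvider}}'s actual internals): a per-row {{topic}} column would win, then the {{topic}} option, and only then the {{path}} argument.

{code:scala}
// Illustrative only: resolve the default topic from the "topic" option,
// falling back on the path given to start(path).
def resolveDefaultTopic(options: Map[String, String], path: Option[String]): Option[String] =
  options.get("topic").orElse(path)

// resolveDefaultTopic(Map.empty, Some("topic1"))      // Some(topic1)
// resolveDefaultTopic(Map("topic" -> "t"), Some("x")) // Some(t)
{code}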
[jira] [Commented] (SPARK-20997) spark-submit's --driver-cores marked as "YARN-only" but listed under "Spark standalone with cluster deploy mode only"
[ https://issues.apache.org/jira/browse/SPARK-20997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040594#comment-16040594 ] Jacek Laskowski commented on SPARK-20997: - Go ahead! Thanks [~guoxiaolongzte]! > spark-submit's --driver-cores marked as "YARN-only" but listed under "Spark > standalone with cluster deploy mode only" > - > > Key: SPARK-20997 > URL: https://issues.apache.org/jira/browse/SPARK-20997 > Project: Spark > Issue Type: Bug > Components: Documentation, Spark Submit >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > Just noticed that {{spark-submit}} describes {{--driver-cores}} under: > * Spark standalone with cluster deploy mode only > * YARN-only > While I can understand "only" in "Spark standalone with cluster deploy mode > only" to refer to cluster deploy mode (not the default client mode), but > YARN-only baffles me which I think deserves a fix. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20997) spark-submit's --driver-cores marked as "YARN-only" but listed under "Spark standalone with cluster deploy mode only"
Jacek Laskowski created SPARK-20997: --- Summary: spark-submit's --driver-cores marked as "YARN-only" but listed under "Spark standalone with cluster deploy mode only" Key: SPARK-20997 URL: https://issues.apache.org/jira/browse/SPARK-20997 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.3.0 Reporter: Jacek Laskowski Priority: Minor Just noticed that {{spark-submit}} describes {{--driver-cores}} under: * Spark standalone with cluster deploy mode only * YARN-only While I can understand "only" in "Spark standalone with cluster deploy mode only" to refer to the cluster deploy mode (not the default client mode), "YARN-only" baffles me, which I think deserves a fix. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20782) Dataset's isCached operator
[ https://issues.apache.org/jira/browse/SPARK-20782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035316#comment-16035316 ] Jacek Laskowski commented on SPARK-20782: - Just stumbled upon {{CatalogImpl.isCached}} that could also be used to implement this feature. > Dataset's isCached operator > --- > > Key: SPARK-20782 > URL: https://issues.apache.org/jira/browse/SPARK-20782 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > It'd be very convenient to have {{isCached}} operator that would say whether > a query is cached in-memory or not. > It'd be as simple as the following snippet: > {code} > // val q2: DataFrame > spark.sharedState.cacheManager.lookupCachedData(q2.queryExecution.logical).isDefined > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20937) Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, DataFrames and Datasets Guide
[ https://issues.apache.org/jira/browse/SPARK-20937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-20937: Description: As a follow-up to SPARK-20297 (and SPARK-10400) in which {{spark.sql.parquet.writeLegacyFormat}} property was recommended for Impala and Hive, Spark SQL docs for [Parquet Files|https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration] should have it documented. p.s. It was asked about in [Why can't Impala read parquet files after Spark SQL's write?|https://stackoverflow.com/q/44279870/1305344] on StackOverflow today. p.s. It's also covered in [~holden.ka...@gmail.com]'s "High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark" book (in Table 3-10. Parquet data source options) that gives the option some wider publicity. was: As a follow-up to SPARK-20297 (and SPARK-10400) in which {{spark.sql.parquet.writeLegacyFormat}} property was recommended for Impala and Hive, Spark SQL docs for [Parquet Files|https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration] should have it documented. p.s. It's also covered in [~holden.ka...@gmail.com]'s "High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark" book (in Table 3-10. Parquet data source options) that gives the option some wider publicity. > Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, > DataFrames and Datasets Guide > - > > Key: SPARK-20937 > URL: https://issues.apache.org/jira/browse/SPARK-20937 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > As a follow-up to SPARK-20297 (and SPARK-10400) in which > {{spark.sql.parquet.writeLegacyFormat}} property was recommended for Impala > and Hive, Spark SQL docs for [Parquet > Files|https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration] > should have it documented. > p.s. It was asked about in [Why can't Impala read parquet files after Spark > SQL's write?|https://stackoverflow.com/q/44279870/1305344] on StackOverflow > today. > p.s. It's also covered in [~holden.ka...@gmail.com]'s "High Performance > Spark: Best Practices for Scaling and Optimizing Apache Spark" book (in Table > 3-10. Parquet data source options) that gives the option some wider publicity. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20937) Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, DataFrames and Datasets Guide
Jacek Laskowski created SPARK-20937: --- Summary: Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, DataFrames and Datasets Guide Key: SPARK-20937 URL: https://issues.apache.org/jira/browse/SPARK-20937 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 2.3.0 Reporter: Jacek Laskowski Priority: Trivial As a follow-up to SPARK-20297 (and SPARK-10400) in which {{spark.sql.parquet.writeLegacyFormat}} property was recommended for Impala and Hive, Spark SQL docs for [Parquet Files|https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration] should have it documented. p.s. It's also covered in [~holden.ka...@gmail.com]'s "High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark" book (in Table 3-10. Parquet data source options) that gives the option some wider publicity. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
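For the docs, an example along these lines could show the property in action; the path is made up and the decimal column is only there to make the legacy layout relevant:

{code:scala}
import org.apache.spark.sql.functions.col

// Illustrative snippet: write Parquet in the legacy format so that
// older Impala/Hive readers can consume it.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

spark.range(5)
  .withColumn("price", col("id").cast("decimal(10,2)"))
  .write
  .parquet("/tmp/parquet-legacy-format")
{code}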
[jira] [Resolved] (SPARK-20865) caching dataset throws "Queries with streaming sources must be executed with writeStream.start()"
[ https://issues.apache.org/jira/browse/SPARK-20865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski resolved SPARK-20865. - Resolution: Won't Fix Fix Version/s: 2.3.0 2.2.0 {{cache}} is not allowed due to its eager execution. > caching dataset throws "Queries with streaming sources must be executed with > writeStream.start()" > - > > Key: SPARK-20865 > URL: https://issues.apache.org/jira/browse/SPARK-20865 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 2.0.2, 2.1.0, 2.1.1 >Reporter: Martin Brišiak > Fix For: 2.2.0, 2.3.0 > > > {code} > SparkSession > .builder > .master("local[*]") > .config("spark.sql.warehouse.dir", "C:/tmp/spark") > .config("spark.sql.streaming.checkpointLocation", > "C:/tmp/spark/spark-checkpoint") > .appName("my-test") > .getOrCreate > .readStream > .schema(schema) > .json("src/test/data") > .cache > .writeStream > .start > .awaitTermination > {code} > While executing this sample in spark got error. Without the .cache option it > worked as intended but with .cache option i got: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries > with streaming sources must be executed with writeStream.start();; > FileSource[src/test/data] at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:196) > at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35) > at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:33) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128) at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:33) > at > org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:58) > at > org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:69) > at > org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:67) > at > org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:73) > at > org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:73) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:79) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:84) > at > org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:102) > at > org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:65) > at > org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:89) > at org.apache.spark.sql.Dataset.persist(Dataset.scala:2479) at > org.apache.spark.sql.Dataset.cache(Dataset.scala:2489) at > org.me.App$.main(App.scala:23) at org.me.App.main(App.scala) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20927) Add cache operator to Unsupported Operations in Structured Streaming
Jacek Laskowski created SPARK-20927: --- Summary: Add cache operator to Unsupported Operations in Structured Streaming Key: SPARK-20927 URL: https://issues.apache.org/jira/browse/SPARK-20927 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Jacek Laskowski Priority: Trivial Just [found out|https://stackoverflow.com/questions/42062092/why-does-using-cache-on-streaming-datasets-fail-with-analysisexception-queries] that {{cache}} is not allowed on streaming datasets. {{cache}} on streaming datasets leads to the following exception: {code} scala> spark.readStream.text("files").cache org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();; FileSource[files] at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297) at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36) at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:34) at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:63) at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:74) at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78) at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84) at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80) at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89) at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89) at org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:104) at org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:68) at org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:92) at org.apache.spark.sql.Dataset.persist(Dataset.scala:2603) at org.apache.spark.sql.Dataset.cache(Dataset.scala:2613) ... 48 elided {code} It should be included in Structured Streaming's [Unsupported Operations|http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations]. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
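Until the docs call this out, a small defensive helper (user-side, using only the public {{isStreaming}} flag) avoids tripping over the exception:

{code:scala}
import org.apache.spark.sql.Dataset

// Only cache batch Datasets; caching a streaming Dataset throws AnalysisException.
def cacheIfBatch[T](ds: Dataset[T]): Dataset[T] =
  if (ds.isStreaming) ds else ds.cache()
{code}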
[jira] [Commented] (SPARK-20912) map function with columns as strings
[ https://issues.apache.org/jira/browse/SPARK-20912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16028285#comment-16028285 ] Jacek Laskowski commented on SPARK-20912: - I can and I did, but the point is that it's not consistent with the other functions like {{struct}} and {{array}} and often gets in the way. > map function with columns as strings > > > Key: SPARK-20912 > URL: https://issues.apache.org/jira/browse/SPARK-20912 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > There's only {{map}} function that accepts {{Column}} values only. It'd be > very helpful to have a variant that accepted {{String}} for columns like > {{array}} or {{struct}}. > {code} > scala> val kvs = Seq(("key", "value")).toDF("k", "v") > kvs: org.apache.spark.sql.DataFrame = [k: string, v: string] > scala> kvs.printSchema > root > |-- k: string (nullable = true) > |-- v: string (nullable = true) > scala> kvs.withColumn("map", map("k", "v")).show > :26: error: type mismatch; > found : String("k") > required: org.apache.spark.sql.Column >kvs.withColumn("map", map("k", "v")).show > ^ > :26: error: type mismatch; > found : String("v") > required: org.apache.spark.sql.Column >kvs.withColumn("map", map("k", "v")).show > ^ > // note $ to create Columns per string > // not very dev-friendly > scala> kvs.withColumn("map", map($"k", $"v")).show > +---+-+-+ > | k|v| map| > +---+-+-+ > |key|value|Map(key -> value)| > +---+-+-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20912) map function with columns as strings
[ https://issues.apache.org/jira/browse/SPARK-20912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16028208#comment-16028208 ] Jacek Laskowski commented on SPARK-20912: - Nope as it would create a map of the two literals but I'd like to create a map from the values in {{k}} and {{v}} columns. > map function with columns as strings > > > Key: SPARK-20912 > URL: https://issues.apache.org/jira/browse/SPARK-20912 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > There's only {{map}} function that accepts {{Column}} values only. It'd be > very helpful to have a variant that accepted {{String}} for columns like > {{array}} or {{struct}}. > {code} > scala> val kvs = Seq(("key", "value")).toDF("k", "v") > kvs: org.apache.spark.sql.DataFrame = [k: string, v: string] > scala> kvs.printSchema > root > |-- k: string (nullable = true) > |-- v: string (nullable = true) > scala> kvs.withColumn("map", map("k", "v")).show > :26: error: type mismatch; > found : String("k") > required: org.apache.spark.sql.Column >kvs.withColumn("map", map("k", "v")).show > ^ > :26: error: type mismatch; > found : String("v") > required: org.apache.spark.sql.Column >kvs.withColumn("map", map("k", "v")).show > ^ > // note $ to create Columns per string > // not very dev-friendly > scala> kvs.withColumn("map", map($"k", $"v")).show > +---+-+-+ > | k|v| map| > +---+-+-+ > |key|value|Map(key -> value)| > +---+-+-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20912) map function with columns as strings
Jacek Laskowski created SPARK-20912: --- Summary: map function with columns as strings Key: SPARK-20912 URL: https://issues.apache.org/jira/browse/SPARK-20912 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Jacek Laskowski Priority: Trivial There's only {{map}} function that accepts {{Column}} values only. It'd be very helpful to have a variant that accepted {{String}} for columns like {{array}} or {{struct}}. {code} scala> val kvs = Seq(("key", "value")).toDF("k", "v") kvs: org.apache.spark.sql.DataFrame = [k: string, v: string] scala> kvs.printSchema root |-- k: string (nullable = true) |-- v: string (nullable = true) scala> kvs.withColumn("map", map("k", "v")).show :26: error: type mismatch; found : String("k") required: org.apache.spark.sql.Column kvs.withColumn("map", map("k", "v")).show ^ :26: error: type mismatch; found : String("v") required: org.apache.spark.sql.Column kvs.withColumn("map", map("k", "v")).show ^ // note $ to create Columns per string // not very dev-friendly scala> kvs.withColumn("map", map($"k", $"v")).show +---+-+-+ | k|v| map| +---+-+-+ |key|value|Map(key -> value)| +---+-+-+ {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
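Until such a variant exists, a user-side stand-in (hypothetical name) can resolve the names and delegate to the existing {{map}} function:

{code:scala}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, map}

// Hypothetical helper, not Spark's API: accept column names and delegate
// to functions.map, which takes alternating key/value Columns.
def mapFromNames(colNames: String*): Column =
  map(colNames.map(col): _*)

// kvs.withColumn("map", mapFromNames("k", "v")).show
{code}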
[jira] [Created] (SPARK-20782) Dataset's isCached operator
Jacek Laskowski created SPARK-20782: --- Summary: Dataset's isCached operator Key: SPARK-20782 URL: https://issues.apache.org/jira/browse/SPARK-20782 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Jacek Laskowski Priority: Trivial It'd be very convenient to have {{isCached}} operator that would say whether a query is cached in-memory or not. It'd be as simple as the following snippet: {code} // val q2: DataFrame spark.sharedState.cacheManager.lookupCachedData(q2.queryExecution.logical).isDefined {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
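The snippet above, packaged as a (hypothetical) extension method; note it leans on the internal {{CacheManager}}, so this is only a sketch of the convenience being asked for:

{code:scala}
import org.apache.spark.sql.Dataset

// Hypothetical extension, not Spark's API: true if the query's logical plan
// has an entry in the session's CacheManager.
implicit class CacheStatusOps[T](ds: Dataset[T]) {
  def isCached: Boolean =
    ds.sparkSession.sharedState.cacheManager
      .lookupCachedData(ds.queryExecution.logical)
      .isDefined
}

// q2.isCached
{code}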
[jira] [Updated] (SPARK-4570) Add broadcast join to left semi join
[ https://issues.apache.org/jira/browse/SPARK-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-4570: --- Description: For now, spark use broadcast join instead of hash join to optimize {{inner join}} when the size of one side data did not reach the {{AUTO_BROADCASTJOIN_THRESHOLD}} However, Spark SQL will perform shuffle operations on each child relations while executing {{left semi join}} is more suitable for optimization with broadcast join. We are planning to create a {{BroadcastLeftSemiJoinHash}} to implement the broadcast join for {{left semi join}}. was: For now, spark use broadcast join instead of hash join to optimize {{inner join}} when the size of one side data did not reach the {{AUTO_BROADCASTJOIN_THRESHOLD}} However, Spark SQL will perform shuffle operations on each child relations while executing {{left semi join}} is more suitable for optimization with broadcast join. We are planning to create a{{BroadcastLeftSemiJoinHash}} to implement the broadcast join for {{left semi join}} > Add broadcast join to left semi join > - > > Key: SPARK-4570 > URL: https://issues.apache.org/jira/browse/SPARK-4570 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: XiaoJing wang >Assignee: XiaoJing wang >Priority: Minor > Fix For: 1.3.0 > > > For now, spark use broadcast join instead of hash join to optimize {{inner > join}} when the size of one side data did not reach the > {{AUTO_BROADCASTJOIN_THRESHOLD}} > However, Spark SQL will perform shuffle operations on each child relations > while executing {{left semi join}} is more suitable for optimization with > broadcast join. > We are planning to create a {{BroadcastLeftSemiJoinHash}} to implement the > broadcast join for {{left semi join}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
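For context, the shape of query this optimization targets, expressed with the current DataFrame API (the original work predates the {{broadcast}} hint, so this is purely illustrative):

{code:scala}
import org.apache.spark.sql.functions.broadcast

// Left semi join whose right side is small enough to broadcast.
val largeDF = spark.range(1000000L).toDF("id")
val smallDF = spark.range(10L).toDF("id")

val matched = largeDF.join(broadcast(smallDF), Seq("id"), "leftsemi")
// matched.explain  // should show a broadcast-based left semi join
{code}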
[jira] [Updated] (SPARK-20600) KafkaRelation should be pretty printed in web UI (Details for Query)
[ https://issues.apache.org/jira/browse/SPARK-20600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-20600: Description: Executing the following batch query gives the default stringified *internal JVM representation* of {{KafkaRelation}} in web UI (under Details for Query), i.e. http://localhost:4040/SQL/execution/?id=3 (<-- change the {{id}}). See the attachment. {code} spark. read. format("kafka"). option("subscribe", "topic1"). option("kafka.bootstrap.servers", "localhost:9092"). load. show {code} was: Executing the following batch query gives the default stringified/internal name of {{KafkaRelation}} in web UI (under Details for Query), i.e. http://localhost:4040/SQL/execution/?id=3 (<-- change the {{id}}). See the attachment. {code} spark. read. format("kafka"). option("subscribe", "topic1"). option("kafka.bootstrap.servers", "localhost:9092"). load. select('value cast "string"). write. csv("fromkafka.csv") {code} > KafkaRelation should be pretty printed in web UI (Details for Query) > > > Key: SPARK-20600 > URL: https://issues.apache.org/jira/browse/SPARK-20600 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Assignee: Jacek Laskowski >Priority: Trivial > Fix For: 2.2.0 > > Attachments: kafka-source-scan-webui.png > > > Executing the following batch query gives the default stringified *internal > JVM representation* of {{KafkaRelation}} in web UI (under Details for Query), > i.e. http://localhost:4040/SQL/execution/?id=3 (<-- change the {{id}}). See > the attachment. > {code} > spark. > read. > format("kafka"). > option("subscribe", "topic1"). > option("kafka.bootstrap.servers", "localhost:9092"). > load. > show > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20691) Difference between Storage Memory as seen internally and in web UI
Jacek Laskowski created SPARK-20691: --- Summary: Difference between Storage Memory as seen internally and in web UI Key: SPARK-20691 URL: https://issues.apache.org/jira/browse/SPARK-20691 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.3.0 Reporter: Jacek Laskowski I set Major priority as it's visible to a user. There's a difference in what the size of Storage Memory is managed internally and displayed to a user in web UI. I found it while answering [How does web UI calculate Storage Memory (in Executors tab)?|http://stackoverflow.com/q/43801062/1305344] on StackOverflow. In short (quoting the main parts), when you start a Spark app (say spark-shell) you see 912.3 MB RAM for Storage Memory: {code} $ ./bin/spark-shell --conf spark.driver.memory=2g ... 17/05/07 15:20:50 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.8:57177 with 912.3 MB RAM, BlockManagerId(driver, 192.168.1.8, 57177, None) {code} but in the web UI you'll see 956.6 MB due to the way the custom JavaScript function {{formatBytes}} in [utils.js|https://github.com/apache/spark/blob/master/core/src/main/resources/org/apache/spark/ui/static/utils.js#L40-L48] calculates the value. That translates to the following Scala code: {code} def formatBytes(bytes: Double) = { val k = 1000 val i = math.floor(math.log(bytes) / math.log(k)) val maxMemoryWebUI = bytes / math.pow(k, i) f"$maxMemoryWebUI%1.1f" } scala> println(formatBytes(maxMemory)) 956.6 {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
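The Scala translation above references a {{maxMemory}} value that is not defined in the snippet; a self-contained version (assuming the logged 912.3 MB is in binary megabytes) reproduces the two numbers:

{code:scala}
// Self-contained version of the snippet above.
def formatBytes(bytes: Double): String = {
  val k = 1000 // the web UI's JavaScript helper uses decimal units
  val i = math.floor(math.log(bytes) / math.log(k))
  val value = bytes / math.pow(k, i)
  f"$value%1.1f"
}

// Assumption: the 912.3 MB reported by BlockManagerMasterEndpoint is 912.3 MiB.
val maxMemory = 912.3 * 1024 * 1024

println(formatBytes(maxMemory)) // 956.6 -- the value shown in the Executors tab
{code}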
[jira] [Commented] (SPARK-20630) Thread Dump link available in Executors tab irrespective of spark.ui.threadDumpsEnabled
[ https://issues.apache.org/jira/browse/SPARK-20630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001259#comment-16001259 ] Jacek Laskowski commented on SPARK-20630: - Go [~ajbozarth], go! > Thread Dump link available in Executors tab irrespective of > spark.ui.threadDumpsEnabled > --- > > Key: SPARK-20630 > URL: https://issues.apache.org/jira/browse/SPARK-20630 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Minor > Attachments: spark-webui-executors-threadDump.png > > > Irrespective of {{spark.ui.threadDumpsEnabled}} property web UI's Executors > page displays *Thread Dump* column with an active link (that does nothing > though). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org