[jira] [Updated] (SPARK-43947) Incorrect SparkException when missing config in resources in Stage-Level Scheduling
[ https://issues.apache.org/jira/browse/SPARK-43947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-43947: Summary: Incorrect SparkException when missing config in resources in Stage-Level Scheduling (was: Incorrect SparkException when missing amount in resources in Stage-Level Scheduling) > Incorrect SparkException when missing config in resources in Stage-Level > Scheduling > --- > > Key: SPARK-43947 > URL: https://issues.apache.org/jira/browse/SPARK-43947 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 3.4.0 >Reporter: Jacek Laskowski >Priority: Minor > > [ResourceUtils.listResourceIds|https://github.com/apache/spark/blob/807abf9c53ee8c1c7ef69646ebd8a266f60d5580/core/src/main/scala/org/apache/spark/resource/ResourceUtils.scala#L152-L155] > can throw an exception for any missing config, not just `amount`. > {code:scala} > val index = key.indexOf('.') > if (index < 0) { > throw new SparkException(s"You must specify an amount config for > resource: $key " + > s"config: $componentName.$RESOURCE_PREFIX.$key") > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
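A minimal sketch may help illustrate why the message is misleading. The snippet below only mirrors the code quoted above (componentName and RESOURCE_PREFIX are placeholder values taken from that quote, not the full Spark implementation): any resource config key without a '.' suffix, for example spark.executor.resource.gpu=1 with no .amount, lands in this branch, yet the error always talks about an "amount" config.

{code:scala}
import org.apache.spark.SparkException

// Illustrative placeholders mirroring the quoted snippet (not the real ResourceUtils).
val componentName = "spark.executor"
val RESOURCE_PREFIX = "resource"

def resourceName(key: String): String = {
  val index = key.indexOf('.')
  if (index < 0) {
    // Reached for *any* key that lacks a sub-config, not only a missing "amount".
    throw new SparkException(s"You must specify an amount config for resource: $key " +
      s"config: $componentName.$RESOURCE_PREFIX.$key")
  }
  key.substring(0, index)
}
{code}

A more accurate message would name the actual missing sub-config (or say the key is malformed) instead of always pointing at "amount".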
[jira] [Created] (SPARK-43947) Incorrect SparkException when missing amount in resources in Stage-Level Scheduling
Jacek Laskowski created SPARK-43947: --- Summary: Incorrect SparkException when missing amount in resources in Stage-Level Scheduling Key: SPARK-43947 URL: https://issues.apache.org/jira/browse/SPARK-43947 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 3.4.0 Reporter: Jacek Laskowski [ResourceUtils.listResourceIds|https://github.com/apache/spark/blob/807abf9c53ee8c1c7ef69646ebd8a266f60d5580/core/src/main/scala/org/apache/spark/resource/ResourceUtils.scala#L152-L155] can throw an exception for any missing config, not just `amount`. {code:scala} val index = key.indexOf('.') if (index < 0) { throw new SparkException(s"You must specify an amount config for resource: $key " + s"config: $componentName.$RESOURCE_PREFIX.$key") } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43912) Incorrect SparkException for Stage-Level Scheduling in local mode
Jacek Laskowski created SPARK-43912: --- Summary: Incorrect SparkException for Stage-Level Scheduling in local mode Key: SPARK-43912 URL: https://issues.apache.org/jira/browse/SPARK-43912 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 3.4.0 Environment: ```text scala> println(spark.version) 3.4.0 scala> println(sc.master) local[*] ``` Reporter: Jacek Laskowski While in `local[*]` mode, the following `SparkException` is thrown: ```text org.apache.spark.SparkException: TaskResourceProfiles are only supported for Standalone cluster for now when dynamic allocation is disabled. at org.apache.spark.resource.ResourceProfileManager.isSupported(ResourceProfileManager.scala:71) at org.apache.spark.resource.ResourceProfileManager.addResourceProfile(ResourceProfileManager.scala:126) at org.apache.spark.rdd.RDD.withResources(RDD.scala:1802) ... 42 elided ``` This happens for the following snippet: ```scala val rdd = sc.range(0, 9) import org.apache.spark.resource.ResourceProfileBuilder val rpb = new ResourceProfileBuilder val rp1 = rpb.build() rdd.withResources(rp1) ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
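For reference, a hedged sketch of the setup the error message asks for: the same kind of profile, but with explicit task requests and submitted to a standalone master with dynamic allocation disabled. The master URL and resource amounts below are assumptions for illustration, not taken from the issue.

{code:scala}
// Run against e.g. --master spark://<host>:7077 with spark.dynamicAllocation.enabled=false.
import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}

val treqs = new TaskResourceRequests().cpus(2)               // illustrative amount
val rp = new ResourceProfileBuilder().require(treqs).build()
val rdd = sc.range(0, 9).withResources(rp)
rdd.count()
{code}

In local[*] mode the same withResources call fails in ResourceProfileManager.isSupported, as shown in the stack trace above.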
[jira] [Updated] (SPARK-43152) User-defined output metadata path (_spark_metadata)
[ https://issues.apache.org/jira/browse/SPARK-43152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-43152: Summary: User-defined output metadata path (_spark_metadata) (was: Parametrisable output metadata path (_spark_metadata)) > User-defined output metadata path (_spark_metadata) > --- > > Key: SPARK-43152 > URL: https://issues.apache.org/jira/browse/SPARK-43152 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Wojciech Indyk >Priority: Major > > Currently the path of the output metadata is hardcoded: the metadata is > saved in the output path in the _spark_metadata folder. It's a constraint on > the structure of paths that could easily be relaxed by making the output > metadata path configurable. It would help with issues like [changing output directory of > spark streaming > job|https://kb.databricks.com/en_US/streaming/file-sink-streaming], [two jobs > writing to the same output > path|https://issues.apache.org/jira/browse/SPARK-30542] or [partition > discovery|https://stackoverflow.com/questions/61904732/is-it-possible-to-change-location-of-spark-metadata-folder-in-spark-structured/61905158]. > It would also help separate metadata from data in the path structure. > The main target of the change is the getMetadataLogPath method in FileStreamSink. It > has access to sqlConf, so this method can override the default > _spark_metadata path if one is defined in config. Introducing a configurable > metadata path also requires reconsidering the meaning of the hasMetadata method in > FileStreamSink. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
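A minimal sketch of the proposed override point, assuming a new SQL conf key. The key name spark.sql.streaming.fileSink.metadataDir below is made up for illustration and is not an existing Spark option, and the method here only approximates FileStreamSink.getMetadataLogPath.

{code:scala}
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.internal.SQLConf

// Sketch (not actual Spark code): use a user-defined location when configured,
// otherwise keep today's hardcoded <output>/_spark_metadata default.
def getMetadataLogPath(outputPath: Path, sqlConf: SQLConf): Path = {
  val userDefined = sqlConf.getConfString("spark.sql.streaming.fileSink.metadataDir", "")
  if (userDefined.nonEmpty) new Path(userDefined)
  else new Path(outputPath, "_spark_metadata")
}
{code}

As the description notes, hasMetadata would need a matching change, since it could no longer assume the log lives under the output path.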
[jira] [Commented] (SPARK-42977) spark sql Disable vectorized faild
[ https://issues.apache.org/jira/browse/SPARK-42977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707256#comment-17707256 ] Jacek Laskowski commented on SPARK-42977: - Unless you can reproduce it without Iceberg, it's probably an Iceberg issue and should be reported in https://github.com/apache/iceberg/issues. > spark sql Disable vectorized faild > --- > > Key: SPARK-42977 > URL: https://issues.apache.org/jira/browse/SPARK-42977 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 3.3.2 >Reporter: liu >Priority: Major > Fix For: 3.3.2 > > > spark-sql config > {code:java} > ./spark-sql --packages > org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.2.0\ > --conf > spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions > \ > --conf > spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \ > --conf spark.sql.catalog.spark_catalog.type=hive \ > --conf spark.sql.iceberg.handle-timestamp-without-timezone=true \ > --conf spark.sql.parquet.binaryAsString=true \ > --conf spark.sql.parquet.enableVectorizedReader=false \ > --conf spark.sql.parquet.enableNestedColumnVectorizedReader=true \ > --conf spark.sql.parquet.recordLevelFilter=true {code} > > I have configured spark.sql.parquet.enableVectorizedReader=false, but when I query an Iceberg Parquet table the > following error occurs: > > > {code:java} > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:498) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:286) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) at > org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) at > org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: > java.lang.UnsupportedOperationException: Cannot support vectorized reads for > column [hzxm] optional binary hzxm = 8 with encoding DELTA_BYTE_ARRAY.
> Disable vectorized reads to read this table/file at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator.initDataReader(VectorizedPageIterator.java:100) > at > org.apache.iceberg.parquet.BasePageIterator.initFromPage(BasePageIterator.java:140) > at > org.apache.iceberg.parquet.BasePageIterator$1.visit(BasePageIterator.java:105) > at > org.apache.iceberg.parquet.BasePageIterator$1.visit(BasePageIterator.java:96) > at > org.apache.iceberg.shaded.org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:192) > at > org.apache.iceberg.parquet.BasePageIterator.setPage(BasePageIterator.java:95) > at > org.apache.iceberg.parquet.BaseColumnIterator.advance(BaseColumnIterator.java:61) > at > org.apache.iceberg.parquet.BaseColumnIterator.setPageSource(BaseColumnIterator.java:50) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator.setRowGroupInfo(Vec > {code} > > > *{color:#FF}Caused by: java.lang.UnsupportedOperationException: Cannot > support vectorized reads for column [hzxm] optional binary hzxm = 8 with > encoding DELTA_BYTE_ARRAY. Disable vectorized reads to read this > table/file{color}* > > > It seems this parameter has not taken effect. How can I turn off vectorized reads so that I can > successfully query the table? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
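If the table is an Iceberg table, Iceberg's Parquet vectorization is controlled by Iceberg's own table property rather than by spark.sql.parquet.enableVectorizedReader. A hedged sketch follows; the property name is from the Iceberg documentation as I recall it and the catalog/table names are illustrative, so please verify against the Iceberg version in use.

{code:scala}
// Disable Iceberg's vectorized Parquet reads for one table.
spark.sql(
  """ALTER TABLE spark_catalog.db.my_table
    |SET TBLPROPERTIES ('read.parquet.vectorization.enabled' = 'false')""".stripMargin)
{code}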
[jira] [Updated] (SPARK-42496) Introducting Spark Connect at main page
[ https://issues.apache.org/jira/browse/SPARK-42496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-42496: Summary: Introducting Spark Connect at main page (was: Introduction Spark Connect at main page) > Introducting Spark Connect at main page > --- > > Key: SPARK-42496 > URL: https://issues.apache.org/jira/browse/SPARK-42496 > Project: Spark > Issue Type: Sub-task > Components: Connect, Documentation >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > We should document the introduction of Spark Connect at PySpark main > documentation page to give a summary to users. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40821) Fix late record filtering to support chaining of stateful operators
[ https://issues.apache.org/jira/browse/SPARK-40821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-40821: Summary: Fix late record filtering to support chaining of stateful operators (was: Fix late record filtering to support chaining of steteful operators) > Fix late record filtering to support chaining of stateful operators > --- > > Key: SPARK-40821 > URL: https://issues.apache.org/jira/browse/SPARK-40821 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Alex Balikov >Assignee: Alex Balikov >Priority: Major > Fix For: 3.4.0 > > > Currently chaining of stateful operators in Spark Structured Streaming is not > supported for various reasons and is blocked by the unsupported operations > check (spark.sql.streaming.unsupportedOperationCheck flag). We propose to fix > this as chaining of stateful operators is a common streaming scenario - e.g. > stream-stream join -> windowed aggregation > window aggregation -> window aggregation > etc > What is broken: > # every stateful operator performs late record filtering against the global > watermark. When chaining stateful operators (e.g. window aggregations) the > output produced by the first stateful operator is effectively late against > the watermark and thus filtered out by the next operator's late record > filtering (technically the next operator should not do late record filtering > but it can be changed to assert for correctness detection, etc) > # when chaining window aggregations, the first window aggregating operator > produces records with schema \{ window: { start: Timestamp, end: Timestamp }, > agg: Long } - there is no explicit event time in the schema to be used by > the next stateful operator (the correct event time should be window.end - 1 ) > # stream-stream time-interval join can produce late records by semantics, > e.g. if the join condition is: > left.eventTime BETWEEN right.eventTime - INTERVAL 1 HOUR AND right.eventTime + > INTERVAL 1 HOUR > the produced records can be delayed by 1 hr relative to the > watermark. > Proposed fixes: > 1. Issue 1 can be fixed by performing late record filtering against the previous > microbatch watermark instead of the current microbatch watermark. > 2. Issue 2 can be fixed by allowing the window and session_window functions to work > on the window column directly and compute the correct event time > transparently to the user. Also, introduce a window_time SQL function to > compute the correct event time from the window column. > 3. Issue 3 can be fixed by adding support for per-operator watermarks instead of a > single global watermark. In the example of a stream-stream time interval join > followed by a stateful operator, the join operator will 'delay' the > downstream operator watermarks by a correct value to handle the delayed > records. Only stream-stream time-interval joins will delay the > watermark; other operators will not delay downstream watermarks. > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
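For illustration, a hedged sketch of the chained windowed aggregation this work enables, using the window_time function mentioned in proposed fix 2 (the source, column names and window sizes are illustrative, not taken from the issue):

{code:scala}
import org.apache.spark.sql.functions.{col, count, window, window_time}

// First windowed aggregation; its output carries a window struct but no event-time column.
val events = spark.readStream.format("rate").load()
  .withWatermark("timestamp", "10 seconds")

val perMinute = events
  .groupBy(window(col("timestamp"), "1 minute"))
  .agg(count("*").as("cnt"))

// window_time (Spark 3.4) derives an event time (window.end - 1) from the window column,
// so a second windowed aggregation can be chained on top of the first one.
val perHour = perMinute
  .withColumn("eventTime", window_time(col("window")))
  .groupBy(window(col("eventTime"), "1 hour"))
  .agg(count("*").as("windows"))
{code}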
[jira] [Updated] (SPARK-40807) "RocksDB: commit - pause bg time total" metric always 0
[ https://issues.apache.org/jira/browse/SPARK-40807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-40807: Summary: "RocksDB: commit - pause bg time total" metric always 0 (was: RocksDBStateStore always 0 for "RocksDB: commit - pause bg time total" metric) > "RocksDB: commit - pause bg time total" metric always 0 > --- > > Key: SPARK-40807 > URL: https://issues.apache.org/jira/browse/SPARK-40807 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Jacek Laskowski >Priority: Minor > Attachments: spark-streams-commit-pause-bg-time.png > > > {{RocksDBStateStore}} uses > [pauseBg|https://github.com/apache/spark/blob/v3.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreProvider.scala#L131] > key to report "RocksDB: commit - pause bg time" metric while {{RocksDB}} > uses > [pause|https://github.com/apache/spark/blob/v3.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala#L308] > to provide a value every commit. That leads to a name mismatch and 0 > reported (as the default value for no metrics). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
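A simplified illustration of the mismatch (not the actual Spark classes): the commit timings are published under the key "pause", but the metric is read back under "pauseBg", so the lookup always falls through to the default of 0.

{code:scala}
// Timings as produced on commit (key "pause"); the value in ms is illustrative.
val commitLatencyMs: Map[String, Long] = Map("pause" -> 12L)

// The metric is looked up under a different key, so it always reports the default 0.
val pauseBgTimeTotal = commitLatencyMs.getOrElse("pauseBg", 0L)
{code}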
[jira] [Updated] (SPARK-40807) RocksDBStateStore always 0 for "RocksDB: commit - pause bg time total" metric
[ https://issues.apache.org/jira/browse/SPARK-40807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-40807: Summary: RocksDBStateStore always 0 for "RocksDB: commit - pause bg time total" metric (was: RocksDBStateStore always 0 for pause bg time total metric) > RocksDBStateStore always 0 for "RocksDB: commit - pause bg time total" metric > - > > Key: SPARK-40807 > URL: https://issues.apache.org/jira/browse/SPARK-40807 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Jacek Laskowski >Priority: Minor > Attachments: spark-streams-commit-pause-bg-time.png > > > {{RocksDBStateStore}} uses > [pauseBg|https://github.com/apache/spark/blob/v3.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreProvider.scala#L131] > key to report "RocksDB: commit - pause bg time" metric while {{RocksDB}} > uses > [pause|https://github.com/apache/spark/blob/v3.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala#L308] > to provide a value every commit. That leads to a name mismatch and 0 > reported (as the default value for no metrics). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40807) RocksDBStateStore always 0 for "RocksDB: commit - pause bg time total" metric
[ https://issues.apache.org/jira/browse/SPARK-40807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-40807: Attachment: spark-streams-commit-pause-bg-time.png > RocksDBStateStore always 0 for "RocksDB: commit - pause bg time total" metric > - > > Key: SPARK-40807 > URL: https://issues.apache.org/jira/browse/SPARK-40807 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Jacek Laskowski >Priority: Minor > Attachments: spark-streams-commit-pause-bg-time.png > > > {{RocksDBStateStore}} uses > [pauseBg|https://github.com/apache/spark/blob/v3.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreProvider.scala#L131] > key to report "RocksDB: commit - pause bg time" metric while {{RocksDB}} > uses > [pause|https://github.com/apache/spark/blob/v3.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala#L308] > to provide a value every commit. That leads to a name mismatch and 0 > reported (as the default value for no metrics). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40807) RocksDBStateStore always 0 for pause bg time total metric
Jacek Laskowski created SPARK-40807: --- Summary: RocksDBStateStore always 0 for pause bg time total metric Key: SPARK-40807 URL: https://issues.apache.org/jira/browse/SPARK-40807 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.3.0 Reporter: Jacek Laskowski {{RocksDBStateStore}} uses [pauseBg|https://github.com/apache/spark/blob/v3.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreProvider.scala#L131] key to report "RocksDB: commit - pause bg time" metric while {{RocksDB}} uses [pause|https://github.com/apache/spark/blob/v3.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala#L308] to provide a value every commit. That leads to a name mismatch and 0 reported (as the default value for no metrics). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins
[ https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533465#comment-17533465 ] Jacek Laskowski commented on SPARK-17556: - Given: # "I'm running a large query with over 100,000 tasks." # "Total size of serialized results ... is bigger than spark.driver.maxResultSize". I think the issue is not a broadcast join but the size of the result (as computed by these 100k tasks). They have to report back to the driver and I can't think of a reason why a broadcast join would make it any worse. I must be missing something obvious (and chimed in to learn a bit about Spark SQL from you today :)) > Executor side broadcast for broadcast joins > --- > > Key: SPARK-17556 > URL: https://issues.apache.org/jira/browse/SPARK-17556 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Reporter: Reynold Xin >Priority: Major > Attachments: executor broadcast.pdf, executor-side-broadcast.pdf > > > Currently in Spark SQL, in order to perform a broadcast join, the driver must > collect the result of an RDD and then broadcast it. This introduces some > extra latency. It might be possible to broadcast directly from executors. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
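For context, a very simplified sketch of the driver-side path the proposal wants to avoid. The DataFrame and sizes are illustrative, and this is only the shape of the data flow being discussed, not how Spark's broadcast exchange is actually implemented.

{code:scala}
// The build side of a broadcast join is pulled to the driver and re-shipped from there.
// This driver round-trip is what spark.driver.maxResultSize guards against.
val smallDf = spark.range(0, 100).toDF("id")
val buildSideRows = smallDf.collect()                          // rows collected on the driver
val broadcasted = spark.sparkContext.broadcast(buildSideRows)  // re-shipped to executors
{code}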
[jira] [Resolved] (SPARK-36904) The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH
[ https://issues.apache.org/jira/browse/SPARK-36904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski resolved SPARK-36904. - Resolution: Invalid I finally managed to find the root cause of the issue which is {{conf/hive-site.xml}} in {{HIVE_HOME}} with the driver configured (!) Sorry for a false alarm. > The specified datastore driver ("org.postgresql.Driver") was not found in the > CLASSPATH > --- > > Key: SPARK-36904 > URL: https://issues.apache.org/jira/browse/SPARK-36904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 > Environment: Spark 3.2.0 (RC6) > {code:java} > $ ./bin/spark-shell --version > > > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.2.0 > /_/ > Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.12 > Branch heads/v3.2.0-rc6 > Compiled by user jacek on 2021-09-30T10:44:35Z > Revision dde73e2e1c7e55c8e740cb159872e081ddfa7ed6 > Url https://github.com/apache/spark.git > Type --help for more information. > {code} > Built from [https://github.com/apache/spark/commits/v3.2.0-rc6] using the > following command: > {code:java} > $ ./build/mvn \ > -Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \ > -DskipTests \ > clean install > {code} > {code:java} > $ java -version > openjdk version "11.0.12" 2021-07-20 > OpenJDK Runtime Environment Temurin-11.0.12+7 (build 11.0.12+7) > OpenJDK 64-Bit Server VM Temurin-11.0.12+7 (build 11.0.12+7, mixed mode) > {code} >Reporter: Jacek Laskowski >Priority: Critical > Attachments: exception.txt > > > It looks similar to [hivethriftserver built into spark3.0.0. is throwing > error "org.postgresql.Driver" was not found in the > CLASSPATH|https://stackoverflow.com/q/62534653/1305344], but reporting here > for future reference. > After I built the 3.2.0 (RC6) I ran `spark-shell` to execute `sql("describe > table covid_19")`. That gave me the exception (a full version is attached): > {code} > Caused by: java.lang.reflect.InvocationTargetException: > org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" > plugin to create a ConnectionPool gave an error : The specified datastore > driver ("org.postgresql.Driver") was not found in the CLASSPATH. Please check > your CLASSPATH specification, and the name of the driver. 
> at jdk.internal.reflect.GeneratedConstructorAccessor64.newInstance(Unknown > Source) > at > java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:606) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:330) > at > org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:203) > at > org.datanucleus.store.AbstractStoreManager.(AbstractStoreManager.java:162) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:285) > at jdk.internal.reflect.GeneratedConstructorAccessor63.newInstance(Unknown > Source) > at > java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:606) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) > at > org.datanucleus.NucleusContextHelper.createStoreManagerForProperties(NucleusContextHelper.java:133) > at > org.datanucleus.PersistenceNucleusContextImpl.initialise(PersistenceNucleusContextImpl.java:422) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:817) > ... 171 more > Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the > "BONECP" plugin to create a ConnectionPool gave an error : The specified > datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH. > Please check your CLASSPATH specification, and the name of the driver. > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:232) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:117) > at > org.d
[jira] [Commented] (SPARK-36904) The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH
[ https://issues.apache.org/jira/browse/SPARK-36904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17422784#comment-17422784 ] Jacek Laskowski commented on SPARK-36904: - The table does not exist. I simply tried to execute a random SQL command. I'll give the binary tarball a go instead but there could be a difference in how it was built (what profiles were used). I'll investigate further. Thanks. > The specified datastore driver ("org.postgresql.Driver") was not found in the > CLASSPATH > --- > > Key: SPARK-36904 > URL: https://issues.apache.org/jira/browse/SPARK-36904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 > Environment: Spark 3.2.0 (RC6) > {code:java} > $ ./bin/spark-shell --version > > > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.2.0 > /_/ > Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.12 > Branch heads/v3.2.0-rc6 > Compiled by user jacek on 2021-09-30T10:44:35Z > Revision dde73e2e1c7e55c8e740cb159872e081ddfa7ed6 > Url https://github.com/apache/spark.git > Type --help for more information. > {code} > Built from [https://github.com/apache/spark/commits/v3.2.0-rc6] using the > following command: > {code:java} > $ ./build/mvn \ > -Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \ > -DskipTests \ > clean install > {code} > {code:java} > $ java -version > openjdk version "11.0.12" 2021-07-20 > OpenJDK Runtime Environment Temurin-11.0.12+7 (build 11.0.12+7) > OpenJDK 64-Bit Server VM Temurin-11.0.12+7 (build 11.0.12+7, mixed mode) > {code} >Reporter: Jacek Laskowski >Priority: Critical > Attachments: exception.txt > > > It looks similar to [hivethriftserver built into spark3.0.0. is throwing > error "org.postgresql.Driver" was not found in the > CLASSPATH|https://stackoverflow.com/q/62534653/1305344], but reporting here > for future reference. > After I built the 3.2.0 (RC6) I ran `spark-shell` to execute `sql("describe > table covid_19")`. That gave me the exception (a full version is attached): > {code} > Caused by: java.lang.reflect.InvocationTargetException: > org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" > plugin to create a ConnectionPool gave an error : The specified datastore > driver ("org.postgresql.Driver") was not found in the CLASSPATH. Please check > your CLASSPATH specification, and the name of the driver. 
> at jdk.internal.reflect.GeneratedConstructorAccessor64.newInstance(Unknown > Source) > at > java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:606) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:330) > at > org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:203) > at > org.datanucleus.store.AbstractStoreManager.(AbstractStoreManager.java:162) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:285) > at jdk.internal.reflect.GeneratedConstructorAccessor63.newInstance(Unknown > Source) > at > java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:606) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) > at > org.datanucleus.NucleusContextHelper.createStoreManagerForProperties(NucleusContextHelper.java:133) > at > org.datanucleus.PersistenceNucleusContextImpl.initialise(PersistenceNucleusContextImpl.java:422) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:817) > ... 171 more > Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the > "BONECP" plugin to create a ConnectionPool gave an error : The specified > datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH. > Please check your CLASSPATH specification, and the name of the driver. > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:232) > at > org.datanucleus.st
[jira] [Commented] (SPARK-36904) The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH
[ https://issues.apache.org/jira/browse/SPARK-36904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17422727#comment-17422727 ] Jacek Laskowski commented on SPARK-36904: - The solution from [this answer on SO|https://stackoverflow.com/a/62588338/1305344] didn't work for me and am still facing the exception. {code} ./build/mvn \ -Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \ -DskipTests \ dependency:purge-local-repository \ clean install {code} {code} Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" plugin to create a ConnectionPool gave an error : The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver. at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:232) at org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:117) at org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:82) ... 190 more Caused by: org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException: The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver. at org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:58) at org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54) at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:213) ... 192 more {code} > The specified datastore driver ("org.postgresql.Driver") was not found in the > CLASSPATH > --- > > Key: SPARK-36904 > URL: https://issues.apache.org/jira/browse/SPARK-36904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 > Environment: Spark 3.2.0 (RC6) > {code:java} > $ ./bin/spark-shell --version > > > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.2.0 > /_/ > Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.12 > Branch heads/v3.2.0-rc6 > Compiled by user jacek on 2021-09-30T10:44:35Z > Revision dde73e2e1c7e55c8e740cb159872e081ddfa7ed6 > Url https://github.com/apache/spark.git > Type --help for more information. > {code} > Built from [https://github.com/apache/spark/commits/v3.2.0-rc6] using the > following command: > {code:java} > $ ./build/mvn \ > -Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \ > -DskipTests \ > clean install > {code} > {code:java} > $ java -version > openjdk version "11.0.12" 2021-07-20 > OpenJDK Runtime Environment Temurin-11.0.12+7 (build 11.0.12+7) > OpenJDK 64-Bit Server VM Temurin-11.0.12+7 (build 11.0.12+7, mixed mode) > {code} >Reporter: Jacek Laskowski >Priority: Critical > Attachments: exception.txt > > > It looks similar to [hivethriftserver built into spark3.0.0. is throwing > error "org.postgresql.Driver" was not found in the > CLASSPATH|https://stackoverflow.com/q/62534653/1305344], but reporting here > for future reference. > After I built the 3.2.0 (RC6) I ran `spark-shell` to execute `sql("describe > table covid_19")`. 
That gave me the exception (a full version is attached): > {code} > Caused by: java.lang.reflect.InvocationTargetException: > org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" > plugin to create a ConnectionPool gave an error : The specified datastore > driver ("org.postgresql.Driver") was not found in the CLASSPATH. Please check > your CLASSPATH specification, and the name of the driver. > at jdk.internal.reflect.GeneratedConstructorAccessor64.newInstance(Unknown > Source) > at > java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:606) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:330) > at > org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:203) > at > org.datanucleus.store.AbstractStoreManager.(AbstractStoreManager.java:162) > at > org.data
[jira] [Updated] (SPARK-36904) The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH
[ https://issues.apache.org/jira/browse/SPARK-36904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-36904: Attachment: exception.txt > The specified datastore driver ("org.postgresql.Driver") was not found in the > CLASSPATH > --- > > Key: SPARK-36904 > URL: https://issues.apache.org/jira/browse/SPARK-36904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 > Environment: Spark 3.2.0 (RC6) > {code:java} > $ ./bin/spark-shell --version > > > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.2.0 > /_/ > Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.12 > Branch heads/v3.2.0-rc6 > Compiled by user jacek on 2021-09-30T10:44:35Z > Revision dde73e2e1c7e55c8e740cb159872e081ddfa7ed6 > Url https://github.com/apache/spark.git > Type --help for more information. > {code} > Built from [https://github.com/apache/spark/commits/v3.2.0-rc6] using the > following command: > {code:java} > $ ./build/mvn \ > -Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \ > -DskipTests \ > clean install > {code} > {code:java} > $ java -version > openjdk version "11.0.12" 2021-07-20 > OpenJDK Runtime Environment Temurin-11.0.12+7 (build 11.0.12+7) > OpenJDK 64-Bit Server VM Temurin-11.0.12+7 (build 11.0.12+7, mixed mode) > {code} >Reporter: Jacek Laskowski >Priority: Critical > Attachments: exception.txt > > > It looks similar to [hivethriftserver built into spark3.0.0. is throwing > error "org.postgresql.Driver" was not found in the > CLASSPATH|https://stackoverflow.com/q/62534653/1305344], but reporting here > for future reference. > After I built the 3.2.0 (RC6) I ran `spark-shell` to execute `sql("describe > table covid_19")`. That gave me the exception (a full version is attached): > {code} > Caused by: java.lang.reflect.InvocationTargetException: > org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" > plugin to create a ConnectionPool gave an error : The specified datastore > driver ("org.postgresql.Driver") was not found in the CLASSPATH. Please check > your CLASSPATH specification, and the name of the driver. 
> at jdk.internal.reflect.GeneratedConstructorAccessor64.newInstance(Unknown > Source) > at > java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:606) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:330) > at > org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:203) > at > org.datanucleus.store.AbstractStoreManager.(AbstractStoreManager.java:162) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:285) > at jdk.internal.reflect.GeneratedConstructorAccessor63.newInstance(Unknown > Source) > at > java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:606) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) > at > org.datanucleus.NucleusContextHelper.createStoreManagerForProperties(NucleusContextHelper.java:133) > at > org.datanucleus.PersistenceNucleusContextImpl.initialise(PersistenceNucleusContextImpl.java:422) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:817) > ... 171 more > Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the > "BONECP" plugin to create a ConnectionPool gave an error : The specified > datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH. > Please check your CLASSPATH specification, and the name of the driver. > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:232) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:117) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:82) > ... 187 more > Caused by: > org.datanucleus.store.rdbms.connectionpool.Datas
[jira] [Updated] (SPARK-36904) The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH
[ https://issues.apache.org/jira/browse/SPARK-36904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-36904: Environment: Spark 3.2.0 (RC6) {code:java} $ ./bin/spark-shell --version Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.2.0 /_/ Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.12 Branch heads/v3.2.0-rc6 Compiled by user jacek on 2021-09-30T10:44:35Z Revision dde73e2e1c7e55c8e740cb159872e081ddfa7ed6 Url https://github.com/apache/spark.git Type --help for more information. {code} Built from [https://github.com/apache/spark/commits/v3.2.0-rc6] using the following command: {code:java} $ ./build/mvn \ -Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \ -DskipTests \ clean install {code} {code:java} $ java -version openjdk version "11.0.12" 2021-07-20 OpenJDK Runtime Environment Temurin-11.0.12+7 (build 11.0.12+7) OpenJDK 64-Bit Server VM Temurin-11.0.12+7 (build 11.0.12+7, mixed mode) {code} was: {code:java} $ ./bin/spark-shell --version Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.2.0 /_/ Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.12 Branch heads/v3.2.0-rc6 Compiled by user jacek on 2021-09-30T10:44:35Z Revision dde73e2e1c7e55c8e740cb159872e081ddfa7ed6 Url https://github.com/apache/spark.git Type --help for more information. {code} Built from [https://github.com/apache/spark/commits/v3.2.0-rc6] using the following command: {code:java} $ ./build/mvn \ -Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \ -DskipTests \ clean install {code} {code:java} $ java -version openjdk version "11.0.12" 2021-07-20 OpenJDK Runtime Environment Temurin-11.0.12+7 (build 11.0.12+7) OpenJDK 64-Bit Server VM Temurin-11.0.12+7 (build 11.0.12+7, mixed mode) {code} > The specified datastore driver ("org.postgresql.Driver") was not found in the > CLASSPATH > --- > > Key: SPARK-36904 > URL: https://issues.apache.org/jira/browse/SPARK-36904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 > Environment: Spark 3.2.0 (RC6) > {code:java} > $ ./bin/spark-shell --version > > > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.2.0 > /_/ > Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.12 > Branch heads/v3.2.0-rc6 > Compiled by user jacek on 2021-09-30T10:44:35Z > Revision dde73e2e1c7e55c8e740cb159872e081ddfa7ed6 > Url https://github.com/apache/spark.git > Type --help for more information. > {code} > Built from [https://github.com/apache/spark/commits/v3.2.0-rc6] using the > following command: > {code:java} > $ ./build/mvn \ > -Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \ > -DskipTests \ > clean install > {code} > {code:java} > $ java -version > openjdk version "11.0.12" 2021-07-20 > OpenJDK Runtime Environment Temurin-11.0.12+7 (build 11.0.12+7) > OpenJDK 64-Bit Server VM Temurin-11.0.12+7 (build 11.0.12+7, mixed mode) > {code} >Reporter: Jacek Laskowski >Priority: Critical > > It looks similar to [hivethriftserver built into spark3.0.0. is throwing > error "org.postgresql.Driver" was not found in the > CLASSPATH|https://stackoverflow.com/q/62534653/1305344], but reporting here > for future reference. > After I built the 3.2.0 (RC6) I ran `spark-shell` to execute `sql("describe > table covid_19")`. 
That gave me the exception (a full version is attached): > {code} > Caused by: java.lang.reflect.InvocationTargetException: > org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" > plugin to create a ConnectionPool gave an error : The specified datastore > driver ("org.postgresql.Driver") was not found in the CLASSPATH. Please check > your CLASSPATH specification, and the name of the driver. > at jdk.internal.reflect.GeneratedConstructorAccessor64.newInstance(Unknown > Source) > at > java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExte
[jira] [Created] (SPARK-36904) The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH
Jacek Laskowski created SPARK-36904: --- Summary: The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH Key: SPARK-36904 URL: https://issues.apache.org/jira/browse/SPARK-36904 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Environment: {code:java} $ ./bin/spark-shell --version Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.2.0 /_/ Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.12 Branch heads/v3.2.0-rc6 Compiled by user jacek on 2021-09-30T10:44:35Z Revision dde73e2e1c7e55c8e740cb159872e081ddfa7ed6 Url https://github.com/apache/spark.git Type --help for more information. {code} Built from [https://github.com/apache/spark/commits/v3.2.0-rc6] using the following command: {code:java} $ ./build/mvn \ -Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \ -DskipTests \ clean install {code} {code:java} $ java -version openjdk version "11.0.12" 2021-07-20 OpenJDK Runtime Environment Temurin-11.0.12+7 (build 11.0.12+7) OpenJDK 64-Bit Server VM Temurin-11.0.12+7 (build 11.0.12+7, mixed mode) {code} Reporter: Jacek Laskowski It looks similar to [hivethriftserver built into spark3.0.0. is throwing error "org.postgresql.Driver" was not found in the CLASSPATH|https://stackoverflow.com/q/62534653/1305344], but reporting here for future reference. After I built the 3.2.0 (RC6) I ran `spark-shell` to execute `sql("describe table covid_19")`. That gave me the exception (a full version is attached): {code} Caused by: java.lang.reflect.InvocationTargetException: org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" plugin to create a ConnectionPool gave an error : The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver. at jdk.internal.reflect.GeneratedConstructorAccessor64.newInstance(Unknown Source) at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490) at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:606) at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:330) at org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:203) at org.datanucleus.store.AbstractStoreManager.(AbstractStoreManager.java:162) at org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:285) at jdk.internal.reflect.GeneratedConstructorAccessor63.newInstance(Unknown Source) at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490) at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:606) at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) at org.datanucleus.NucleusContextHelper.createStoreManagerForProperties(NucleusContextHelper.java:133) at org.datanucleus.PersistenceNucleusContextImpl.initialise(PersistenceNucleusContextImpl.java:422) at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:817) ... 
171 more Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" plugin to create a ConnectionPool gave an error : The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver. at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:232) at org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:117) at org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:82) ... 187 more Caused by: org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException: The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver. at org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:58) at org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54) at org.datanucleus.store.rdbms.ConnectionF
[jira] [Resolved] (SPARK-34351) Running into "Py4JJavaError" while counting to text file or list using Pyspark, Jupyter notebook
[ https://issues.apache.org/jira/browse/SPARK-34351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski resolved SPARK-34351. - Resolution: Invalid Please use StackOverflow or the user@spark.a.o mailing list to ask this question (as described in [http://spark.apache.org/community.html]). See you there! > Running into "Py4JJavaError" while counting to text file or list using > Pyspark, Jupyter notebook > > > Key: SPARK-34351 > URL: https://issues.apache.org/jira/browse/SPARK-34351 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 > Environment: PS> python --version > *Python 3.6.8* > PS> jupyter --version > *jupyter core : 4.7.0* > *jupyter-notebook : 6.2.0* > qtconsole : 5.0.2 > ipython : 7.16.1 > ipykernel : 5.4.3 > jupyter client : 6.1.11 > jupyter lab : not installed > nbconvert : 6.0.7 > ipywidgets : 7.6.3 > nbformat : 5.1.2 > traitlets : 4.3.3 > PS > java -version > *java version "1.8.0_271"* > Java(TM) SE Runtime Environment (build 1.8.0_271-b09) > Java HotSpot(TM) 64-Bit Server VM (build 25.271-b09, mixed mode) > > Spark version > *spark-2.3.1-bin-hadoop2.7* >Reporter: Huseyin Elci >Priority: Major > > I run into the following error; any help resolving it is greatly appreciated. > *My Code 1:* > {code:python} > import findspark > findspark.init("C:\Spark") > from pyspark.sql import SparkSession > from pyspark.conf import SparkConf > spark = SparkSession.builder\ > .master("local[4]")\ > .appName("WordCount_RDD")\ > .getOrCreate() > sc = spark.sparkContext > data = "D:\\05 Spark\\data\\MyArticle.txt" > story_rdd = sc.textFile(data) > story_rdd.count() > {code} > *My Code 2:* > {code:python} > import findspark > findspark.init("C:\Spark") > from pyspark import SparkContext > sc = SparkContext() > mylist = [1,2,2,3,5,48,98,62,14,55] > mylist_rdd = sc.parallelize(mylist) > mylist_rdd.map(lambda x: x*x) > mylist_rdd.map(lambda x: x*x).collect() > {code} > *ERROR:* > I get the same error for both code snippets.
> {code:python} > --- > Py4JJavaError Traceback (most recent call last) > in > > 1 story_rdd.count() > C:\Spark\python\pyspark\rdd.py in count(self) > 1071 3 > 1072 """ > -> 1073 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() > 1074 > 1075 def stats(self): > C:\Spark\python\pyspark\rdd.py in sum(self) > 1062 6.0 > 1063 """ > -> 1064 return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add) > 1065 > 1066 def count(self): > C:\Spark\python\pyspark\rdd.py in fold(self, zeroValue, op) > 933 # zeroValue provided to each partition is unique from the one provided > 934 # to the final reduce call > --> 935 vals = self.mapPartitions(func).collect() > 936 return reduce(op, vals, zeroValue) > 937 > C:\Spark\python\pyspark\rdd.py in collect(self) > 832 """ > 833 with SCCallSiteSync(self.context) as css: > --> 834 sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) > 835 return list(_load_from_socket(sock_info, self._jrdd_deserializer)) > 836 > C:\Spark\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py in > __call__(self, *args) > 1255 answer = self.gateway_client.send_command(command) > 1256 return_value = get_return_value( > -> 1257 answer, self.gateway_client, self.target_id, self.name) > 1258 > 1259 for temp_arg in temp_args: > C:\Spark\python\pyspark\sql\utils.py in deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > C:\Spark\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 326 raise Py4JJavaError( > 327 "An error occurred while calling > {0} \{1} \{2} > .\n". > --> 328 format(target_id, ".", name), value) > 329 else: > 330 raise Py4JError( > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.collectAndServe. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 > in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 > (TID 1, localhost, executor driver): org.apache.spark.SparkException: Python > worker failed to connect back. > at > org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:148) > at > org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:76) > at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117) > at > org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:86) > at org.apache.spark.api.python.PythonRDD.comput
[jira] [Created] (SPARK-34264) Prevent incomplete master URLs for Spark on Kubernetes early
Jacek Laskowski created SPARK-34264: --- Summary: Prevent incomplete master URLs for Spark on Kubernetes early Key: SPARK-34264 URL: https://issues.apache.org/jira/browse/SPARK-34264 Project: Spark Issue Type: Improvement Components: Kubernetes, Spark Submit Affects Versions: 3.0.1, 3.1.1 Reporter: Jacek Laskowski It turns out that {{--master k8s://}} is accepted and, although it eventually leads to termination, the stacktraces displayed don't really tell what the real cause is. This may happen when the Kubernetes API server(s) are described by an environment variable that's not initialized in the current terminal. {code} $ ./bin/spark-shell --master k8s:// --verbose ... Spark config: (spark.jars,) (spark.app.name,Spark shell) (spark.submit.pyFiles,) (spark.ui.showConsoleProgress,true) (spark.submit.deployMode,client) (spark.master,k8s://https://) ... 21/01/27 14:29:44 ERROR Main: Failed to initialize Spark session. io.fabric8.kubernetes.client.KubernetesClientException: Failed to start websocket at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onFailure(WatchConnectionManager.java:208) ... Caused by: java.net.UnknownHostException: api: nodename nor servname provided, or not known at java.base/java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) at java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929) at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1519) at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848) at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1509) at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1368) at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1302) at okhttp3.Dns$1.lookup(Dns.java:40) at okhttp3.internal.connection.RouteSelector.resetNextInetSocketAddress(RouteSelector.java:185) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
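A hedged sketch of the kind of early check this issue asks for (not actual Spark code): fail fast when the k8s:// master URL carries no API server host, e.g. because $K8S_SERVER was unset in the shell.

{code:scala}
def validateK8sMaster(master: String): Unit = {
  require(master.startsWith("k8s://"), s"Not a Kubernetes master URL: $master")
  val host = master.stripPrefix("k8s://").stripPrefix("https://").stripPrefix("http://")
  require(host.nonEmpty,
    s"Kubernetes master URL '$master' does not specify an API server host")
}

// validateK8sMaster("k8s://") would now fail with a clear message
// instead of a java.net.UnknownHostException much later.
{code}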
[jira] [Commented] (SPARK-32333) Drop references to Master
[ https://issues.apache.org/jira/browse/SPARK-32333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270058#comment-17270058 ] Jacek Laskowski commented on SPARK-32333: - Just today when I was reading about a new ASF project - Apache Gobblin I found (highlighting mine): "Runs as a standalone cluster with *primary* and worker nodes." This "primary node" makes a lot of sense. > Drop references to Master > - > > Key: SPARK-32333 > URL: https://issues.apache.org/jira/browse/SPARK-32333 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > We have a lot of references to "master" in the code base. It will be > beneficial to remove references to problematic language that can alienate > potential community members. > SPARK-32004 removed references to slave > > Here is a IETF draft to fix up some of the most egregious examples > (master/slave, whitelist/backlist) with proposed alternatives. > https://tools.ietf.org/id/draft-knodel-terminology-00.html#rfc.section.1.1.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34158) Incorrect url of the only developer Matei in pom.xml
Jacek Laskowski created SPARK-34158: --- Summary: Incorrect url of the only developer Matei in pom.xml Key: SPARK-34158 URL: https://issues.apache.org/jira/browse/SPARK-34158 Project: Spark Issue Type: Improvement Components: Build, Spark Core Affects Versions: 3.1.1 Reporter: Jacek Laskowski {{[http://www.cs.berkeley.edu/~matei]}} in [pom.xml|https://github.com/apache/spark/blob/53fe365edb948d0e05a5ccb62f349cd9fcb4bb5d/pom.xml#L51] gives {quote}Resource not found The server has encountered a problem because the resource was not found. Your request was : [https://people.eecs.berkeley.edu/~matei] {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34131) NPE when driver.podTemplateFile defines no containers
Jacek Laskowski created SPARK-34131: --- Summary: NPE when driver.podTemplateFile defines no containers Key: SPARK-34131 URL: https://issues.apache.org/jira/browse/SPARK-34131 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.0.1 Reporter: Jacek Laskowski An empty pod template leads to the following NPE: {code} 21/01/15 18:44:32 ERROR KubernetesUtils: Encountered exception while attempting to load initial pod spec from file java.lang.NullPointerException at org.apache.spark.deploy.k8s.KubernetesUtils$.selectSparkContainer(KubernetesUtils.scala:108) at org.apache.spark.deploy.k8s.KubernetesUtils$.loadPodFromTemplate(KubernetesUtils.scala:88) at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$1(KubernetesDriverBuilder.scala:36) at scala.Option.map(Option.scala:230) at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:32) at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:98) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4(KubernetesClientApplication.scala:221) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4$adapted(KubernetesClientApplication.scala:215) at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2539) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:215) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:188) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} {code:java} $> cat empty-template.yml spec: {code} {code} $> ./bin/run-example \ --master k8s://$K8S_SERVER \ --deploy-mode cluster \ --conf spark.kubernetes.driver.podTemplateFile=empty-template.yml \ --name $POD_NAME \ --jars local:///opt/spark/examples/jars/spark-examples_2.12-3.0.1.jar \ --conf spark.kubernetes.container.image=spark:v3.0.1 \ --conf spark.kubernetes.driver.pod.name=$POD_NAME \ --conf spark.kubernetes.namespace=spark-demo \ --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ --verbose \ SparkPi 10 {code} It appears that the implicit requirement is that there's at least one well-defined container of any name (not necessarily {{spark.kubernetes.driver.podTemplateContainerName}}). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
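A minimal sketch of the guard the report suggests, using simplified stand-in types (the real code goes through the fabric8 client API, so the names and shapes here are assumptions):
{code:scala}
// Simplified stand-ins; the actual implementation uses io.fabric8 Pod/Container objects.
case class Container(name: String)
case class PodTemplate(containers: Seq[Container])

// Fail with a descriptive error instead of an NPE when the template defines no containers.
def selectSparkContainer(template: PodTemplate, wantedName: Option[String]): Container =
  template.containers match {
    case Seq() =>
      throw new IllegalArgumentException(
        "The pod template must define at least one container (spark.kubernetes.driver.podTemplateFile)")
    case containers =>
      wantedName.flatMap(n => containers.find(_.name == n)).getOrElse(containers.head)
  }
{code}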
[jira] [Resolved] (SPARK-34024) datasourceV1 VS dataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-34024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski resolved SPARK-34024. - Resolution: Invalid Please post questions to the u...@spark.apache.org mailing list or on StackOverflow at https://stackoverflow.com/questions/tagged/apache-spark > datasourceV1 VS dataSourceV2 > -- > > Key: SPARK-34024 > URL: https://issues.apache.org/jira/browse/SPARK-34024 > Project: Spark > Issue Type: Question > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: Zhenglin luo >Priority: Critical > > I found that DataSourceV2 has been through many versions .So why hasn't > datasourceV2 been used by default until now in the latest version.I want to > know if it’s because there is a big difference in execution efficiency > between v1 and v2. Or there are other reasons. > Thanks a lot -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27708) Add documentation for v2 data sources
[ https://issues.apache.org/jira/browse/SPARK-27708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16860013#comment-16860013 ] Jacek Laskowski commented on SPARK-27708: - [~rdblue] Mind if I asked you to update the requirements (= answer my questions)? Thanks. > Add documentation for v2 data sources > - > > Key: SPARK-27708 > URL: https://issues.apache.org/jira/browse/SPARK-27708 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: documentation > > Before the 3.0 release, the new v2 data sources should be documented. This > includes: > * How to plug in catalog implementations > * Catalog plugin configuration > * Multi-part identifier behavior > * Partition transforms > * Table properties that are used to pass table info (e.g. "provider") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27977) MicroBatchWriter should use StreamWriter for human-friendly textual representation (toString)
Jacek Laskowski created SPARK-27977: --- Summary: MicroBatchWriter should use StreamWriter for human-friendly textual representation (toString) Key: SPARK-27977 URL: https://issues.apache.org/jira/browse/SPARK-27977 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.3 Reporter: Jacek Laskowski The following is an extended explain for a streaming query: {code} == Parsed Logical Plan == WriteToDataSourceV2 org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4737caef +- Project [value#39 AS value#0] +- Streaming RelationV2 socket[value#39] (Options: [host=localhost,port=]) == Analyzed Logical Plan == WriteToDataSourceV2 org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4737caef +- Project [value#39 AS value#0] +- Streaming RelationV2 socket[value#39] (Options: [host=localhost,port=]) == Optimized Logical Plan == WriteToDataSourceV2 org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4737caef +- Streaming RelationV2 socket[value#39] (Options: [host=localhost,port=]) == Physical Plan == WriteToDataSourceV2 org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4737caef +- *(1) Project [value#39] +- *(1) ScanV2 socket[value#39] (Options: [host=localhost,port=]) {code} As you may have noticed, {{WriteToDataSourceV2}} is followed by the internal representation of {{MicroBatchWriter}}, which is a mere adapter for {{StreamWriter}}, e.g. {{ConsoleWriter}}. It'd be more debugging-friendly if the plans included whatever {{StreamWriter.toString}} would return (which in the case of {{ConsoleWriter}} would be {{ConsoleWriter[numRows=..., truncate=...]}} and gives more context). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
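A minimal sketch of the proposed change, assuming {{MicroBatchWriter}} simply wraps a {{StreamWriter}} (the class shapes are simplified stand-ins, not the actual Spark source):
{code:scala}
// Simplified stand-ins for the adapter described in the issue.
trait StreamWriter
case class ConsoleWriter(numRows: Int, truncate: Boolean) extends StreamWriter {
  override def toString: String = s"ConsoleWriter[numRows=$numRows, truncate=$truncate]"
}

class MicroBatchWriter(batchId: Long, writer: StreamWriter) {
  // Delegate to the wrapped writer so explain output shows the user-facing sink,
  // e.g. ConsoleWriter[numRows=20, truncate=true], instead of MicroBatchWriter@4737caef.
  override def toString: String = writer.toString
}
{code}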
[jira] [Created] (SPARK-27975) ConsoleSink should display alias and options for streaming progress
Jacek Laskowski created SPARK-27975: --- Summary: ConsoleSink should display alias and options for streaming progress Key: SPARK-27975 URL: https://issues.apache.org/jira/browse/SPARK-27975 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.3 Reporter: Jacek Laskowski {{console}} sink shows itself in progress with this internal instance representation as follows: {code:json} "sink" : { "description" : "org.apache.spark.sql.execution.streaming.ConsoleSinkProvider@12fa674a" } {code} That is not very user-friendly and would be much better for debugging if it included the alias and options as {{socket}} does: {code} "sources" : [ { "description" : "TextSocketV2[host: localhost, port: ]", ... } ], {code} The entire sample progress looks as follows: {code} 19/06/07 11:47:18 INFO MicroBatchExecution: Streaming query made progress: { "id" : "26bedc9f-076f-4b15-8e17-f09609aaecac", "runId" : "8c365e74-7ac9-4fad-bf1b-397eb086661e", "name" : "socket-console", "timestamp" : "2019-06-07T09:47:18.969Z", "batchId" : 2, "numInputRows" : 0, "inputRowsPerSecond" : 0.0, "durationMs" : { "getEndOffset" : 0, "setOffsetRange" : 0, "triggerExecution" : 0 }, "stateOperators" : [ ], "sources" : [ { "description" : "TextSocketV2[host: localhost, port: ]", "startOffset" : 0, "endOffset" : 0, "numInputRows" : 0, "inputRowsPerSecond" : 0.0 } ], "sink" : { "description" : "org.apache.spark.sql.execution.streaming.ConsoleSinkProvider@12fa674a" } } {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
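A sketch of the friendlier description the issue asks for, assuming the sink knows its alias and options (a simplified stand-in, not the actual {{ConsoleSinkProvider}} code):
{code:scala}
// Report the alias and options, as the socket source already does in its description.
class ConsoleSink(options: Map[String, String]) {
  override def toString: String = {
    val opts = options.map { case (k, v) => s"$k=$v" }.mkString(", ")
    s"console[$opts]"  // e.g. console[numRows=20, truncate=true]
  }
}
{code}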
[jira] [Updated] (SPARK-27933) Extracting common purge "behaviour" to the parent StreamExecution
[ https://issues.apache.org/jira/browse/SPARK-27933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-27933: Summary: Extracting common purge "behaviour" to the parent StreamExecution (was: Introduce StreamExecution.purge for removing entries from metadata logs) > Extracting common purge "behaviour" to the parent StreamExecution > - > > Key: SPARK-27933 > URL: https://issues.apache.org/jira/browse/SPARK-27933 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.3 >Reporter: Jacek Laskowski >Priority: Minor > > Extracting the common {{purge}} "behaviour" to the parent {{StreamExecution}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27933) Introduce StreamExecution.purge for removing entries from metadata logs
Jacek Laskowski created SPARK-27933: --- Summary: Introduce StreamExecution.purge for removing entries from metadata logs Key: SPARK-27933 URL: https://issues.apache.org/jira/browse/SPARK-27933 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.3 Reporter: Jacek Laskowski Extracting the common {{purge}} "behaviour" to the parent {{StreamExecution}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27708) Add documentation for v2 data sources
[ https://issues.apache.org/jira/browse/SPARK-27708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839846#comment-16839846 ] Jacek Laskowski commented on SPARK-27708: - What's really needed? I've been reviewing the code of the new v2 data sources, but don't understand "How to plug in catalog implementations" and the others. I'd like to help with the docs and would like to start with "How to plug in catalog implementations". Please guide. > Add documentation for v2 data sources > - > > Key: SPARK-27708 > URL: https://issues.apache.org/jira/browse/SPARK-27708 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Ryan Blue >Priority: Major > Labels: documentation > > Before the 3.0 release, the new v2 data sources should be documented. This > includes: > * How to plug in catalog implementations > * Catalog plugin configuration > * Multi-part identifier behavior > * Partition transforms > * Table properties that are used to pass table info (e.g. "provider") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27708) Add documentation for v2 data sources
[ https://issues.apache.org/jira/browse/SPARK-27708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-27708: Labels: documentation (was: docuentation) > Add documentation for v2 data sources > - > > Key: SPARK-27708 > URL: https://issues.apache.org/jira/browse/SPARK-27708 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Ryan Blue >Priority: Major > Labels: documentation > > Before the 3.0 release, the new v2 data sources should be documented. This > includes: > * How to plug in catalog implementations > * Catalog plugin configuration > * Multi-part identifier behavior > * Partition transforms > * Table properties that are used to pass table info (e.g. "provider") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20597) KafkaSourceProvider falls back on path as synonym for topic
[ https://issues.apache.org/jira/browse/SPARK-20597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765971#comment-16765971 ] Jacek Laskowski commented on SPARK-20597: - [~nimfadora] sure. go ahead. > KafkaSourceProvider falls back on path as synonym for topic > --- > > Key: SPARK-20597 > URL: https://issues.apache.org/jira/browse/SPARK-20597 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Trivial > Labels: starter > > # {{KafkaSourceProvider}} supports {{topic}} option that sets the Kafka topic > to save a DataFrame's rows to > # {{KafkaSourceProvider}} can use {{topic}} column to assign rows to Kafka > topics for writing > What seems a quite interesting option is to support {{start(path: String)}} > as the least precedence option in which {{path}} would designate the default > topic when no other options are used. > {code} > df.writeStream.format("kafka").start("topic") > {code} > See > http://apache-spark-developers-list.1001551.n3.nabble.com/KafkaSourceProvider-Why-topic-option-and-column-without-reverting-to-path-as-the-least-priority-td21458.html > for discussion -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26063) CatalystDataToAvro gives "UnresolvedException: Invalid call to dataType on unresolved object" when requested for numberedTreeString
Jacek Laskowski created SPARK-26063: --- Summary: CatalystDataToAvro gives "UnresolvedException: Invalid call to dataType on unresolved object" when requested for numberedTreeString Key: SPARK-26063 URL: https://issues.apache.org/jira/browse/SPARK-26063 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Jacek Laskowski The following gives {{org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'id}}: {code:java} // ./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.0 scala> spark.version res0: String = 2.4.0 import org.apache.spark.sql.avro._ val q = spark.range(1).withColumn("to_avro_id", to_avro('id)) val logicalPlan = q.queryExecution.logical scala> logicalPlan.expressions.drop(1).head.numberedTreeString org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'id at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105) at org.apache.spark.sql.avro.CatalystDataToAvro.simpleString(CatalystDataToAvro.scala:56) at org.apache.spark.sql.catalyst.expressions.Expression.verboseString(Expression.scala:233) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:548) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:569) at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:472) at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:469) at org.apache.spark.sql.catalyst.trees.TreeNode.numberedTreeString(TreeNode.scala:483) ... 51 elided{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
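As a possible workaround (an untested sketch, not a confirmed fix), inspecting the analyzed plan instead of the parsed logical plan avoids the unresolved {{'id}} attribute, so {{numberedTreeString}} can render {{CatalystDataToAvro}}:
{code:scala}
// ./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.0
import org.apache.spark.sql.avro._

val q = spark.range(1).withColumn("to_avro_id", to_avro('id))

// In the analyzed plan 'id is resolved to an AttributeReference, so dataType is available
// and the expression tree prints without the UnresolvedException.
val analyzed = q.queryExecution.analyzed
println(analyzed.expressions.drop(1).head.numberedTreeString)
{code}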
[jira] [Created] (SPARK-26062) Rename spark-avro external module to spark-sql-avro (to match spark-sql-kafka)
Jacek Laskowski created SPARK-26062: --- Summary: Rename spark-avro external module to spark-sql-avro (to match spark-sql-kafka) Key: SPARK-26062 URL: https://issues.apache.org/jira/browse/SPARK-26062 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Jacek Laskowski Given the name of {{spark-sql-kafka}} external module it seems appropriate (and consistent) to rename {{spark-avro}} external module to {{spark-sql-avro}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25278) Number of output rows metric of union of views is multiplied by their occurrences
[ https://issues.apache.org/jira/browse/SPARK-25278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-25278: Description: When you use a view in a union multiple times (self-union), the {{number of output rows}} metric seems to be the correct {{number of output rows}} multiplied by the occurrences of the view, e.g. {code:java} scala> spark.version res0: String = 2.3.1 val name = "demo_view" sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2") assert(spark.catalog.tableExists(name)) val view = spark.table(name) assert(view.count == 2) view.union(view).show // gives 4 for every view (as a LocalTableScan), but should be 2 view.union(view).union(view).show // gives 6{code} I think it's because {{View}} logical operator is a {{MultiInstanceRelation}} (and think other {{MultiInstanceRelations}} may also be affected). was: When you use a view in a union multiple times (self-union), the {{number of output rows}} metric seems to be the correct {{number of output rows}} multiplied by the occurrences of the view, e.g. {code:java} scala> spark.version res0: String = 2.3.1 val name = "demo_view" sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2") assert(spark.catalog.tableExists(name)) val view = spark.table(name) assert(view.count == 2) view.union(view).show // gives 4 for every view, but should be 2 view.union(view).union(view).show // gives 6{code} I think it's because {{View}} logical operator is a {{MultiInstanceRelation}} (and think other {{MultiInstanceRelations}} may also be affected). > Number of output rows metric of union of views is multiplied by their > occurrences > - > > Key: SPARK-25278 > URL: https://issues.apache.org/jira/browse/SPARK-25278 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Jacek Laskowski >Priority: Major > Attachments: union-2-views.png, union-3-views.png > > > When you use a view in a union multiple times (self-union), the {{number of > output rows}} metric seems to be the correct {{number of output rows}} > multiplied by the occurrences of the view, e.g. > {code:java} > scala> spark.version > res0: String = 2.3.1 > val name = "demo_view" > sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2") > assert(spark.catalog.tableExists(name)) > val view = spark.table(name) > assert(view.count == 2) > view.union(view).show // gives 4 for every view (as a LocalTableScan), but > should be 2 > view.union(view).union(view).show // gives 6{code} > I think it's because {{View}} logical operator is a {{MultiInstanceRelation}} > (and think other {{MultiInstanceRelations}} may also be affected). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25278) Number of output rows metric of union of views is multiplied by their occurrences
[ https://issues.apache.org/jira/browse/SPARK-25278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-25278: Description: When you use a view in a union multiple times (self-union), the {{number of output rows}} metric seems to be the correct {{number of output rows}} multiplied by the occurrences of the view, e.g. {code:java} scala> spark.version res0: String = 2.3.1 val name = "demo_view" sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2") assert(spark.catalog.tableExists(name)) val view = spark.table(name) assert(view.count == 2) view.union(view).show // gives 4 for every view, but should be 2 view.union(view).union(view).show // gives 6{code} I think it's because {{View}} logical operator is a {{MultiInstanceRelation}} (and think other {{MultiInstanceRelation}}s may also be affected). was: When you use a view in a union multiple times (self-union), the {{number of output rows}} metric seems to be the correct {{number of output rows}} multiplied by the occurrences of the view, e.g. {code:java} scala> spark.version res0: String = 2.3.1 val name = "demo_view" sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2") assert(spark.catalog.tableExists(name)) val view = spark.table(name) assert(view.count == 2) view.union(view).show // gives 4 for every view, but should be 2 view.union(view).union(view).show // gives 6{code} I think it's because {{View}} logical operator is a {{MultiInstanceRelation}} (and think other {{MultiInstanceRelation}}s may also be affected). > Number of output rows metric of union of views is multiplied by their > occurrences > - > > Key: SPARK-25278 > URL: https://issues.apache.org/jira/browse/SPARK-25278 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Jacek Laskowski >Priority: Major > Attachments: union-2-views.png, union-3-views.png > > > When you use a view in a union multiple times (self-union), the {{number of > output rows}} metric seems to be the correct {{number of output rows}} > multiplied by the occurrences of the view, e.g. > {code:java} > scala> spark.version > res0: String = 2.3.1 > val name = "demo_view" > sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2") > assert(spark.catalog.tableExists(name)) > val view = spark.table(name) > assert(view.count == 2) > view.union(view).show // gives 4 for every view, but should be 2 > view.union(view).union(view).show // gives 6{code} > I think it's because {{View}} logical operator is a {{MultiInstanceRelation}} > (and think other {{MultiInstanceRelation}}s may also be affected). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25278) Number of output rows metric of union of views is multiplied by their occurrences
[ https://issues.apache.org/jira/browse/SPARK-25278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-25278: Description: When you use a view in a union multiple times (self-union), the {{number of output rows}} metric seems to be the correct {{number of output rows}} multiplied by the occurrences of the view, e.g. {code:java} scala> spark.version res0: String = 2.3.1 val name = "demo_view" sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2") assert(spark.catalog.tableExists(name)) val view = spark.table(name) assert(view.count == 2) view.union(view).show // gives 4 for every view, but should be 2 view.union(view).union(view).show // gives 6{code} I think it's because {{View}} logical operator is a {{MultiInstanceRelation}} (and think other {{MultiInstanceRelations}} may also be affected). was: When you use a view in a union multiple times (self-union), the {{number of output rows}} metric seems to be the correct {{number of output rows}} multiplied by the occurrences of the view, e.g. {code:java} scala> spark.version res0: String = 2.3.1 val name = "demo_view" sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2") assert(spark.catalog.tableExists(name)) val view = spark.table(name) assert(view.count == 2) view.union(view).show // gives 4 for every view, but should be 2 view.union(view).union(view).show // gives 6{code} I think it's because {{View}} logical operator is a {{MultiInstanceRelation}} (and think other {{MultiInstanceRelation}}s may also be affected). > Number of output rows metric of union of views is multiplied by their > occurrences > - > > Key: SPARK-25278 > URL: https://issues.apache.org/jira/browse/SPARK-25278 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Jacek Laskowski >Priority: Major > Attachments: union-2-views.png, union-3-views.png > > > When you use a view in a union multiple times (self-union), the {{number of > output rows}} metric seems to be the correct {{number of output rows}} > multiplied by the occurrences of the view, e.g. > {code:java} > scala> spark.version > res0: String = 2.3.1 > val name = "demo_view" > sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2") > assert(spark.catalog.tableExists(name)) > val view = spark.table(name) > assert(view.count == 2) > view.union(view).show // gives 4 for every view, but should be 2 > view.union(view).union(view).show // gives 6{code} > I think it's because {{View}} logical operator is a {{MultiInstanceRelation}} > (and think other {{MultiInstanceRelations}} may also be affected). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25278) Number of output rows metric of union of views is multiplied by their occurrences
[ https://issues.apache.org/jira/browse/SPARK-25278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-25278: Attachment: union-3-views.png > Number of output rows metric of union of views is multiplied by their > occurrences > - > > Key: SPARK-25278 > URL: https://issues.apache.org/jira/browse/SPARK-25278 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Jacek Laskowski >Priority: Major > Attachments: union-2-views.png, union-3-views.png > > > When you use a view in a union multiple times (self-union), the {{number of > output rows}} metric seems to be the correct {{number of output rows}} > multiplied by the occurrences of the view, e.g. > {code:java} > scala> spark.version > res0: String = 2.3.1 > val name = "demo_view" > sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2") > assert(spark.catalog.tableExists(name)) > val view = spark.table(name) > assert(view.count == 2) > view.union(view).show // gives 4 for every view, but should be 2 > view.union(view).union(view).show // gives 6{code} > I think it's because {{View}} logical operator is a {{MultiInstanceRelation}} > (and think other {{MultiInstanceRelation}}s may also be affected). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25278) Number of output rows metric of union of views is multiplied by their occurrences
Jacek Laskowski created SPARK-25278: --- Summary: Number of output rows metric of union of views is multiplied by their occurrences Key: SPARK-25278 URL: https://issues.apache.org/jira/browse/SPARK-25278 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: Jacek Laskowski Attachments: union-2-views.png, union-3-views.png When you use a view in a union multiple times (self-union), the {{number of output rows}} metric seems to be the correct {{number of output rows}} multiplied by the occurrences of the view, e.g. {code:java} scala> spark.version res0: String = 2.3.1 val name = "demo_view" sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2") assert(spark.catalog.tableExists(name)) val view = spark.table(name) assert(view.count == 2) view.union(view).show // gives 4 for every view, but should be 2 view.union(view).union(view).show // gives 6{code} I think it's because {{View}} logical operator is a {{MultiInstanceRelation}} (and think other {{MultiInstanceRelation}}s may also be affected). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25278) Number of output rows metric of union of views is multiplied by their occurrences
[ https://issues.apache.org/jira/browse/SPARK-25278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-25278: Attachment: union-2-views.png > Number of output rows metric of union of views is multiplied by their > occurrences > - > > Key: SPARK-25278 > URL: https://issues.apache.org/jira/browse/SPARK-25278 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Jacek Laskowski >Priority: Major > Attachments: union-2-views.png, union-3-views.png > > > When you use a view in a union multiple times (self-union), the {{number of > output rows}} metric seems to be the correct {{number of output rows}} > multiplied by the occurrences of the view, e.g. > {code:java} > scala> spark.version > res0: String = 2.3.1 > val name = "demo_view" > sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2") > assert(spark.catalog.tableExists(name)) > val view = spark.table(name) > assert(view.count == 2) > view.union(view).show // gives 4 for every view, but should be 2 > view.union(view).union(view).show // gives 6{code} > I think it's because {{View}} logical operator is a {{MultiInstanceRelation}} > (and think other {{MultiInstanceRelation}}s may also be affected). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20597) KafkaSourceProvider falls back on path as synonym for topic
[ https://issues.apache.org/jira/browse/SPARK-20597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559412#comment-16559412 ] Jacek Laskowski commented on SPARK-20597: - Sorry [~Satyajit] for not responding earlier. I'd like to get back to it and am wondering if you'd still like to work on this ticket? I'd appreciate if you used a pull request to discuss issues if there are any. Thanks. > KafkaSourceProvider falls back on path as synonym for topic > --- > > Key: SPARK-20597 > URL: https://issues.apache.org/jira/browse/SPARK-20597 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Trivial > Labels: starter > > # {{KafkaSourceProvider}} supports {{topic}} option that sets the Kafka topic > to save a DataFrame's rows to > # {{KafkaSourceProvider}} can use {{topic}} column to assign rows to Kafka > topics for writing > What seems a quite interesting option is to support {{start(path: String)}} > as the least precedence option in which {{path}} would designate the default > topic when no other options are used. > {code} > df.writeStream.format("kafka").start("topic") > {code} > See > http://apache-spark-developers-list.1001551.n3.nabble.com/KafkaSourceProvider-Why-topic-option-and-column-without-reverting-to-path-as-the-least-priority-td21458.html > for discussion -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24899) Add example of monotonically_increasing_id standard function to scaladoc
Jacek Laskowski created SPARK-24899: --- Summary: Add example of monotonically_increasing_id standard function to scaladoc Key: SPARK-24899 URL: https://issues.apache.org/jira/browse/SPARK-24899 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 2.3.1 Reporter: Jacek Laskowski I think an example of {{monotonically_increasing_id}} standard function in scaladoc would help people understand why the function is monotonically increasing and unique but not consecutive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
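For reference, a sketch of the kind of example that could go into the scaladoc; the values shown are illustrative of the scheme (partition id in the upper 31 bits, per-partition record number in the lower 33 bits), which is why the IDs are increasing and unique but not consecutive:
{code:scala}
import org.apache.spark.sql.functions.monotonically_increasing_id

// Two partitions to make the jump between partitions visible.
val df = spark.range(start = 0, end = 4, step = 1, numPartitions = 2)
  .withColumn("id_gen", monotonically_increasing_id())

df.show()
// +---+----------+
// | id|    id_gen|
// +---+----------+
// |  0|         0|
// |  1|         1|
// |  2|8589934592|   <- partition 1 starts at 1L << 33
// |  3|8589934593|
// +---+----------+
{code}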
[jira] [Updated] (SPARK-24408) Move abs function to math_funcs group
[ https://issues.apache.org/jira/browse/SPARK-24408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-24408: Summary: Move abs function to math_funcs group (was: Move abs, bitwiseNOT, isnan, nanvl functions to math_funcs group) > Move abs function to math_funcs group > - > > Key: SPARK-24408 > URL: https://issues.apache.org/jira/browse/SPARK-24408 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > A few math functions ( {{abs}} , {{bitwiseNOT}}, {{isnan}}, {{nanvl}}) are > not in {{math_funcs}} group. They should really be. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24490) Use WebUI.addStaticHandler in web UIs
Jacek Laskowski created SPARK-24490: --- Summary: Use WebUI.addStaticHandler in web UIs Key: SPARK-24490 URL: https://issues.apache.org/jira/browse/SPARK-24490 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.3.0 Reporter: Jacek Laskowski {{WebUI}} defines {{addStaticHandler}} that web UIs don't use (and simply introduce duplication). Let's clean them up and remove duplications. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24408) Move abs, bitwiseNOT, isnan, nanvl functions to math_funcs group
Jacek Laskowski created SPARK-24408: --- Summary: Move abs, bitwiseNOT, isnan, nanvl functions to math_funcs group Key: SPARK-24408 URL: https://issues.apache.org/jira/browse/SPARK-24408 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 2.3.0 Reporter: Jacek Laskowski A few math functions ( {{abs}} , {{bitwiseNOT}}, {{isnan}}, {{nanvl}}) are not in {{math_funcs}} group. They should really be. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24025) Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side
[ https://issues.apache.org/jira/browse/SPARK-24025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16445520#comment-16445520 ] Jacek Laskowski commented on SPARK-24025: - it seems related or duplicated > Join of bucketed and non-bucketed tables can give two exchanges and sorts for > non-bucketed side > --- > > Key: SPARK-24025 > URL: https://issues.apache.org/jira/browse/SPARK-24025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 > Environment: {code:java} > ./bin/spark-shell --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.3.0 > /_/ > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_162 > Branch master > Compiled by user sameera on 2018-02-22T19:24:29Z > Revision a0d7949896e70f427e7f3942ff340c9484ff0aab > Url g...@github.com:sameeragarwal/spark.git > Type --help for more information.{code} >Reporter: Jacek Laskowski >Priority: Major > Attachments: join-jira.png > > > While exploring bucketing I found the following join query of non-bucketed > and bucketed tables that ends up with two exchanges and two sorts in the > physical plan for the non-bucketed join side. > {code} > // Make sure that you don't end up with a BroadcastHashJoin and a > BroadcastExchange > // Disable auto broadcasting > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) > val bucketedTableName = "bucketed_4_id" > val large = spark.range(100) > large.write > .bucketBy(4, "id") > .sortBy("id") > .mode("overwrite") > .saveAsTable(bucketedTableName) > // Describe the table and include bucketing spec only > val descSQL = sql(s"DESC FORMATTED > $bucketedTableName").filter($"col_name".contains("Bucket") || $"col_name" === > "Sort Columns") > scala> descSQL.show(truncate = false) > +--+-+---+ > |col_name |data_type|comment| > +--+-+---+ > |Num Buckets |4| | > |Bucket Columns|[`id`] | | > |Sort Columns |[`id`] | | > +--+-+---+ > val bucketedTable = spark.table(bucketedTableName) > val t1 = spark.range(4) > .repartition(2, $"id") // Use just 2 partitions > .sortWithinPartitions("id") // sort partitions > val q = t1.join(bucketedTable, "id") > // Note two exchanges and sorts > scala> q.explain > == Physical Plan == > *(5) Project [id#79L] > +- *(5) SortMergeJoin [id#79L], [id#77L], Inner >:- *(3) Sort [id#79L ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#79L, 4) >: +- *(2) Sort [id#79L ASC NULLS FIRST], false, 0 >:+- Exchange hashpartitioning(id#79L, 2) >: +- *(1) Range (0, 4, step=1, splits=8) >+- *(4) Sort [id#77L ASC NULLS FIRST], false, 0 > +- *(4) Project [id#77L] > +- *(4) Filter isnotnull(id#77L) > +- *(4) FileScan parquet default.bucketed_4_id[id#77L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id], > PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: > struct > q.foreach(_ => ()) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24025) Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side
[ https://issues.apache.org/jira/browse/SPARK-24025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16445512#comment-16445512 ] Jacek Laskowski commented on SPARK-24025: - I was about to have closed this as a duplicate, but I'm not so sure anymore. Reading the following in SPARK-17570: {quote}If the number of buckets in the output table is a factor of the buckets in the input table, we should be able to avoid `Sort` and `Exchange` and directly join those. {quote} I'm no longer so sure that the two issues are perfectly identical, but they're close "relatives". I think the next step would be to write a test case to reproduce this issue and the fairly simple fix would be to have an physical optimization (rule) that would eliminate one extra Exchange and Sort ops. Who'd help me here? > Join of bucketed and non-bucketed tables can give two exchanges and sorts for > non-bucketed side > --- > > Key: SPARK-24025 > URL: https://issues.apache.org/jira/browse/SPARK-24025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 > Environment: {code:java} > ./bin/spark-shell --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.3.0 > /_/ > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_162 > Branch master > Compiled by user sameera on 2018-02-22T19:24:29Z > Revision a0d7949896e70f427e7f3942ff340c9484ff0aab > Url g...@github.com:sameeragarwal/spark.git > Type --help for more information.{code} >Reporter: Jacek Laskowski >Priority: Major > Attachments: join-jira.png > > > While exploring bucketing I found the following join query of non-bucketed > and bucketed tables that ends up with two exchanges and two sorts in the > physical plan for the non-bucketed join side. 
> {code} > // Make sure that you don't end up with a BroadcastHashJoin and a > BroadcastExchange > // Disable auto broadcasting > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) > val bucketedTableName = "bucketed_4_id" > val large = spark.range(100) > large.write > .bucketBy(4, "id") > .sortBy("id") > .mode("overwrite") > .saveAsTable(bucketedTableName) > // Describe the table and include bucketing spec only > val descSQL = sql(s"DESC FORMATTED > $bucketedTableName").filter($"col_name".contains("Bucket") || $"col_name" === > "Sort Columns") > scala> descSQL.show(truncate = false) > +--+-+---+ > |col_name |data_type|comment| > +--+-+---+ > |Num Buckets |4| | > |Bucket Columns|[`id`] | | > |Sort Columns |[`id`] | | > +--+-+---+ > val bucketedTable = spark.table(bucketedTableName) > val t1 = spark.range(4) > .repartition(2, $"id") // Use just 2 partitions > .sortWithinPartitions("id") // sort partitions > val q = t1.join(bucketedTable, "id") > // Note two exchanges and sorts > scala> q.explain > == Physical Plan == > *(5) Project [id#79L] > +- *(5) SortMergeJoin [id#79L], [id#77L], Inner >:- *(3) Sort [id#79L ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#79L, 4) >: +- *(2) Sort [id#79L ASC NULLS FIRST], false, 0 >:+- Exchange hashpartitioning(id#79L, 2) >: +- *(1) Range (0, 4, step=1, splits=8) >+- *(4) Sort [id#77L ASC NULLS FIRST], false, 0 > +- *(4) Project [id#77L] > +- *(4) Filter isnotnull(id#77L) > +- *(4) FileScan parquet default.bucketed_4_id[id#77L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id], > PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: > struct > q.foreach(_ => ()) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24025) Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side
[ https://issues.apache.org/jira/browse/SPARK-24025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444998#comment-16444998 ] Jacek Laskowski commented on SPARK-24025: - The other issue seems similar. > Join of bucketed and non-bucketed tables can give two exchanges and sorts for > non-bucketed side > --- > > Key: SPARK-24025 > URL: https://issues.apache.org/jira/browse/SPARK-24025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 > Environment: {code:java} > ./bin/spark-shell --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.3.0 > /_/ > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_162 > Branch master > Compiled by user sameera on 2018-02-22T19:24:29Z > Revision a0d7949896e70f427e7f3942ff340c9484ff0aab > Url g...@github.com:sameeragarwal/spark.git > Type --help for more information.{code} >Reporter: Jacek Laskowski >Priority: Major > Attachments: join-jira.png > > > While exploring bucketing I found the following join query of non-bucketed > and bucketed tables that ends up with two exchanges and two sorts in the > physical plan for the non-bucketed join side. > {code} > // Make sure that you don't end up with a BroadcastHashJoin and a > BroadcastExchange > // Disable auto broadcasting > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) > val bucketedTableName = "bucketed_4_id" > val large = spark.range(100) > large.write > .bucketBy(4, "id") > .sortBy("id") > .mode("overwrite") > .saveAsTable(bucketedTableName) > // Describe the table and include bucketing spec only > val descSQL = sql(s"DESC FORMATTED > $bucketedTableName").filter($"col_name".contains("Bucket") || $"col_name" === > "Sort Columns") > scala> descSQL.show(truncate = false) > +--+-+---+ > |col_name |data_type|comment| > +--+-+---+ > |Num Buckets |4| | > |Bucket Columns|[`id`] | | > |Sort Columns |[`id`] | | > +--+-+---+ > val bucketedTable = spark.table(bucketedTableName) > val t1 = spark.range(4) > .repartition(2, $"id") // Use just 2 partitions > .sortWithinPartitions("id") // sort partitions > val q = t1.join(bucketedTable, "id") > // Note two exchanges and sorts > scala> q.explain > == Physical Plan == > *(5) Project [id#79L] > +- *(5) SortMergeJoin [id#79L], [id#77L], Inner >:- *(3) Sort [id#79L ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#79L, 4) >: +- *(2) Sort [id#79L ASC NULLS FIRST], false, 0 >:+- Exchange hashpartitioning(id#79L, 2) >: +- *(1) Range (0, 4, step=1, splits=8) >+- *(4) Sort [id#77L ASC NULLS FIRST], false, 0 > +- *(4) Project [id#77L] > +- *(4) Filter isnotnull(id#77L) > +- *(4) FileScan parquet default.bucketed_4_id[id#77L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id], > PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: > struct > q.foreach(_ => ()) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24025) Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side
[ https://issues.apache.org/jira/browse/SPARK-24025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-24025: Attachment: join-jira.png > Join of bucketed and non-bucketed tables can give two exchanges and sorts for > non-bucketed side > --- > > Key: SPARK-24025 > URL: https://issues.apache.org/jira/browse/SPARK-24025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 > Environment: {code:java} > ./bin/spark-shell --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.3.0 > /_/ > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_162 > Branch master > Compiled by user sameera on 2018-02-22T19:24:29Z > Revision a0d7949896e70f427e7f3942ff340c9484ff0aab > Url g...@github.com:sameeragarwal/spark.git > Type --help for more information.{code} >Reporter: Jacek Laskowski >Priority: Major > Attachments: join-jira.png > > > While exploring bucketing I found the following join query of non-bucketed > and bucketed tables that ends up with two exchanges and two sorts in the > physical plan for the non-bucketed join side. > {code} > // Make sure that you don't end up with a BroadcastHashJoin and a > BroadcastExchange > // Disable auto broadcasting > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) > val bucketedTableName = "bucketed_4_id" > val large = spark.range(100) > large.write > .bucketBy(4, "id") > .sortBy("id") > .mode("overwrite") > .saveAsTable(bucketedTableName) > // Describe the table and include bucketing spec only > val descSQL = sql(s"DESC FORMATTED > $bucketedTableName").filter($"col_name".contains("Bucket") || $"col_name" === > "Sort Columns") > scala> descSQL.show(truncate = false) > +--+-+---+ > |col_name |data_type|comment| > +--+-+---+ > |Num Buckets |4| | > |Bucket Columns|[`id`] | | > |Sort Columns |[`id`] | | > +--+-+---+ > val bucketedTable = spark.table(bucketedTableName) > val t1 = spark.range(4) > .repartition(2, $"id") // Use just 2 partitions > .sortWithinPartitions("id") // sort partitions > val q = t1.join(bucketedTable, "id") > // Note two exchanges and sorts > scala> q.explain > == Physical Plan == > *(5) Project [id#79L] > +- *(5) SortMergeJoin [id#79L], [id#77L], Inner >:- *(3) Sort [id#79L ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#79L, 4) >: +- *(2) Sort [id#79L ASC NULLS FIRST], false, 0 >:+- Exchange hashpartitioning(id#79L, 2) >: +- *(1) Range (0, 4, step=1, splits=8) >+- *(4) Sort [id#77L ASC NULLS FIRST], false, 0 > +- *(4) Project [id#77L] > +- *(4) Filter isnotnull(id#77L) > +- *(4) FileScan parquet default.bucketed_4_id[id#77L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id], > PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: > struct > q.foreach(_ => ()) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24025) Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side
Jacek Laskowski created SPARK-24025: --- Summary: Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side Key: SPARK-24025 URL: https://issues.apache.org/jira/browse/SPARK-24025 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0, 2.3.1 Environment: {code:java} ./bin/spark-shell --version Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.3.0 /_/ Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_162 Branch master Compiled by user sameera on 2018-02-22T19:24:29Z Revision a0d7949896e70f427e7f3942ff340c9484ff0aab Url g...@github.com:sameeragarwal/spark.git Type --help for more information.{code} Reporter: Jacek Laskowski While exploring bucketing I found the following join query of non-bucketed and bucketed tables that ends up with two exchanges and two sorts in the physical plan for the non-bucketed join side. {code} // Make sure that you don't end up with a BroadcastHashJoin and a BroadcastExchange // Disable auto broadcasting spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) val bucketedTableName = "bucketed_4_id" val large = spark.range(100) large.write .bucketBy(4, "id") .sortBy("id") .mode("overwrite") .saveAsTable(bucketedTableName) // Describe the table and include bucketing spec only val descSQL = sql(s"DESC FORMATTED $bucketedTableName").filter($"col_name".contains("Bucket") || $"col_name" === "Sort Columns") scala> descSQL.show(truncate = false) +--+-+---+ |col_name |data_type|comment| +--+-+---+ |Num Buckets |4| | |Bucket Columns|[`id`] | | |Sort Columns |[`id`] | | +--+-+---+ val bucketedTable = spark.table(bucketedTableName) val t1 = spark.range(4) .repartition(2, $"id") // Use just 2 partitions .sortWithinPartitions("id") // sort partitions val q = t1.join(bucketedTable, "id") // Note two exchanges and sorts scala> q.explain == Physical Plan == *(5) Project [id#79L] +- *(5) SortMergeJoin [id#79L], [id#77L], Inner :- *(3) Sort [id#79L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(id#79L, 4) : +- *(2) Sort [id#79L ASC NULLS FIRST], false, 0 :+- Exchange hashpartitioning(id#79L, 2) : +- *(1) Range (0, 4, step=1, splits=8) +- *(4) Sort [id#77L ASC NULLS FIRST], false, 0 +- *(4) Project [id#77L] +- *(4) Filter isnotnull(id#77L) +- *(4) FileScan parquet default.bucketed_4_id[id#77L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct q.foreach(_ => ()) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23830) Spark on YARN in cluster deploy mode fail with NullPointerException when a Spark application is a Scala class not object
[ https://issues.apache.org/jira/browse/SPARK-23830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16442488#comment-16442488 ] Jacek Laskowski commented on SPARK-23830: - It's about how easy it is to find out that the issue is `class` vs `object`. If that's just a single change that could be reported to end users to help them I think it's worth it. > Spark on YARN in cluster deploy mode fail with NullPointerException when a > Spark application is a Scala class not object > > > Key: SPARK-23830 > URL: https://issues.apache.org/jira/browse/SPARK-23830 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > As reported on StackOverflow in [Why does Spark on YARN fail with “Exception > in thread ”Driver“ > java.lang.NullPointerException”?|https://stackoverflow.com/q/49564334/1305344] > the following Spark application fails with {{Exception in thread "Driver" > java.lang.NullPointerException}} with Spark on YARN in cluster deploy mode: > {code} > class MyClass { > def main(args: Array[String]): Unit = { > val c = new MyClass() > c.process() > } > def process(): Unit = { > val sparkConf = new SparkConf().setAppName("my-test") > val sparkSession: SparkSession = > SparkSession.builder().config(sparkConf).getOrCreate() > import sparkSession.implicits._ > > } > ... > } > {code} > The exception is as follows: > {code} > 18/03/29 20:07:52 INFO ApplicationMaster: Starting the user application in a > separate Thread > 18/03/29 20:07:52 INFO ApplicationMaster: Waiting for spark context > initialization... > Exception in thread "Driver" java.lang.NullPointerException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637) > {code} > I think the reason for the exception {{Exception in thread "Driver" > java.lang.NullPointerException}} is due to [the following > code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L700-L701]: > {code} > val mainMethod = userClassLoader.loadClass(args.userClass) > .getMethod("main", classOf[Array[String]]) > {code} > So when {{mainMethod}} is used in [the following > code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L706] > it simply gives NPE. > {code} > mainMethod.invoke(null, userArgs.toArray) > {code} > That could be easily avoided with an extra check if the {{mainMethod}} is > initialized and give a user a message what may have been a reason. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23830) Spark on YARN in cluster deploy mode fail with NullPointerException when a Spark application is a Scala class not object
Jacek Laskowski created SPARK-23830: --- Summary: Spark on YARN in cluster deploy mode fail with NullPointerException when a Spark application is a Scala class not object Key: SPARK-23830 URL: https://issues.apache.org/jira/browse/SPARK-23830 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 2.3.0 Reporter: Jacek Laskowski As reported on StackOverflow in [Why does Spark on YARN fail with “Exception in thread ”Driver“ java.lang.NullPointerException”?|https://stackoverflow.com/q/49564334/1305344] the following Spark application fails with {{Exception in thread "Driver" java.lang.NullPointerException}} with Spark on YARN in cluster deploy mode: {code} class MyClass { def main(args: Array[String]): Unit = { val c = new MyClass() c.process() } def process(): Unit = { val sparkConf = new SparkConf().setAppName("my-test") val sparkSession: SparkSession = SparkSession.builder().config(sparkConf).getOrCreate() import sparkSession.implicits._ } ... } {code} The exception is as follows: {code} 18/03/29 20:07:52 INFO ApplicationMaster: Starting the user application in a separate Thread 18/03/29 20:07:52 INFO ApplicationMaster: Waiting for spark context initialization... Exception in thread "Driver" java.lang.NullPointerException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637) {code} I think the reason for the exception {{Exception in thread "Driver" java.lang.NullPointerException}} is due to [the following code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L700-L701]: {code} val mainMethod = userClassLoader.loadClass(args.userClass) .getMethod("main", classOf[Array[String]]) {code} So when {{mainMethod}} is used in [the following code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L706] it simply gives NPE. {code} mainMethod.invoke(null, userArgs.toArray) {code} That could be easily avoided with an extra check whether {{mainMethod}} is initialized, and a message to the user about what may have been the reason. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
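For completeness, a minimal sketch of the working form (the name {{MyApp}} is illustrative): since {{ApplicationMaster}} looks up {{main}} reflectively and invokes it with a {{null}} receiver, the entry point has to be a Scala {{object}} (which exposes {{main}} as a static method), not a {{class}}:
{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Defining the entry point as an object makes `main` a static method on the JVM,
// so mainMethod.invoke(null, ...) in ApplicationMaster succeeds.
object MyApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("my-test")
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()
    import spark.implicits._
    // ... application logic ...
    spark.stop()
  }
}
{code}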
[jira] [Created] (SPARK-23731) FileSourceScanExec throws NullPointerException in subexpression elimination
Jacek Laskowski created SPARK-23731: --- Summary: FileSourceScanExec throws NullPointerException in subexpression elimination Key: SPARK-23731 URL: https://issues.apache.org/jira/browse/SPARK-23731 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0, 2.2.1, 2.3.1 Reporter: Jacek Laskowski While working with a SQL with many {{CASE WHEN}} and {{ScalarSubqueries}} I faced the following exception (in Spark 2.3.0): {code:java} Caused by: java.lang.NullPointerException at org.apache.spark.sql.execution.FileSourceScanExec.(DataSourceScanExec.scala:167) at org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:502) at org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:158) at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210) at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224) at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210) at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224) at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210) at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224) at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210) at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209) at org.apache.spark.sql.catalyst.plans.QueryPlan.sameResult(QueryPlan.scala:257) at org.apache.spark.sql.execution.ScalarSubquery.semanticEquals(subquery.scala:58) at 
org.apache.spark.sql.catalyst.expressions.EquivalentExpressions$Expr.equals(EquivalentExpressions.scala:36) at scala.collection.mutable.HashTable$class.elemEquals(HashTable.scala:358) at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:40) at scala.collection.mutable.HashTable$class.scala$collection$mutable$HashTable$$findEntry0(HashTable.scala:136) at scala.collection.mutable.HashTable$class.findEntry(HashTable.scala:132) at scala.collection.mutable.HashMap.findEntry(HashMap.scala:40) at scala.collection.mutable.HashMap.get(HashMap.scala:70) at org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExpr(EquivalentExpressions.scala:54) at org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExprTree(EquivalentExpressions.scala:95) at org.apache.spark.sql.catalyst.expressions.EquivalentExpressions$$anonfun$addExprTree$1.apply(EquivalentExpressions.scala:96) at org.apache.spark.sql.catalyst.expressions.EquivalentExpressions$$anonfun$addExprTree$1.apply(EquivalentExpressions.scala:96) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExprTree(EquivalentExpressions.scala:96) at org.apache.spark.sql.cataly
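The description above cuts off mid stack trace. Purely as an illustration of the shape of query being described (many {{CASE WHEN}} branches mixed with scalar subqueries over a file-based relation), here is a minimal, hypothetical example; it is not claimed to reproduce the NPE:
{code:scala}
// Hypothetical repro-style query: CASE WHEN conditions containing scalar subqueries
// over a Parquet (file-based) table, the combination described in the report.
spark.range(5).write.mode("overwrite").parquet("/tmp/t1")
spark.read.parquet("/tmp/t1").createOrReplaceTempView("t1")

val q = spark.sql("""
  SELECT id,
         CASE WHEN id > (SELECT max(id) / 2 FROM t1) THEN 'high'
              WHEN id > (SELECT avg(id) FROM t1)     THEN 'mid'
              ELSE 'low'
         END AS bucket
  FROM t1
""")
q.explain(true)
q.show()
{code}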
[jira] [Commented] (SPARK-20536) Extend ColumnName to create StructFields with explicit nullable
[ https://issues.apache.org/jira/browse/SPARK-20536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398899#comment-16398899 ] Jacek Laskowski commented on SPARK-20536: - I'm not sure how meaningful it still is, but given it's still open, it could be fixed. > Extend ColumnName to create StructFields with explicit nullable > --- > > Key: SPARK-20536 > URL: https://issues.apache.org/jira/browse/SPARK-20536 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Trivial > > {{ColumnName}} defines methods to create {{StructFields}}. > It'd be very user-friendly if there were methods to create {{StructFields}} > with explicit {{nullable}} property (currently implicitly {{true}}). > That could look as follows: > {code} > // E.g. def int: StructField = StructField(name, IntegerType) > def int(nullable: Boolean): StructField = StructField(name, IntegerType, > nullable) > // or (untested) > def int(nullable: Boolean): StructField = int.copy(nullable = nullable) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
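To make the proposal above concrete, a small usage sketch; {{int(nullable: Boolean)}} and similar overloads are the proposed API, not something that exists today:
{code:scala}
import spark.implicits._            // provides the $"..." (ColumnName) syntax
import org.apache.spark.sql.types._

// today: nullable is implicitly true
val id: StructField = $"id".int

// proposed: explicit nullable (hypothetical overload)
val idNotNull: StructField = $"id".int(nullable = false)

val schema = StructType(Seq(idNotNull, $"name".string))
{code}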
[jira] [Created] (SPARK-23229) Dataset.hint should use planWithBarrier logical plan
Jacek Laskowski created SPARK-23229: --- Summary: Dataset.hint should use planWithBarrier logical plan Key: SPARK-23229 URL: https://issues.apache.org/jira/browse/SPARK-23229 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.1 Reporter: Jacek Laskowski Every time {{Dataset.hint}} is used, it triggers execution of logical commands, their unions, and hint resolution (among other things the analyzer does). {{hint}} should use {{planWithBarrier}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22457) Tables are supposed to be MANAGED only taking into account whether a path is provided
[ https://issues.apache.org/jira/browse/SPARK-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16326959#comment-16326959 ] Jacek Laskowski commented on SPARK-22457: - That should be fairly easy to fix _iff_ we want to restrict the formats to {{FileFormat}} (that the mentioned formats are subtypes of). Care to submit a pull request with the places where {{path}} is used to limit their scope to {{FileFormats}} only? (that would help draw more attention to the issue). > Tables are supposed to be MANAGED only taking into account whether a path is > provided > - > > Key: SPARK-22457 > URL: https://issues.apache.org/jira/browse/SPARK-22457 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: David Arroyo >Priority: Major > > As far as I know, since Spark 2.2, tables are supposed to be MANAGED only > taking into account whether a path is provided: > {code:java} > val tableType = if (storage.locationUri.isDefined) { > CatalogTableType.EXTERNAL > } else { > CatalogTableType.MANAGED > } > {code} > This solution seems to be right for filesystem based data sources. On the > other hand, when working with other data sources such as elasticsearch, that > solution is leading to a weird behaviour described below: > 1) InMemoryCatalog's doCreateTable() adds a locationURI if > CatalogTableType.MANAGED && tableDefinition.storage.locationUri.isEmpty. > 2) Before loading the data source table FindDataSourceTable's > readDataSourceTable() adds a path option if locationURI exists: > {code:java} > val pathOption = table.storage.locationUri.map("path" -> > CatalogUtils.URIToString(_)) > {code} > 3) That causes an error when reading from elasticsearch because 'path' is an > option already supported by elasticsearch (locationUri is set to > file:/home/user/spark-rv/elasticsearch/shop/clients) > org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot find > mapping for file:/home/user/spark-rv/elasticsearch/shop/clients - one is > required before using Spark SQL > Would be possible only to mark tables as MANAGED for a subset of data sources > (TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE) or think about any other solution? > P.S. InMemoryCatalog' doDropTable() deletes the directory of the table which > from my point of view should only be required for filesystem based data > sources: > {code:java} >if (tableMeta.tableType == CatalogTableType.MANAGED) >... >// Delete the data/directory of the table > val dir = new Path(tableMeta.location) > try { > val fs = dir.getFileSystem(hadoopConfig) > fs.delete(dir, true) > } catch { > case e: IOException => > throw new SparkException(s"Unable to drop table $table as failed > " + > s"to delete its directory $dir", e) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
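A rough, untested sketch of the idea from the comment above, i.e. deciding the table type based on whether the data source is file-based at all; {{providingClass}} is assumed to be the resolved data source class (as {{DataSource}} keeps it), so treat this as an illustration rather than the actual change:
{code:scala}
import org.apache.spark.sql.catalyst.catalog.CatalogTableType
import org.apache.spark.sql.execution.datasources.FileFormat

// Only file-based sources without an explicit location become MANAGED; everything else
// (e.g. elasticsearch) stays EXTERNAL, so no locationUri/path gets injected for it.
def tableTypeFor(providingClass: Class[_], hasLocation: Boolean): CatalogTableType = {
  val isFileBased = classOf[FileFormat].isAssignableFrom(providingClass)
  if (hasLocation || !isFileBased) CatalogTableType.EXTERNAL else CatalogTableType.MANAGED
}
{code}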
[jira] [Created] (SPARK-22954) ANALYZE TABLE fails with NoSuchTableException for temporary tables (but should have reported "not supported on views")
Jacek Laskowski created SPARK-22954: --- Summary: ANALYZE TABLE fails with NoSuchTableException for temporary tables (but should have reported "not supported on views") Key: SPARK-22954 URL: https://issues.apache.org/jira/browse/SPARK-22954 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Environment: {code} $ ./bin/spark-shell --version Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT /_/ Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_152 Branch master Compiled by user jacek on 2018-01-04T05:44:05Z Revision 7d045c5f00e2c7c67011830e2169a4e130c3ace8 {code} Reporter: Jacek Laskowski Priority: Minor {{ANALYZE TABLE}} fails with {{NoSuchTableException: Table or view 'names' not found in database 'default';}} for temporary tables (views) while the reason is that it can only work with permanent tables (which [it can report|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala#L38] if it had a chance). {code} scala> names.createOrReplaceTempView("names") scala> sql("ANALYZE TABLE names COMPUTE STATISTICS") org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'names' not found in database 'default'; at org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:181) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:398) at org.apache.spark.sql.execution.command.AnalyzeTableCommand.run(AnalyzeTableCommand.scala:36) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:187) at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:187) at org.apache.spark.sql.Dataset$$anonfun$51.apply(Dataset.scala:3244) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3243) at org.apache.spark.sql.Dataset.(Dataset.scala:187) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:72) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638) ... 50 elided {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
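A rough sketch (not the actual change) of how {{AnalyzeTableCommand}} could report the clearer error first, with {{tableIdent}} being the identifier given to ANALYZE TABLE; it assumes the session catalog's {{isTemporaryTable}} check is available at that point:
{code:scala}
import org.apache.spark.sql.AnalysisException

// Check for a temporary view before looking the table up in the current database,
// so the user gets "not supported on views" instead of NoSuchTableException.
if (sessionState.catalog.isTemporaryTable(tableIdent)) {
  throw new AnalysisException("ANALYZE TABLE is not supported on views.")
}
val tableMeta = sessionState.catalog.getTableMetadata(tableIdent)
{code}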
[jira] [Commented] (SPARK-22935) Dataset with Java Beans for java.sql.Date throws CompileException
[ https://issues.apache.org/jira/browse/SPARK-22935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308363#comment-16308363 ] Jacek Laskowski commented on SPARK-22935: - It does not seem to be the case as described in https://stackoverflow.com/q/48026060/1305344 where the OP wanted to {{inferSchema}} with no values that would be of {{Date}} or {{Timestamp}} and hence Spark SQL infers strings. But...I think all's fine though as I said at SO: {quote} TL;DR Define the schema explicitly since the input dataset does not have values to infer types from (for java.sql.Date fields). {quote} I think you can close the issue as {{Invalid}}. > Dataset with Java Beans for java.sql.Date throws CompileException > - > > Key: SPARK-22935 > URL: https://issues.apache.org/jira/browse/SPARK-22935 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Kazuaki Ishizaki > > The following code can throw an exception with or without whole-stage codegen. > {code} > public void SPARK22935() { > Dataset cdr = spark > .read() > .format("csv") > .option("header", "true") > .option("inferSchema", "true") > .option("delimiter", ";") > .csv("CDR_SAMPLE.csv") > .as(Encoders.bean(CDR.class)); > Dataset ds = cdr.filter((FilterFunction) x -> (x.timestamp != > null)); > long c = ds.count(); > cdr.show(2); > ds.show(2); > System.out.println("cnt=" + c); > } > // CDR.java > public class CDR implements java.io.Serializable { > public java.sql.Date timestamp; > public java.sql.Date getTimestamp() { return this.timestamp; } > public void setTimestamp(java.sql.Date timestamp) { this.timestamp = > timestamp; } > } > // CDR_SAMPLE.csv > timestamp > 2017-10-29T02:37:07.815Z > 2017-10-29T02:38:07.815Z > {code} > result > {code} > 12:17:10.352 ERROR > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 61, Column 70: No applicable constructor/method found > for actual parameters "long"; candidates are: "public static java.sql.Date > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)" > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 61, Column 70: No applicable constructor/method found for actual parameters > "long"; candidates are: "public static java.sql.Date > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)" > at > org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821) > ... > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
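A small sketch of the workaround from the SO answer quoted above: define the schema explicitly instead of relying on {{inferSchema}} (column and file names follow the report; the date parsing options may still need adjusting to the actual data):
{code:scala}
import org.apache.spark.sql.types._

// Explicit schema: the timestamp column is declared as DateType (matching the
// java.sql.Date bean field) instead of being inferred as a string.
val schema = StructType(Seq(StructField("timestamp", DateType)))

val cdr = spark.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", ";")
  .schema(schema)            // instead of .option("inferSchema", "true")
  .load("CDR_SAMPLE.csv")
{code}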
[jira] [Commented] (SPARK-22929) Short name for "kafka" doesn't work in pyspark with packages
[ https://issues.apache.org/jira/browse/SPARK-22929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307149#comment-16307149 ] Jacek Laskowski commented on SPARK-22929: - When I saw the issue I was very surprised, as that's perhaps one of the most often used data sources on...StackOverflow :) But then I'm not using pyspark (so it may be something about how `spark-submit` handles Python that got hosed). > Short name for "kafka" doesn't work in pyspark with packages > > > Key: SPARK-22929 > URL: https://issues.apache.org/jira/browse/SPARK-22929 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Michael Armbrust >Priority: Critical > > When I start pyspark using the following command: > {code} > bin/pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 > {code} > The following throws an error: > {code} > spark.read.format("kakfa")... > py4j.protocol.Py4JJavaError: An error occurred while calling o35.load. > : java.lang.ClassNotFoundException: Failed to find data source: kakfa. Please > find packages at http://spark.apache.org/third-party-projects.html > {code} > The following does work: > {code} > spark.read.format("org.apache.spark.sql.kafka010.KafkaSourceProvider")... > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22048) Show id, runId, batch in Description column in SQL tab for streaming queries (as in Jobs)
[ https://issues.apache.org/jira/browse/SPARK-22048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-22048: Attachment: webui-jobs-description.png webui-sql-description.png > Show id, runId, batch in Description column in SQL tab for streaming queries > (as in Jobs) > - > > Key: SPARK-22048 > URL: https://issues.apache.org/jira/browse/SPARK-22048 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Minor > Attachments: webui-jobs-description.png, webui-sql-description.png > > > web UI's Jobs tab shows {{id}}, {{runId}} and {{batch}} of every streaming > batch (which is very handy), but think SQL tab would benefit from it too > (perhaps even more given that's the tab where SQL queries are displayed). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22048) Show id, runId, batch in Description column in SQL tab for streaming queries (as in Jobs)
Jacek Laskowski created SPARK-22048: --- Summary: Show id, runId, batch in Description column in SQL tab for streaming queries (as in Jobs) Key: SPARK-22048 URL: https://issues.apache.org/jira/browse/SPARK-22048 Project: Spark Issue Type: Improvement Components: SQL, Web UI Affects Versions: 2.2.0 Reporter: Jacek Laskowski Priority: Minor The web UI's Jobs tab shows {{id}}, {{runId}} and {{batch}} of every streaming batch (which is very handy), but I think the SQL tab would benefit from it too (perhaps even more, given that's the tab where SQL queries are displayed). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22044) explain function with codegen and cost parameters
Jacek Laskowski created SPARK-22044: --- Summary: explain function with codegen and cost parameters Key: SPARK-22044 URL: https://issues.apache.org/jira/browse/SPARK-22044 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Jacek Laskowski Priority: Minor {{explain}} operator creates {{ExplainCommand}} runnable command that accepts (among other things) {{codegen}} and {{cost}} arguments. There's no version of {{explain}} to allow for this. That's however possible using SQL which is kind of surprising (given how much focus is devoted to the Dataset API). This is to have another {{explain}} with {{codegen}} and {{cost}} arguments, i.e. {code} def explain(codegen: Boolean = false, cost: Boolean = false): Unit {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
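For reference, the SQL route that already works today, plus a rough sketch of what the proposed {{Dataset}} overload could look like (the overload is hypothetical; it only assumes {{ExplainCommand}} already takes the {{codegen}} and {{cost}} flags, as stated above):
{code:scala}
// Works today through SQL:
spark.sql("EXPLAIN CODEGEN SELECT * FROM range(5)").show(false)
spark.sql("EXPLAIN COST SELECT * FROM range(5)").show(false)

// Proposed Dataset API (sketch only):
// def explain(codegen: Boolean = false, cost: Boolean = false): Unit = {
//   val cmd = ExplainCommand(queryExecution.logical,
//     extended = false, codegen = codegen, cost = cost)
//   sparkSession.sessionState.executePlan(cmd).executedPlan
//     .executeCollect().foreach(row => println(row.getString(0)))
// }
{code}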
[jira] [Commented] (SPARK-22040) current_date function with timezone id
[ https://issues.apache.org/jira/browse/SPARK-22040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169215#comment-16169215 ] Jacek Laskowski commented on SPARK-22040: - That'd be awesome! It's yours, [~mgaido] > current_date function with timezone id > -- > > Key: SPARK-22040 > URL: https://issues.apache.org/jira/browse/SPARK-22040 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Minor > > {{current_date}} function creates {{CurrentDate}} expression that accepts > optional timezone id, but there's no function to allow for this. > This is to have another {{current_date}} with the timezone id, i.e. > {code} > def current_date(timeZoneId: String): Column > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22040) current_date function with timezone id
Jacek Laskowski created SPARK-22040: --- Summary: current_date function with timezone id Key: SPARK-22040 URL: https://issues.apache.org/jira/browse/SPARK-22040 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Jacek Laskowski Priority: Minor {{current_date}} function creates {{CurrentDate}} expression that accepts optional timezone id, but there's no function to allow for this. This is to have another {{current_date}} with the timezone id, i.e. {code} def current_date(timeZoneId: String): Column {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
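A rough sketch of the proposed overload (it does not exist today); it only relies on {{CurrentDate}} taking an optional timezone id, as described above, and on the private {{withExpr}} helper that the other functions in {{functions.scala}} use:
{code:scala}
// Sketch of the proposed function (inside org.apache.spark.sql.functions):
def current_date(timeZoneId: String): Column = withExpr {
  CurrentDate(Some(timeZoneId))
}

// Intended usage (hypothetical until the overload exists):
// df.select(current_date("Europe/Warsaw"))
{code}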
[jira] [Created] (SPARK-21901) Define toString for StateOperatorProgress
Jacek Laskowski created SPARK-21901: --- Summary: Define toString for StateOperatorProgress Key: SPARK-21901 URL: https://issues.apache.org/jira/browse/SPARK-21901 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.2.0 Reporter: Jacek Laskowski Priority: Trivial {{StateOperatorProgress}} should define its own {{toString}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
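One possible shape for it, as a sketch only; the field names are assumed from the metrics {{StateOperatorProgress}} already exposes:
{code:scala}
// Sketch: render the state metrics instead of the default Object.toString.
override def toString: String =
  s"StateOperatorProgress(numRowsTotal=$numRowsTotal, numRowsUpdated=$numRowsUpdated)"
{code}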
[jira] [Created] (SPARK-21886) Use SparkSession.internalCreateDataFrame to create Dataset with LogicalRDD logical operator
Jacek Laskowski created SPARK-21886: --- Summary: Use SparkSession.internalCreateDataFrame to create Dataset with LogicalRDD logical operator Key: SPARK-21886 URL: https://issues.apache.org/jira/browse/SPARK-21886 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Jacek Laskowski Priority: Minor While exploring where {{LogicalRDD}} is created I noticed that there are a few places that beg for {{SparkSession.internalCreateDataFrame}}. The task is to simply re-use the method wherever possible. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
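An illustrative before/after of the re-use being proposed (a sketch only; the concrete call sites are not listed here):
{code:scala}
// before: building the Dataset around LogicalRDD by hand
// val df = Dataset.ofRows(sparkSession,
//   LogicalRDD(schema.toAttributes, catalystRows)(sparkSession))

// after: re-using the existing helper
// val df = sparkSession.internalCreateDataFrame(catalystRows, schema)
{code}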
[jira] [Updated] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-21728: Attachment: logging.patch sparksubmit.patch > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > Attachments: logging.patch, sparksubmit.patch > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147957#comment-16147957 ] Jacek Laskowski commented on SPARK-21728: - After I changed your change, I could see the logs again. No idea if the changes made sense or not, but see the logs and that counts :) I'm attaching my changes if these could help you somehow. > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147840#comment-16147840 ] Jacek Laskowski commented on SPARK-21728: - The idea behind the custom {{conf/log4j.properties}} is to disable all the logging and enable only {{org.apache.spark.sql.execution.streaming}} currently. {code} $ cat conf/log4j.properties # Set everything to be logged to the console log4j.rootCategory=OFF, console log4j.appender.console=org.apache.log4j.ConsoleAppender log4j.appender.console.target=System.err log4j.appender.console.layout=org.apache.log4j.PatternLayout log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n # Set the default spark-shell log level to WARN. When running the spark-shell, the # log level for this class is used to overwrite the root logger's log level, so that # the user can have different defaults for the shell and regular Spark apps. log4j.logger.org.apache.spark.repl.Main=WARN # Settings to quiet third party logs that are too verbose log4j.logger.org.spark_project.jetty=WARN log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO log4j.logger.org.apache.parquet=ERROR log4j.logger.parquet=ERROR # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR #log4j.logger.org.apache.spark=OFF log4j.logger.org.apache.spark.metrics.MetricsSystem=WARN # Structured Streaming log4j.logger.org.apache.spark.sql.execution.streaming.StreamExecution=DEBUG log4j.logger.org.apache.spark.sql.execution.streaming.ProgressReporter=INFO log4j.logger.org.apache.spark.sql.execution.streaming.RateStreamSource=DEBUG log4j.logger.org.apache.spark.sql.kafka010.KafkaSource=DEBUG log4j.logger.org.apache.spark.sql.kafka010.KafkaOffsetReader=DEBUG log4j.logger.org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec=INFO {code} > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. 
For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147643#comment-16147643 ] Jacek Laskowski commented on SPARK-21728: - Thanks [~vanzin] for the prompt response! I'm stuck with the change as {{conf/log4j.properties}} has no effect on logging and given the change touched it I think it's the root cause (I might be mistaken, but looking for help to find it). The following worked fine two days ago (not sure about yesterday's build). Is {{conf/log4j.properties}} still the file for logging? {code} # Structured Streaming log4j.logger.org.apache.spark.sql.execution.streaming.StreamExecution=DEBUG log4j.logger.org.apache.spark.sql.execution.streaming.ProgressReporter=INFO log4j.logger.org.apache.spark.sql.execution.streaming.RateStreamSource=DEBUG log4j.logger.org.apache.spark.sql.kafka010.KafkaSource=DEBUG log4j.logger.org.apache.spark.sql.kafka010.KafkaOffsetReader=DEBUG {code} > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147086#comment-16147086 ] Jacek Laskowski commented on SPARK-21728: - Thanks [~sowen]. I'll label it as such when I know if it merits one (looks so, but waiting for a response from [~vanzin] or others who'd know). Sent out an email to the Spark user mailing list today. > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146780#comment-16146780 ] Jacek Laskowski commented on SPARK-21728: - I think the change is user-visible and therefore deserves to be included in the release notes for 2.3 (I remember a component or label to mark changes like that in a special way) /cc [~sowen] [~hyukjin.kwon] > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21765) Ensure all leaf nodes that are derived from streaming sources have isStreaming=true
[ https://issues.apache.org/jira/browse/SPARK-21765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16142921#comment-16142921 ] Jacek Laskowski commented on SPARK-21765: - BTW, *Assignee* field of the JIRA is empty, but should be [~joseph.torres] if I'm not mistaken. > Ensure all leaf nodes that are derived from streaming sources have > isStreaming=true > --- > > Key: SPARK-21765 > URL: https://issues.apache.org/jira/browse/SPARK-21765 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jose Torres > Fix For: 3.0.0 > > > LogicalPlan has an isStreaming bit, but it's incompletely implemented. Some > streaming sources don't set the bit, and the bit can sometimes be lost in > rewriting. Setting the bit for all plans that are logically streaming will > help us simplify the logic around checking query plan validity. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21765) Ensure all leaf nodes that are derived from streaming sources have isStreaming=true
[ https://issues.apache.org/jira/browse/SPARK-21765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16142919#comment-16142919 ] Jacek Laskowski commented on SPARK-21765: - I think {{TextSocketSource}} was missed in the change and does not enable {{isStreaming}} flag leading to the following error: {code} val counts = spark. readStream. format("rate"). load. groupBy(window($"timestamp", "5 seconds") as "group"). agg(count("value") as "value_count") import scala.concurrent.duration._ import org.apache.spark.sql.streaming.{OutputMode, Trigger} val sq = counts. writeStream. format("console"). option("truncate", false). trigger(Trigger.ProcessingTime(1.hour)). outputMode(OutputMode.Complete). start 17/08/26 21:16:20 ERROR StreamExecution: Query [id = 980bfeba-5433-49db-9d20-055c965b25f3, runId = b711948f-4def-4205-ac43-ec1e5b852fdb] terminated with error java.lang.AssertionError: assertion failed: DataFrame returned by getBatch from TextSocketSource[host: localhost, port: ] did not have isStreaming=true Project [_1#66 AS value#71] +- Project [_1#66] +- LocalRelation [_1#66, _2#67] at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$10.apply(StreamExecution.scala:641) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$10.apply(StreamExecution.scala:636) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2.apply(StreamExecution.scala:636) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2.apply(StreamExecution.scala:636) at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:274) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:60) at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:635) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(StreamExecution.scala:313) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:301) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:301) at 
org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:274) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:60) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:301) at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56) at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:297) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:213) {code} The change seems simple and {{TextSocketSource.getBatch}} should be as follows: {code} val rdd = sqlContext.sparkContext.parallelize(rawList).map(v => InternalRow(v)) val rawBatch = sqlContext.internalCreateDataFrame(rdd, schema, isStreaming = true) {code} > Ensure all leaf nodes that are derived from streaming sources have > isStreaming=true > --- > >
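One nit on the sketch above: {{InternalRow}} stores string columns as {{UTF8String}}, so the raw value likely needs converting first; a slightly expanded, still untested variant:
{code:scala}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.unsafe.types.UTF8String

val rdd = sqlContext.sparkContext
  .parallelize(rawList)
  .map(v => InternalRow(UTF8String.fromString(v)))
val rawBatch = sqlContext.internalCreateDataFrame(rdd, schema, isStreaming = true)
{code}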
[jira] [Commented] (SPARK-21667) ConsoleSink should not fail streaming query with checkpointLocation option
[ https://issues.apache.org/jira/browse/SPARK-21667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120576#comment-16120576 ] Jacek Laskowski commented on SPARK-21667: - Oh, what an offer! Couldn't have thought of a better one today :) Let me see how far my hope to fix it leads me. I'm on it. > ConsoleSink should not fail streaming query with checkpointLocation option > -- > > Key: SPARK-21667 > URL: https://issues.apache.org/jira/browse/SPARK-21667 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Minor > > As agreed on the Spark users mailing list in the thread "\[SS] Console sink > not supporting recovering from checkpoint location? Why?" in which > [~marmbrus] said: > {quote} > I think there is really no good reason for this limitation. > {quote} > Using {{ConsoleSink}} should therefore not fail a streaming query when used > with {{checkpointLocation}} option. > {code} > // today's build from the master > scala> spark.version > res8: String = 2.3.0-SNAPSHOT > scala> val q = records. > | writeStream. > | format("console"). > | option("truncate", false). > | option("checkpointLocation", "/tmp/checkpoint"). // <-- > checkpoint directory > | trigger(Trigger.ProcessingTime(10.seconds)). > | outputMode(OutputMode.Update). > | start > org.apache.spark.sql.AnalysisException: This query does not support > recovering from checkpoint location. Delete /tmp/checkpoint/offsets to start > over.; > at > org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:222) > at > org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278) > at > org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:284) > ... 61 elided > {code} > The "trigger" is SPARK-16116 and [this > line|https://github.com/apache/spark/pull/13817/files#diff-d35e8fce09686073f81de598ed657de7R277] > in particular. > This also relates to SPARK-19768 that was resolved as not a bug. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21667) ConsoleSink should not fail streaming query with checkpointLocation option
Jacek Laskowski created SPARK-21667: --- Summary: ConsoleSink should not fail streaming query with checkpointLocation option Key: SPARK-21667 URL: https://issues.apache.org/jira/browse/SPARK-21667 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Jacek Laskowski Priority: Minor As agreed on the Spark users mailing list in the thread "\[SS] Console sink not supporting recovering from checkpoint location? Why?" in which [~marmbrus] said: {quote} I think there is really no good reason for this limitation. {quote} Using {{ConsoleSink}} should therefore not fail a streaming query when used with {{checkpointLocation}} option. {code} // today's build from the master scala> spark.version res8: String = 2.3.0-SNAPSHOT scala> val q = records. | writeStream. | format("console"). | option("truncate", false). | option("checkpointLocation", "/tmp/checkpoint"). // <-- checkpoint directory | trigger(Trigger.ProcessingTime(10.seconds)). | outputMode(OutputMode.Update). | start org.apache.spark.sql.AnalysisException: This query does not support recovering from checkpoint location. Delete /tmp/checkpoint/offsets to start over.; at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:222) at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278) at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:284) ... 61 elided {code} The "trigger" is SPARK-16116 and [this line|https://github.com/apache/spark/pull/13817/files#diff-d35e8fce09686073f81de598ed657de7R277] in particular. This also relates to SPARK-19768 that was resolved as not a bug. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21546) dropDuplicates with watermark yields RuntimeException due to binding failure
[ https://issues.apache.org/jira/browse/SPARK-21546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-21546: Description: With today's master... The following streaming query with watermark and {{dropDuplicates}} yields {{RuntimeException}} due to failure in binding. {code} val topic1 = spark. readStream. format("kafka"). option("subscribe", "topic1"). option("kafka.bootstrap.servers", "localhost:9092"). option("startingoffsets", "earliest"). load val records = topic1. withColumn("eventtime", 'timestamp). // <-- just to put the right name given the purpose withWatermark(eventTime = "eventtime", delayThreshold = "30 seconds"). // <-- use the renamed eventtime column dropDuplicates("value"). // dropDuplicates will use watermark // only when eventTime column exists // include the watermark column => internal design leak? select('key cast "string", 'value cast "string", 'eventtime). as[(String, String, java.sql.Timestamp)] scala> records.explain == Physical Plan == *Project [cast(key#0 as string) AS key#169, cast(value#1 as string) AS value#170, eventtime#157-T3ms] +- StreamingDeduplicate [value#1], StatefulOperatorStateInfo(,93c3de98-3f85-41a4-8aef-d09caf8ea693,0,0), 0 +- Exchange hashpartitioning(value#1, 200) +- EventTimeWatermark eventtime#157: timestamp, interval 30 seconds +- *Project [key#0, value#1, timestamp#5 AS eventtime#157] +- StreamingRelation kafka, [key#0, value#1, topic#2, partition#3, offset#4L, timestamp#5, timestampType#6] import org.apache.spark.sql.streaming.{OutputMode, Trigger} val sq = records. writeStream. format("console"). option("truncate", false). trigger(Trigger.ProcessingTime("10 seconds")). queryName("from-kafka-topic1-to-console"). outputMode(OutputMode.Update). start {code} {code} --- Batch: 0 --- 17/07/27 10:28:58 ERROR Executor: Exception in task 3.0 in stage 13.0 (TID 438) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: eventtime#157-T3ms at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45) at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:977) at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:370) at org.apache.spark.sql.execution.streaming.StreamingDeduplicateExec.org$apache$spark$sql$execution$streaming$WatermarkSupport$$super$newPredicate(statefulOperators.scala:350) at org.apache.spark.sql.execution.streaming.WatermarkSupport$$anonfun$watermarkPredicateForKeys$1.apply(statefulOperators.scala:160) at org.apache.spark.sql.execution.streaming.WatermarkSupport$$anonfun$watermarkPredicateForKeys$1.apply(statefulOperators.scala:160) at scala.Option.map(Option.scala:146) at org.apache.spark.sql.execution.streaming.WatermarkSupport$class.watermarkPredicateForKeys(statefulOperators.scala:160) at org.apache.spark.sql.execution.streaming.StreamingDeduplicateExec.watermarkPredicateForKeys$lzycompute(statefulOperators.scala:350) at org.apache.spark.sql.exec
[jira] [Updated] (SPARK-21546) dropDuplicates with watermark yields RuntimeException due to binding failure
[ https://issues.apache.org/jira/browse/SPARK-21546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-21546: Summary: dropDuplicates with watermark yields RuntimeException due to binding failure (was: dropDuplicates followed by select yields RuntimeException due to binding failure) > dropDuplicates with watermark yields RuntimeException due to binding failure > > > Key: SPARK-21546 > URL: https://issues.apache.org/jira/browse/SPARK-21546 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski > > With today's master... > The following streaming query yields {{RuntimeException}} due to failure in > binding (most likely due to {{select}} operator). > {code} > val topic1 = spark. > readStream. > format("kafka"). > option("subscribe", "topic1"). > option("kafka.bootstrap.servers", "localhost:9092"). > option("startingoffsets", "earliest"). > load > val records = topic1. > withColumn("eventtime", 'timestamp). // <-- just to put the right name > given the purpose > withWatermark(eventTime = "eventtime", delayThreshold = "30 seconds"). // > <-- use the renamed eventtime column > dropDuplicates("value"). // dropDuplicates will use watermark > // only when eventTime column exists > // include the watermark column => internal design leak? > select('key cast "string", 'value cast "string", 'eventtime). > as[(String, String, java.sql.Timestamp)] > scala> records.explain > == Physical Plan == > *Project [cast(key#0 as string) AS key#169, cast(value#1 as string) AS > value#170, eventtime#157-T3ms] > +- StreamingDeduplicate [value#1], > StatefulOperatorStateInfo(,93c3de98-3f85-41a4-8aef-d09caf8ea693,0,0), > 0 >+- Exchange hashpartitioning(value#1, 200) > +- EventTimeWatermark eventtime#157: timestamp, interval 30 seconds > +- *Project [key#0, value#1, timestamp#5 AS eventtime#157] > +- StreamingRelation kafka, [key#0, value#1, topic#2, > partition#3, offset#4L, timestamp#5, timestampType#6] > import org.apache.spark.sql.streaming.{OutputMode, Trigger} > val sq = records. > writeStream. > format("console"). > option("truncate", false). > trigger(Trigger.ProcessingTime("10 seconds")). > queryName("from-kafka-topic1-to-console"). > outputMode(OutputMode.Update). 
> start > {code} > {code} > --- > Batch: 0 > --- > 17/07/27 10:28:58 ERROR Executor: Exception in task 3.0 in stage 13.0 (TID > 438) > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute, tree: eventtime#157-T3ms > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:977) > at > org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:370) > at > org.apache.spark.sql.execution.streaming.StreamingDeduplicateExec.org$ap
[jira] [Created] (SPARK-21546) dropDuplicates followed by select yields RuntimeException due to binding failure
Jacek Laskowski created SPARK-21546: --- Summary: dropDuplicates followed by select yields RuntimeException due to binding failure Key: SPARK-21546 URL: https://issues.apache.org/jira/browse/SPARK-21546 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Jacek Laskowski With today's master... The following streaming query yields {{RuntimeException}} due to failure in binding (most likely due to {{select}} operator). {code} val topic1 = spark. readStream. format("kafka"). option("subscribe", "topic1"). option("kafka.bootstrap.servers", "localhost:9092"). option("startingoffsets", "earliest"). load val records = topic1. withColumn("eventtime", 'timestamp). // <-- just to put the right name given the purpose withWatermark(eventTime = "eventtime", delayThreshold = "30 seconds"). // <-- use the renamed eventtime column dropDuplicates("value"). // dropDuplicates will use watermark // only when eventTime column exists // include the watermark column => internal design leak? select('key cast "string", 'value cast "string", 'eventtime). as[(String, String, java.sql.Timestamp)] scala> records.explain == Physical Plan == *Project [cast(key#0 as string) AS key#169, cast(value#1 as string) AS value#170, eventtime#157-T3ms] +- StreamingDeduplicate [value#1], StatefulOperatorStateInfo(,93c3de98-3f85-41a4-8aef-d09caf8ea693,0,0), 0 +- Exchange hashpartitioning(value#1, 200) +- EventTimeWatermark eventtime#157: timestamp, interval 30 seconds +- *Project [key#0, value#1, timestamp#5 AS eventtime#157] +- StreamingRelation kafka, [key#0, value#1, topic#2, partition#3, offset#4L, timestamp#5, timestampType#6] import org.apache.spark.sql.streaming.{OutputMode, Trigger} val sq = records. writeStream. format("console"). option("truncate", false). trigger(Trigger.ProcessingTime("10 seconds")). queryName("from-kafka-topic1-to-console"). outputMode(OutputMode.Update). 
start {code} {code} --- Batch: 0 --- 17/07/27 10:28:58 ERROR Executor: Exception in task 3.0 in stage 13.0 (TID 438) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: eventtime#157-T3ms at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:977) at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:370) at org.apache.spark.sql.execution.streaming.StreamingDeduplicateExec.org$apache$spark$sql$execution$streaming$WatermarkSupport$$super$newPredicate(statefulOperators.scala:350) at org.apache.spark.sql.execution.streaming.WatermarkSupport$$anonfun$watermarkPredicateForKeys$1.apply(statefulOperators.scala:160) at org.apache.spark.sql.execution.streaming.WatermarkSupport$$anonfun$watermarkPredicateForKeys$1.apply(statefulOperators.scala:160) at scala.Option.map(Option.scala:146) at org.apache.spark.sql.execution.streaming.WatermarkSupport$class.w
[jira] [Created] (SPARK-21429) show on structured Dataset is equivalent to writeStream to console once
Jacek Laskowski created SPARK-21429: --- Summary: show on structured Dataset is equivalent to writeStream to console once Key: SPARK-21429 URL: https://issues.apache.org/jira/browse/SPARK-21429 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.2.0 Reporter: Jacek Laskowski Priority: Minor While working with Datasets it's often helpful to do {{show}}. It does not work for streaming Datasets (and leads to {{AnalysisException}} - see below), but think it could just be the following under the covers and very helpful (would cut plenty of keystrokes for sure). {code} val sq = ... scala> sq.isStreaming res0: Boolean = true import org.apache.spark.sql.streaming.Trigger scala> sq.writeStream.format("console").trigger(Trigger.Once).start {code} Since {{show}} returns {{Unit}} that could just work. Currently {{show}} reports {{AnalysisException}}. {code} scala> sq.show org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();; rate at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297) at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36) at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126) at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:34) at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:63) at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:74) at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78) at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84) at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80) at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89) at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3027) at org.apache.spark.sql.Dataset.head(Dataset.scala:2340) at org.apache.spark.sql.Dataset.take(Dataset.scala:2553) at org.apache.spark.sql.Dataset.showString(Dataset.scala:241) at org.apache.spark.sql.Dataset.show(Dataset.scala:671) at org.apache.spark.sql.Dataset.show(Dataset.scala:630) at 
org.apache.spark.sql.Dataset.show(Dataset.scala:639) ... 50 elided {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
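For illustration, a minimal sketch of how the proposed behaviour could be layered on top of the existing API as a user-side extension method (the name {{showStream}} is hypothetical, not part of Spark):

{code:scala}
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.streaming.Trigger

// Hypothetical helper, not Spark's API: run a single Trigger.Once batch
// to the console sink and wait for it, mimicking show for streaming Datasets.
implicit class StreamingShowOps[T](ds: Dataset[T]) {
  def showStream(truncate: Boolean = true): Unit = {
    val query = ds.writeStream
      .format("console")
      .option("truncate", truncate.toString)
      .trigger(Trigger.Once)
      .start()
    query.awaitTermination()
  }
}
{code}

With something like this in scope, {{sq.showStream()}} would do what the issue asks {{sq.show}} to do under the covers.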
[jira] [Created] (SPARK-21427) Describe mapGroupsWithState and flatMapGroupsWithState for stateful aggregation in Structured Streaming
Jacek Laskowski created SPARK-21427: --- Summary: Describe mapGroupsWithState and flatMapGroupsWithState for stateful aggregation in Structured Streaming Key: SPARK-21427 URL: https://issues.apache.org/jira/browse/SPARK-21427 Project: Spark Issue Type: Improvement Components: Documentation, Structured Streaming Affects Versions: 2.2.1 Reporter: Jacek Laskowski # Rename "Arbitrary Stateful Operations" to "Stateful Aggregations" (see https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/structured-streaming-programming-guide.html#arbitrary-stateful-operations for the latest version) # Describe two operators {{mapGroupsWithState}} and {{flatMapGroupsWithState}} with examples. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
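For reference, a minimal example of the kind of snippet the guide could include for {{mapGroupsWithState}}; the {{Event}} type and the rate-source pipeline below are made up purely for illustration:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

val spark = SparkSession.builder.appName("stateful-demo").getOrCreate()
import spark.implicits._

// Illustrative event type: one record per key.
case class Event(key: String)

// Running count per key, kept in GroupState across micro-batches.
def countEvents(key: String, events: Iterator[Event], state: GroupState[Long]): (String, Long) = {
  val newCount = state.getOption.getOrElse(0L) + events.size
  state.update(newCount)
  (key, newCount)
}

val events = spark.readStream
  .format("rate")
  .load()
  .select(('value % 10).cast("string").as("key"))
  .as[Event]

val counts = events
  .groupByKey(_.key)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout)(countEvents)

counts.writeStream
  .format("console")
  .outputMode(OutputMode.Update())
  .start()
  .awaitTermination()
{code}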
[jira] [Created] (SPARK-21329) Make EventTimeWatermarkExec explicitly UnaryExecNode
Jacek Laskowski created SPARK-21329: --- Summary: Make EventTimeWatermarkExec explicitly UnaryExecNode Key: SPARK-21329 URL: https://issues.apache.org/jira/browse/SPARK-21329 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.1.1 Reporter: Jacek Laskowski Priority: Trivial Make {{EventTimeWatermarkExec}} explicitly {{UnaryExecNode}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21313) ConsoleSink's string representation
Jacek Laskowski created SPARK-21313: --- Summary: ConsoleSink's string representation Key: SPARK-21313 URL: https://issues.apache.org/jira/browse/SPARK-21313 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.1.1 Reporter: Jacek Laskowski Priority: Trivial Add {{toString}} with options for {{ConsoleSink}} so it shows nicely in query progress. *BEFORE* {code} "sink" : { "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@4b340441" } {code} *AFTER* {code} "sink" : { "description" : "ConsoleSink[numRows=10, truncate=false]" } {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
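A self-contained sketch of the idea (this is not Spark's actual {{ConsoleSink}}; it only shows deriving the two values from the sink options and exposing them via {{toString}}):

{code:scala}
// Toy stand-in for ConsoleSink, just to illustrate the proposed toString.
class ConsoleSinkLike(options: Map[String, String]) {
  private val numRows = options.getOrElse("numRows", "20").toInt
  private val truncate = options.getOrElse("truncate", "true").toBoolean

  override def toString: String = s"ConsoleSink[numRows=$numRows, truncate=$truncate]"
}

// new ConsoleSinkLike(Map("numRows" -> "10", "truncate" -> "false")).toString
// res: ConsoleSink[numRows=10, truncate=false]
{code}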
[jira] [Created] (SPARK-21285) VectorAssembler should report the column name when data type used is not supported
Jacek Laskowski created SPARK-21285: --- Summary: VectorAssembler should report the column name when data type used is not supported Key: SPARK-21285 URL: https://issues.apache.org/jira/browse/SPARK-21285 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 2.1.1 Reporter: Jacek Laskowski Priority: Minor Found while answering [Why does LogisticRegression fail with “IllegalArgumentException: org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7”?|https://stackoverflow.com/q/44844793/1305344] on StackOverflow. When {{VectorAssembler}} is configured to use columns of an unsupported type, only the type is printed out without the column name(s). The column name(s) should be included too. {code} // label is of StringType type val va = new VectorAssembler().setInputCols(Array("bc", "pmi", "label")) scala> va.transform(training) java.lang.IllegalArgumentException: Data type StringType is not supported. at org.apache.spark.ml.feature.VectorAssembler$$anonfun$transformSchema$1.apply(VectorAssembler.scala:121) at org.apache.spark.ml.feature.VectorAssembler$$anonfun$transformSchema$1.apply(VectorAssembler.scala:117) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:117) at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74) at org.apache.spark.ml.feature.VectorAssembler.transform(VectorAssembler.scala:54) ... 48 elided {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
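A hedged sketch (not the actual {{VectorAssembler}} code) of a schema check that reports the offending column name together with its type:

{code:scala}
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.{BooleanType, NumericType, StructType}

// Sketch only: validate assembler inputs and name the offending column.
def validateInputCols(schema: StructType, inputCols: Array[String]): Unit =
  inputCols.foreach { c =>
    schema(c).dataType match {
      case _: NumericType => // supported
      case BooleanType => // supported
      case dt if dt == SQLDataTypes.VectorType => // supported
      case other =>
        throw new IllegalArgumentException(
          s"Data type $other of column $c is not supported.")
    }
  }
{code}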
[jira] [Commented] (SPARK-20597) KafkaSourceProvider falls back on path as synonym for topic
[ https://issues.apache.org/jira/browse/SPARK-20597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070865#comment-16070865 ] Jacek Laskowski commented on SPARK-20597: - Go for it, [~Satyajit]! > KafkaSourceProvider falls back on path as synonym for topic > --- > > Key: SPARK-20597 > URL: https://issues.apache.org/jira/browse/SPARK-20597 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Trivial > Labels: starter > > # {{KafkaSourceProvider}} supports {{topic}} option that sets the Kafka topic > to save a DataFrame's rows to > # {{KafkaSourceProvider}} can use {{topic}} column to assign rows to Kafka > topics for writing > What seems a quite interesting option is to support {{start(path: String)}} > as the least precedence option in which {{path}} would designate the default > topic when no other options are used. > {code} > df.writeStream.format("kafka").start("topic") > {code} > See > http://apache-spark-developers-list.1001551.n3.nabble.com/KafkaSourceProvider-Why-topic-option-and-column-without-reverting-to-path-as-the-least-priority-td21458.html > for discussion -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
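A rough sketch of the proposed precedence (names are illustrative, not {{KafkaSourceProvider}}'s actual internals): a per-row {{topic}} column would win, then the {{topic}} option, and only then the {{path}} argument.

{code:scala}
// Illustrative only: resolve the default topic from the "topic" option,
// falling back on the path given to start(path).
def resolveDefaultTopic(options: Map[String, String], path: Option[String]): Option[String] =
  options.get("topic").orElse(path)

// resolveDefaultTopic(Map.empty, Some("topic1"))      // Some(topic1)
// resolveDefaultTopic(Map("topic" -> "t"), Some("x")) // Some(t)
{code}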
[jira] [Commented] (SPARK-20997) spark-submit's --driver-cores marked as "YARN-only" but listed under "Spark standalone with cluster deploy mode only"
[ https://issues.apache.org/jira/browse/SPARK-20997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040594#comment-16040594 ] Jacek Laskowski commented on SPARK-20997: - Go ahead! Thanks [~guoxiaolongzte]! > spark-submit's --driver-cores marked as "YARN-only" but listed under "Spark > standalone with cluster deploy mode only" > - > > Key: SPARK-20997 > URL: https://issues.apache.org/jira/browse/SPARK-20997 > Project: Spark > Issue Type: Bug > Components: Documentation, Spark Submit >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > Just noticed that {{spark-submit}} describes {{--driver-cores}} under: > * Spark standalone with cluster deploy mode only > * YARN-only > While I can understand "only" in "Spark standalone with cluster deploy mode > only" to refer to cluster deploy mode (not the default client mode), but > YARN-only baffles me which I think deserves a fix. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20997) spark-submit's --driver-cores marked as "YARN-only" but listed under "Spark standalone with cluster deploy mode only"
Jacek Laskowski created SPARK-20997: --- Summary: spark-submit's --driver-cores marked as "YARN-only" but listed under "Spark standalone with cluster deploy mode only" Key: SPARK-20997 URL: https://issues.apache.org/jira/browse/SPARK-20997 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.3.0 Reporter: Jacek Laskowski Priority: Minor Just noticed that {{spark-submit}} describes {{--driver-cores}} under: * Spark standalone with cluster deploy mode only * YARN-only While I can understand "only" in "Spark standalone with cluster deploy mode only" to refer to the cluster deploy mode (not the default client mode), "YARN-only" baffles me, which I think deserves a fix. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20782) Dataset's isCached operator
[ https://issues.apache.org/jira/browse/SPARK-20782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035316#comment-16035316 ] Jacek Laskowski commented on SPARK-20782: - Just stumbled upon {{CatalogImpl.isCached}} that could also be used to implement this feature. > Dataset's isCached operator > --- > > Key: SPARK-20782 > URL: https://issues.apache.org/jira/browse/SPARK-20782 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > It'd be very convenient to have {{isCached}} operator that would say whether > a query is cached in-memory or not. > It'd be as simple as the following snippet: > {code} > // val q2: DataFrame > spark.sharedState.cacheManager.lookupCachedData(q2.queryExecution.logical).isDefined > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20937) Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, DataFrames and Datasets Guide
[ https://issues.apache.org/jira/browse/SPARK-20937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-20937: Description: As a follow-up to SPARK-20297 (and SPARK-10400) in which {{spark.sql.parquet.writeLegacyFormat}} property was recommended for Impala and Hive, Spark SQL docs for [Parquet Files|https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration] should have it documented. p.s. It was asked about in [Why can't Impala read parquet files after Spark SQL's write?|https://stackoverflow.com/q/44279870/1305344] on StackOverflow today. p.s. It's also covered in [~holden.ka...@gmail.com]'s "High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark" book (in Table 3-10. Parquet data source options) that gives the option some wider publicity. was: As a follow-up to SPARK-20297 (and SPARK-10400) in which {{spark.sql.parquet.writeLegacyFormat}} property was recommended for Impala and Hive, Spark SQL docs for [Parquet Files|https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration] should have it documented. p.s. It's also covered in [~holden.ka...@gmail.com]'s "High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark" book (in Table 3-10. Parquet data source options) that gives the option some wider publicity. > Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, > DataFrames and Datasets Guide > - > > Key: SPARK-20937 > URL: https://issues.apache.org/jira/browse/SPARK-20937 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > As a follow-up to SPARK-20297 (and SPARK-10400) in which > {{spark.sql.parquet.writeLegacyFormat}} property was recommended for Impala > and Hive, Spark SQL docs for [Parquet > Files|https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration] > should have it documented. > p.s. It was asked about in [Why can't Impala read parquet files after Spark > SQL's write?|https://stackoverflow.com/q/44279870/1305344] on StackOverflow > today. > p.s. It's also covered in [~holden.ka...@gmail.com]'s "High Performance > Spark: Best Practices for Scaling and Optimizing Apache Spark" book (in Table > 3-10. Parquet data source options) that gives the option some wider publicity. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20937) Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, DataFrames and Datasets Guide
Jacek Laskowski created SPARK-20937: --- Summary: Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, DataFrames and Datasets Guide Key: SPARK-20937 URL: https://issues.apache.org/jira/browse/SPARK-20937 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 2.3.0 Reporter: Jacek Laskowski Priority: Trivial As a follow-up to SPARK-20297 (and SPARK-10400) in which {{spark.sql.parquet.writeLegacyFormat}} property was recommended for Impala and Hive, Spark SQL docs for [Parquet Files|https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration] should have it documented. p.s. It's also covered in [~holden.ka...@gmail.com]'s "High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark" book (in Table 3-10. Parquet data source options) that gives the option some wider publicity. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
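For the docs, an example along these lines could show the property in action; the path is made up and the decimal column is only there to make the legacy layout relevant:

{code:scala}
import org.apache.spark.sql.functions.col

// Illustrative snippet: write Parquet in the legacy format so that
// older Impala/Hive readers can consume it.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

spark.range(5)
  .withColumn("price", col("id").cast("decimal(10,2)"))
  .write
  .parquet("/tmp/parquet-legacy-format")
{code}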
[jira] [Resolved] (SPARK-20865) caching dataset throws "Queries with streaming sources must be executed with writeStream.start()"
[ https://issues.apache.org/jira/browse/SPARK-20865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski resolved SPARK-20865. - Resolution: Won't Fix Fix Version/s: 2.3.0 2.2.0 {{cache}} is not allowed due to its eager execution. > caching dataset throws "Queries with streaming sources must be executed with > writeStream.start()" > - > > Key: SPARK-20865 > URL: https://issues.apache.org/jira/browse/SPARK-20865 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 2.0.2, 2.1.0, 2.1.1 >Reporter: Martin Brišiak > Fix For: 2.2.0, 2.3.0 > > > {code} > SparkSession > .builder > .master("local[*]") > .config("spark.sql.warehouse.dir", "C:/tmp/spark") > .config("spark.sql.streaming.checkpointLocation", > "C:/tmp/spark/spark-checkpoint") > .appName("my-test") > .getOrCreate > .readStream > .schema(schema) > .json("src/test/data") > .cache > .writeStream > .start > .awaitTermination > {code} > While executing this sample in spark got error. Without the .cache option it > worked as intended but with .cache option i got: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries > with streaming sources must be executed with writeStream.start();; > FileSource[src/test/data] at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:196) > at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35) > at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:33) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128) at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:33) > at > org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:58) > at > org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:69) > at > org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:67) > at > org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:73) > at > org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:73) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:79) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:84) > at > org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:102) > at > org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:65) > at > org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:89) > at org.apache.spark.sql.Dataset.persist(Dataset.scala:2479) at > org.apache.spark.sql.Dataset.cache(Dataset.scala:2489) at > org.me.App$.main(App.scala:23) at org.me.App.main(App.scala) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20927) Add cache operator to Unsupported Operations in Structured Streaming
Jacek Laskowski created SPARK-20927: --- Summary: Add cache operator to Unsupported Operations in Structured Streaming Key: SPARK-20927 URL: https://issues.apache.org/jira/browse/SPARK-20927 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Jacek Laskowski Priority: Trivial Just [found out|https://stackoverflow.com/questions/42062092/why-does-using-cache-on-streaming-datasets-fail-with-analysisexception-queries] that {{cache}} is not allowed on streaming datasets. {{cache}} on streaming datasets leads to the following exception: {code} scala> spark.readStream.text("files").cache org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();; FileSource[files] at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297) at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36) at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:34) at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:63) at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:74) at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78) at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84) at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80) at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89) at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89) at org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:104) at org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:68) at org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:92) at org.apache.spark.sql.Dataset.persist(Dataset.scala:2603) at org.apache.spark.sql.Dataset.cache(Dataset.scala:2613) ... 48 elided {code} It should be included in Structured Streaming's [Unsupported Operations|http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations]. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
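Until the docs call this out, a small defensive helper (user-side, using only the public {{isStreaming}} flag) avoids tripping over the exception:

{code:scala}
import org.apache.spark.sql.Dataset

// Only cache batch Datasets; caching a streaming Dataset throws AnalysisException.
def cacheIfBatch[T](ds: Dataset[T]): Dataset[T] =
  if (ds.isStreaming) ds else ds.cache()
{code}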
[jira] [Commented] (SPARK-20912) map function with columns as strings
[ https://issues.apache.org/jira/browse/SPARK-20912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16028285#comment-16028285 ] Jacek Laskowski commented on SPARK-20912: - I can and I did, but the point is that it's not consistent with the other functions like {{struct}} and {{array}} and often gets in the way. > map function with columns as strings > > > Key: SPARK-20912 > URL: https://issues.apache.org/jira/browse/SPARK-20912 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > There's only {{map}} function that accepts {{Column}} values only. It'd be > very helpful to have a variant that accepted {{String}} for columns like > {{array}} or {{struct}}. > {code} > scala> val kvs = Seq(("key", "value")).toDF("k", "v") > kvs: org.apache.spark.sql.DataFrame = [k: string, v: string] > scala> kvs.printSchema > root > |-- k: string (nullable = true) > |-- v: string (nullable = true) > scala> kvs.withColumn("map", map("k", "v")).show > :26: error: type mismatch; > found : String("k") > required: org.apache.spark.sql.Column >kvs.withColumn("map", map("k", "v")).show > ^ > :26: error: type mismatch; > found : String("v") > required: org.apache.spark.sql.Column >kvs.withColumn("map", map("k", "v")).show > ^ > // note $ to create Columns per string > // not very dev-friendly > scala> kvs.withColumn("map", map($"k", $"v")).show > +---+-+-+ > | k|v| map| > +---+-+-+ > |key|value|Map(key -> value)| > +---+-+-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20912) map function with columns as strings
[ https://issues.apache.org/jira/browse/SPARK-20912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16028208#comment-16028208 ] Jacek Laskowski commented on SPARK-20912: - Nope as it would create a map of the two literals but I'd like to create a map from the values in {{k}} and {{v}} columns. > map function with columns as strings > > > Key: SPARK-20912 > URL: https://issues.apache.org/jira/browse/SPARK-20912 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > There's only {{map}} function that accepts {{Column}} values only. It'd be > very helpful to have a variant that accepted {{String}} for columns like > {{array}} or {{struct}}. > {code} > scala> val kvs = Seq(("key", "value")).toDF("k", "v") > kvs: org.apache.spark.sql.DataFrame = [k: string, v: string] > scala> kvs.printSchema > root > |-- k: string (nullable = true) > |-- v: string (nullable = true) > scala> kvs.withColumn("map", map("k", "v")).show > :26: error: type mismatch; > found : String("k") > required: org.apache.spark.sql.Column >kvs.withColumn("map", map("k", "v")).show > ^ > :26: error: type mismatch; > found : String("v") > required: org.apache.spark.sql.Column >kvs.withColumn("map", map("k", "v")).show > ^ > // note $ to create Columns per string > // not very dev-friendly > scala> kvs.withColumn("map", map($"k", $"v")).show > +---+-+-+ > | k|v| map| > +---+-+-+ > |key|value|Map(key -> value)| > +---+-+-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20912) map function with columns as strings
Jacek Laskowski created SPARK-20912: --- Summary: map function with columns as strings Key: SPARK-20912 URL: https://issues.apache.org/jira/browse/SPARK-20912 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Jacek Laskowski Priority: Trivial There's only {{map}} function that accepts {{Column}} values only. It'd be very helpful to have a variant that accepted {{String}} for columns like {{array}} or {{struct}}. {code} scala> val kvs = Seq(("key", "value")).toDF("k", "v") kvs: org.apache.spark.sql.DataFrame = [k: string, v: string] scala> kvs.printSchema root |-- k: string (nullable = true) |-- v: string (nullable = true) scala> kvs.withColumn("map", map("k", "v")).show :26: error: type mismatch; found : String("k") required: org.apache.spark.sql.Column kvs.withColumn("map", map("k", "v")).show ^ :26: error: type mismatch; found : String("v") required: org.apache.spark.sql.Column kvs.withColumn("map", map("k", "v")).show ^ // note $ to create Columns per string // not very dev-friendly scala> kvs.withColumn("map", map($"k", $"v")).show +---+-+-+ | k|v| map| +---+-+-+ |key|value|Map(key -> value)| +---+-+-+ {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
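Until such a variant exists, a user-side stand-in (hypothetical name) can resolve the names and delegate to the existing {{map}} function:

{code:scala}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, map}

// Hypothetical helper, not Spark's API: accept column names and delegate
// to functions.map, which takes alternating key/value Columns.
def mapFromNames(colNames: String*): Column =
  map(colNames.map(col): _*)

// kvs.withColumn("map", mapFromNames("k", "v")).show
{code}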
[jira] [Created] (SPARK-20782) Dataset's isCached operator
Jacek Laskowski created SPARK-20782: --- Summary: Dataset's isCached operator Key: SPARK-20782 URL: https://issues.apache.org/jira/browse/SPARK-20782 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Jacek Laskowski Priority: Trivial It'd be very convenient to have {{isCached}} operator that would say whether a query is cached in-memory or not. It'd be as simple as the following snippet: {code} // val q2: DataFrame spark.sharedState.cacheManager.lookupCachedData(q2.queryExecution.logical).isDefined {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
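The snippet above, packaged as a (hypothetical) extension method; note it leans on the internal {{CacheManager}}, so this is only a sketch of the convenience being asked for:

{code:scala}
import org.apache.spark.sql.Dataset

// Hypothetical extension, not Spark's API: true if the query's logical plan
// has an entry in the session's CacheManager.
implicit class CacheStatusOps[T](ds: Dataset[T]) {
  def isCached: Boolean =
    ds.sparkSession.sharedState.cacheManager
      .lookupCachedData(ds.queryExecution.logical)
      .isDefined
}

// q2.isCached
{code}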
[jira] [Updated] (SPARK-4570) Add broadcast join to left semi join
[ https://issues.apache.org/jira/browse/SPARK-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-4570: --- Description: For now, spark use broadcast join instead of hash join to optimize {{inner join}} when the size of one side data did not reach the {{AUTO_BROADCASTJOIN_THRESHOLD}} However, Spark SQL will perform shuffle operations on each child relations while executing {{left semi join}} is more suitable for optimization with broadcast join. We are planning to create a {{BroadcastLeftSemiJoinHash}} to implement the broadcast join for {{left semi join}}. was: For now, spark use broadcast join instead of hash join to optimize {{inner join}} when the size of one side data did not reach the {{AUTO_BROADCASTJOIN_THRESHOLD}} However, Spark SQL will perform shuffle operations on each child relations while executing {{left semi join}} is more suitable for optimization with broadcast join. We are planning to create a{{BroadcastLeftSemiJoinHash}} to implement the broadcast join for {{left semi join}} > Add broadcast join to left semi join > - > > Key: SPARK-4570 > URL: https://issues.apache.org/jira/browse/SPARK-4570 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: XiaoJing wang >Assignee: XiaoJing wang >Priority: Minor > Fix For: 1.3.0 > > > For now, spark use broadcast join instead of hash join to optimize {{inner > join}} when the size of one side data did not reach the > {{AUTO_BROADCASTJOIN_THRESHOLD}} > However, Spark SQL will perform shuffle operations on each child relations > while executing {{left semi join}} is more suitable for optimization with > broadcast join. > We are planning to create a {{BroadcastLeftSemiJoinHash}} to implement the > broadcast join for {{left semi join}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
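For context, the shape of query this optimization targets, expressed with the current DataFrame API (the original work predates the {{broadcast}} hint, so this is purely illustrative):

{code:scala}
import org.apache.spark.sql.functions.broadcast

// Left semi join whose right side is small enough to broadcast.
val largeDF = spark.range(1000000L).toDF("id")
val smallDF = spark.range(10L).toDF("id")

val matched = largeDF.join(broadcast(smallDF), Seq("id"), "leftsemi")
// matched.explain  // should show a broadcast-based left semi join
{code}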
[jira] [Updated] (SPARK-20600) KafkaRelation should be pretty printed in web UI (Details for Query)
[ https://issues.apache.org/jira/browse/SPARK-20600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-20600: Description: Executing the following batch query gives the default stringified *internal JVM representation* of {{KafkaRelation}} in web UI (under Details for Query), i.e. http://localhost:4040/SQL/execution/?id=3 (<-- change the {{id}}). See the attachment. {code} spark. read. format("kafka"). option("subscribe", "topic1"). option("kafka.bootstrap.servers", "localhost:9092"). load. show {code} was: Executing the following batch query gives the default stringified/internal name of {{KafkaRelation}} in web UI (under Details for Query), i.e. http://localhost:4040/SQL/execution/?id=3 (<-- change the {{id}}). See the attachment. {code} spark. read. format("kafka"). option("subscribe", "topic1"). option("kafka.bootstrap.servers", "localhost:9092"). load. select('value cast "string"). write. csv("fromkafka.csv") {code} > KafkaRelation should be pretty printed in web UI (Details for Query) > > > Key: SPARK-20600 > URL: https://issues.apache.org/jira/browse/SPARK-20600 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Assignee: Jacek Laskowski >Priority: Trivial > Fix For: 2.2.0 > > Attachments: kafka-source-scan-webui.png > > > Executing the following batch query gives the default stringified *internal > JVM representation* of {{KafkaRelation}} in web UI (under Details for Query), > i.e. http://localhost:4040/SQL/execution/?id=3 (<-- change the {{id}}). See > the attachment. > {code} > spark. > read. > format("kafka"). > option("subscribe", "topic1"). > option("kafka.bootstrap.servers", "localhost:9092"). > load. > show > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20691) Difference between Storage Memory as seen internally and in web UI
Jacek Laskowski created SPARK-20691: --- Summary: Difference between Storage Memory as seen internally and in web UI Key: SPARK-20691 URL: https://issues.apache.org/jira/browse/SPARK-20691 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.3.0 Reporter: Jacek Laskowski I set Major priority as it's visible to a user. There's a difference in what the size of Storage Memory is managed internally and displayed to a user in web UI. I found it while answering [How does web UI calculate Storage Memory (in Executors tab)?|http://stackoverflow.com/q/43801062/1305344] on StackOverflow. In short (quoting the main parts), when you start a Spark app (say spark-shell) you see 912.3 MB RAM for Storage Memory: {code} $ ./bin/spark-shell --conf spark.driver.memory=2g ... 17/05/07 15:20:50 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.8:57177 with 912.3 MB RAM, BlockManagerId(driver, 192.168.1.8, 57177, None) {code} but in the web UI you'll see 956.6 MB due to the way the custom JavaScript function {{formatBytes}} in [utils.js|https://github.com/apache/spark/blob/master/core/src/main/resources/org/apache/spark/ui/static/utils.js#L40-L48] calculates the value. That translates to the following Scala code: {code} def formatBytes(bytes: Double) = { val k = 1000 val i = math.floor(math.log(bytes) / math.log(k)) val maxMemoryWebUI = bytes / math.pow(k, i) f"$maxMemoryWebUI%1.1f" } scala> println(formatBytes(maxMemory)) 956.6 {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
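The Scala translation above references a {{maxMemory}} value that is not defined in the snippet; a self-contained version (assuming the logged 912.3 MB is in binary megabytes) reproduces the two numbers:

{code:scala}
// Self-contained version of the snippet above.
def formatBytes(bytes: Double): String = {
  val k = 1000 // the web UI's JavaScript helper uses decimal units
  val i = math.floor(math.log(bytes) / math.log(k))
  val value = bytes / math.pow(k, i)
  f"$value%1.1f"
}

// Assumption: the 912.3 MB reported by BlockManagerMasterEndpoint is 912.3 MiB.
val maxMemory = 912.3 * 1024 * 1024

println(formatBytes(maxMemory)) // 956.6 -- the value shown in the Executors tab
{code}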
[jira] [Commented] (SPARK-20630) Thread Dump link available in Executors tab irrespective of spark.ui.threadDumpsEnabled
[ https://issues.apache.org/jira/browse/SPARK-20630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001259#comment-16001259 ] Jacek Laskowski commented on SPARK-20630: - Go [~ajbozarth], go! > Thread Dump link available in Executors tab irrespective of > spark.ui.threadDumpsEnabled > --- > > Key: SPARK-20630 > URL: https://issues.apache.org/jira/browse/SPARK-20630 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Minor > Attachments: spark-webui-executors-threadDump.png > > > Irrespective of {{spark.ui.threadDumpsEnabled}} property web UI's Executors > page displays *Thread Dump* column with an active link (that does nothing > though). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org