[jira] [Updated] (SPARK-44108) Cannot parse Type from german "umlaut"
[ https://issues.apache.org/jira/browse/SPARK-44108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Machado updated SPARK-44108: -- Description: Hello all, I have a client that has a column named : bfzgtäeil Spark cannot handle this. My test: {code:java} import org.apache.spark.sql.catalyst.parser.CatalystSqlParser import org.scalatest.funsuite.AnyFunSuite class HiveTest extends AnyFunSuite { test("test that Spark does not cut columns with ä") { val data = "bfzugtäeil:string" CatalystSqlParser.parseDataType(data) } } {code} I debugged it and I'm deep on the org.antlr.v4.runtime.Lexer class. Any ideas ? {code:java} == SQL ==bfzugtäeil:string--^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:306) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:143) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseDataType(ParseDriver.scala:41) at com.deutschebahn.zod.fvdl.commons.spark.app.captured.HiveTest2.$anonfun$new$1(HiveTest2.scala:13) {code} was: Hello all, I have a client that has a column named : bfzgtäeil Spark cannot handle this. My test: {code:java} import org.apache.spark.sql.catalyst.parser.CatalystSqlParser import org.scalatest.funsuite.AnyFunSuite class HiveTest extends AnyFunSuite { test("test that Spark does not cut columns with ä") { val data = "bfzugtäeil:string" CatalystSqlParser.parseDataType(data) } } {code} I debugged it and I'm deep on the org.antlr.v4.runtime.Lexer class. Any ideas ? > Cannot parse Type from german "umlaut" > -- > > Key: SPARK-44108 > URL: https://issues.apache.org/jira/browse/SPARK-44108 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Jorge Machado >Priority: Major > > Hello all, > > I have a client that has a column named : bfzgtäeil > Spark cannot handle this. 
My test: > > {code:java} > import org.apache.spark.sql.catalyst.parser.CatalystSqlParser > import org.scalatest.funsuite.AnyFunSuite > class HiveTest extends AnyFunSuite { > test("test that Spark does not cut columns with ä") { > val data = > "bfzugtäeil:string" > CatalystSqlParser.parseDataType(data) > } > } {code} > I debugged it and I'm deep on the org.antlr.v4.runtime.Lexer class. > Any ideas ? > > {code:java} > == SQL ==bfzugtäeil:string--^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:306) >at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:143) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseDataType(ParseDriver.scala:41) >at > com.deutschebahn.zod.fvdl.commons.spark.app.captured.HiveTest2.$anonfun$new$1(HiveTest2.scala:13) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44108) Cannot parse Type from german "umlaut"
[ https://issues.apache.org/jira/browse/SPARK-44108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Machado updated SPARK-44108: -- Priority: Major (was: Minor) > Cannot parse Type from german "umlaut" > -- > > Key: SPARK-44108 > URL: https://issues.apache.org/jira/browse/SPARK-44108 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Jorge Machado >Priority: Major > > Hello all, > > I have a client that has a column named : bfzgtäeil > Spark cannot handle this. My test: > > {code:java} > import org.apache.spark.sql.catalyst.parser.CatalystSqlParser > import org.scalatest.funsuite.AnyFunSuite > class HiveTest extends AnyFunSuite { > test("test that Spark does not cut columns with ä") { > val data = > "bfzugtäeil:string" > CatalystSqlParser.parseDataType(data) > } > } {code} > I debugged it and I'm deep on the org.antlr.v4.runtime.Lexer class. > Any ideas ? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44108) Cannot parse Type from german "umlaut"
[ https://issues.apache.org/jira/browse/SPARK-44108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Machado updated SPARK-44108: -- Priority: Minor (was: Critical) > Cannot parse Type from german "umlaut" > -- > > Key: SPARK-44108 > URL: https://issues.apache.org/jira/browse/SPARK-44108 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Jorge Machado >Priority: Minor > > Hello all, > > I have a client that has a column named : bfzgtäeil > Spark cannot handle this. My test: > > {code:java} > import org.apache.spark.sql.catalyst.parser.CatalystSqlParser > import org.scalatest.funsuite.AnyFunSuite > class HiveTest extends AnyFunSuite { > test("test that Spark does not cut columns with ä") { > val data = > "bfzugtäeil:string" > CatalystSqlParser.parseDataType(data) > } > } {code} > I debugged it and I'm deep on the org.antlr.v4.runtime.Lexer class. > Any ideas ? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44108) Cannot parse Type from german "umlaut"
Jorge Machado created SPARK-44108: - Summary: Cannot parse Type from german "umlaut" Key: SPARK-44108 URL: https://issues.apache.org/jira/browse/SPARK-44108 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: Jorge Machado Hello all, I have a client that has a column named : bfzgtäeil Spark cannot handle this. My test: {code:java} import org.apache.spark.sql.catalyst.parser.CatalystSqlParser import org.scalatest.funsuite.AnyFunSuite class HiveTest extends AnyFunSuite { test("test that Spark does not cut columns with ä") { val data = "bfzugtäeil:string" CatalystSqlParser.parseDataType(data) } } {code} I debugged it and I'm deep on the org.antlr.v4.runtime.Lexer class. Any ideas ? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
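As of the affected versions, the grammar behind CatalystSqlParser (SqlBaseLexer.g4) only admits ASCII letters, digits and underscores in unquoted identifiers, so the lexer stops consuming the token at the umlaut. A minimal sketch of that behaviour (assumption: the single regex below is a simplification of the real ANTLR rule, which has more alternatives but is likewise ASCII-only):

```scala
// Simplified stand-in for the unquoted-identifier rule in SqlBaseLexer.g4
// (assumption: not the real grammar, just its ASCII-only core).
val asciiIdent = "[A-Za-z_][A-Za-z0-9_]*".r

// The match consumes "bfzugt" and stops dead at the 'ä', so the rest of the
// input can no longer be tokenized as part of the column name.
val consumed = asciiIdent.findPrefixOf("bfzugtäeil")
println(consumed) // Some(bfzugt)
```

Backquoting the identifier (`bfzugtäeil`) is the usual way to get non-ASCII column names through Spark SQL; whether the backquoted form also works for this parseDataType shorthand would need to be verified against 3.3.0.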
[jira] [Comment Edited] (SPARK-33772) Build and Run Spark on Java 17
[ https://issues.apache.org/jira/browse/SPARK-33772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653941#comment-17653941 ] Jorge Machado edited comment on SPARK-33772 at 1/3/23 12:27 PM: I still have an issue with this. Running sbt test fails and I don't know why. {code:java} An exception or error caused a run to abort: class org.apache.spark.storage.StorageUtils$ (in unnamed module @0x1aa7ecca) cannot access class sun.nio.ch.DirectBuffer (in module java.base) because module java.base does not export sun.nio.ch to unnamed module @0x1aa7ecca {code} was (Author: jomach): I still have an issue with this. Running sbt test fails and I don't know why. [error] Uncaught exception when running .LocalViewCreatorTest: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils$ 23/01/03 12:16:40 INFO Utils: Successfully started service 'sparkDriver' on port 49902. [error] sbt.ForkMain$ForkError: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils$ [error] at org.apache.spark.storage.BlockManagerMasterEndpoint.(BlockManagerMasterEndpoint.scala:114) > Build and Run Spark on Java 17 > -- > > Key: SPARK-33772 > URL: https://issues.apache.org/jira/browse/SPARK-33772 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Yang Jie >Priority: Major > Labels: releasenotes > Fix For: 3.3.0 > > > Apache Spark supports Java 8 and Java 11 (LTS). The next Java LTS version is > 17. > ||Version||Release Date|| > |Java 17 (LTS)|September 2021| > Apache Spark has a release plan and `Spark 3.2 Code freeze` was July along > with the release branch cut. > - https://spark.apache.org/versioning-policy.html > Supporting new Java version is considered as a new feature which we cannot > allow to backport. 
[jira] [Comment Edited] (SPARK-33772) Build and Run Spark on Java 17
[ https://issues.apache.org/jira/browse/SPARK-33772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653941#comment-17653941 ] Jorge Machado edited comment on SPARK-33772 at 1/3/23 12:27 PM: I still have an issue with this. Running sbt test fails and I don't know why. [error] Uncaught exception when running .LocalViewCreatorTest: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils$ 23/01/03 12:16:40 INFO Utils: Successfully started service 'sparkDriver' on port 49902. [error] sbt.ForkMain$ForkError: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils$ [error] at org.apache.spark.storage.BlockManagerMasterEndpoint.(BlockManagerMasterEndpoint.scala:114) was (Author: jomach): I still have an issue with this. Running sbt test fails and I don't know why. [error] Uncaught exception when running com.deutschebahn.zod.fvdl.commons.aws.athena.LocalViewCreatorTest: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils$ 23/01/03 12:16:40 INFO Utils: Successfully started service 'sparkDriver' on port 49902. [error] sbt.ForkMain$ForkError: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils$ [error] at org.apache.spark.storage.BlockManagerMasterEndpoint.(BlockManagerMasterEndpoint.scala:114) > Build and Run Spark on Java 17 > -- > > Key: SPARK-33772 > URL: https://issues.apache.org/jira/browse/SPARK-33772 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Yang Jie >Priority: Major > Labels: releasenotes > Fix For: 3.3.0 > > > Apache Spark supports Java 8 and Java 11 (LTS). The next Java LTS version is > 17. > ||Version||Release Date|| > |Java 17 (LTS)|September 2021| > Apache Spark has a release plan and `Spark 3.2 Code freeze` was July along > with the release branch cut. 
> - https://spark.apache.org/versioning-policy.html > Supporting new Java version is considered as a new feature which we cannot > allow to backport. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33772) Build and Run Spark on Java 17
[ https://issues.apache.org/jira/browse/SPARK-33772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653941#comment-17653941 ] Jorge Machado commented on SPARK-33772: --- I still have an issue with this. Running sbt test fails and I don't know why. [error] Uncaught exception when running com.deutschebahn.zod.fvdl.commons.aws.athena.LocalViewCreatorTest: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils$ 23/01/03 12:16:40 INFO Utils: Successfully started service 'sparkDriver' on port 49902. [error] sbt.ForkMain$ForkError: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils$ [error] at org.apache.spark.storage.BlockManagerMasterEndpoint.(BlockManagerMasterEndpoint.scala:114) > Build and Run Spark on Java 17 > -- > > Key: SPARK-33772 > URL: https://issues.apache.org/jira/browse/SPARK-33772 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Yang Jie >Priority: Major > Labels: releasenotes > Fix For: 3.3.0 > > > Apache Spark supports Java 8 and Java 11 (LTS). The next Java LTS version is > 17. > ||Version||Release Date|| > |Java 17 (LTS)|September 2021| > Apache Spark has a release plan and `Spark 3.2 Code freeze` was July along > with the release branch cut. > - https://spark.apache.org/versioning-policy.html > Supporting new Java version is considered as a new feature which we cannot > allow to backport. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
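The sun.nio.ch.DirectBuffer error reported in this thread is the usual symptom of forked test JVMs on Java 17 missing the module options that spark-submit injects (collected in org.apache.spark.launcher.JavaModuleOptions since Spark 3.3). A hedged build.sbt sketch, assuming a subset of that flag list is enough for this suite (extend it as other modules complain):

```scala
// build.sbt -- forked test JVMs do not inherit spark-submit's JVM flags,
// so pass the Java 17 module options explicitly
// (assumption: partial list taken from Spark's JavaModuleOptions).
Test / fork := true
Test / javaOptions ++= Seq(
  "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED",
  "--add-opens=java.base/java.lang=ALL-UNNAMED",
  "--add-opens=java.base/java.nio=ALL-UNNAMED",
  "--add-opens=java.base/java.util=ALL-UNNAMED"
)
```

The first flag addresses the StorageUtils$/DirectBuffer failure quoted above; the others cover further internals Spark reflects into.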
[jira] [Commented] (SPARK-31683) Make Prometheus output consistent with DropWizard 4.1 result
[ https://issues.apache.org/jira/browse/SPARK-31683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128625#comment-17128625 ] Jorge Machado commented on SPARK-31683: --- It would be great if we could use Spark's RPC backend to aggregate this into the driver only; that way we only need to scrape the driver. I have made an implementation based on yours that registers it in Consul, so we can discover the YARN applications via Consul, for example... > Make Prometheus output consistent with DropWizard 4.1 result > > > Key: SPARK-31683 > URL: https://issues.apache.org/jira/browse/SPARK-31683 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > SPARK-29032 adds Prometheus support. > After that, SPARK-29674 upgraded DropWizard for JDK9+ support and causes > difference in output labels and number of keys. > > This issue aims to fix this inconsistency in Spark.
[jira] [Commented] (SPARK-31683) Make Prometheus output consistent with DropWizard 4.1 result
[ https://issues.apache.org/jira/browse/SPARK-31683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126671#comment-17126671 ] Jorge Machado commented on SPARK-31683: --- [~dongjoon] this only exports the metrics from the driver, right? > Make Prometheus output consistent with DropWizard 4.1 result > > > Key: SPARK-31683 > URL: https://issues.apache.org/jira/browse/SPARK-31683 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > SPARK-29032 adds Prometheus support. > After that, SPARK-29674 upgraded DropWizard for JDK9+ support and causes > difference in output labels and number of keys. > > This issue aims to fix this inconsistency in Spark.
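For driver-only scraping, the servlet added by SPARK-29032 may already cover part of this. A minimal configuration sketch (assumption: property and sink names as documented for Spark 3.0; both endpoints are served on the driver UI port):

```properties
# spark-defaults.conf -- enable the executor-summary endpoint on the driver UI
spark.ui.prometheus.enabled  true

# metrics.properties -- PrometheusServlet sink for driver-side metrics
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
```

With that in place the driver would expose http://<driver>:4040/metrics/prometheus for its own metrics and http://<driver>:4040/metrics/executors/prometheus for an executor summary, so only the driver needs to be discoverable (e.g. via Consul as described above).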
[jira] [Commented] (SPARK-26902) Support java.time.Instant as an external type of TimestampType
[ https://issues.apache.org/jira/browse/SPARK-26902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085759#comment-17085759 ] Jorge Machado commented on SPARK-26902: --- What about supporting the java.time.temporal.Temporal interface? > Support java.time.Instant as an external type of TimestampType > -- > > Key: SPARK-26902 > URL: https://issues.apache.org/jira/browse/SPARK-26902 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > Currently, Spark supports the java.sql.Date and java.sql.Timestamp types as > external types for Catalyst's DateType and TimestampType. It accepts and > produces values of such types. Since Java 8, base classes for dates and > timestamps are java.time.Instant, java.time.LocalDate/LocalDateTime, and > java.time.ZonedDateTime. Need to add new converters from/to Instant. > The Instant type holds epoch seconds (and nanoseconds), and directly reflects > to Catalyst's TimestampType. > Main motivations for the changes: > - Smoothly support Java 8 time API > - Avoid inconsistency of calendars used inside Spark 3.0 (Proleptic Gregorian > calendar) and inside of java.sql.Timestamp (hybrid calendar - Julian + > Gregorian). > - Make conversion independent from current system timezone. > In case of collecting values of Date/TimestampType, the following SQL config > can control types of returned values: > - spark.sql.catalyst.timestampType with supported values > "java.sql.Timestamp" (by default) and "java.time.Instant"
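For context on the conversion discussed in this issue: Catalyst stores TimestampType as microseconds since the epoch, and an Instant carries epoch seconds plus nanoseconds, so the mapping is direct. A small sketch of that arithmetic (assumption: illustrative only, not Spark's internal converter code):

```scala
import java.time.Instant

// TimestampType is microseconds since the epoch; build that value from an
// Instant's epoch seconds and nanosecond component.
val i = Instant.parse("2020-01-01T00:00:00Z")
val micros = i.getEpochSecond * 1000000L + i.getNano / 1000L
println(micros) // 1577836800000000
```

Because the arithmetic never consults a calendar or time zone, the conversion is independent of the system timezone, which is one of the stated motivations above.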
[jira] [Commented] (SPARK-30272) Remove usage of Guava that breaks in Guava 27
[ https://issues.apache.org/jira/browse/SPARK-30272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070774#comment-17070774 ] Jorge Machado commented on SPARK-30272: --- I failed to fix the Guava stuff, of course... This morning I tried to replicate the problem of the missing hadoop-azure jar, but it seems to be working without any patch from my side, so I assume I did something wrong in the build. Just for reference, my steps: {code:java} git checkout v3.0.0-preview-rc2 ./build/mvn clean package -DskipTests -Phadoop-3.2 -Pkubernetes -Phadoop-cloud ./bin/docker-image-tool.sh -r docker.io/myrepo -t v2.3.0 -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build docker run --rm -it myrepo/spark:v2.3.0 bash 185@57fb3dd68902:/opt/spark/jars$ ls -altr *azure* -rw-r--r-- 1 root root 67314 Mar 28 17:15 hadoop-azure-datalake-3.2.0.jar -rw-r--r-- 1 root root 480512 Mar 28 17:15 hadoop-azure-3.2.0.jar -rw-r--r-- 1 root root 812977 Mar 28 17:15 azure-storage-7.0.0.jar -rw-r--r-- 1 root root 10288 Mar 28 17:15 azure-keyvault-core-1.0.0.jar -rw-r--r-- 1 root root 94061 Mar 28 17:15 azure-data-lake-store-sdk-2.2.9.jar {code} As you can see, hadoop-azure is there, though not at version 3.2.1, but I guess this is a matter of updating the pom. > Remove usage of Guava that breaks in Guava 27 > - > > Key: SPARK-30272 > URL: https://issues.apache.org/jira/browse/SPARK-30272 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Major > Fix For: 3.0.0 > > > Background: > https://issues.apache.org/jira/browse/SPARK-29250 > https://github.com/apache/spark/pull/25932 > Hadoop 3.2.1 will update Guava from 11 to 27. There are a number of methods > that changed between those releases, typically just a rename, but, means one > set of code can't work with both, while we want to work with Hadoop 2.x and > 3.x.
Among them: > - Objects.toStringHelper was moved to MoreObjects; we can just use the > Commons Lang3 equivalent > - Objects.hashCode etc were renamed; use java.util.Objects equivalents > - MoreExecutors.sameThreadExecutor() became directExecutor(); for same-thread > execution we can use a dummy implementation of ExecutorService / Executor > - TypeToken.isAssignableFrom become isSupertypeOf; work around with reflection > There is probably more to the Guava issue than just this change, but it will > make Spark itself work with more versions and reduce our exposure to Guava > along the way anyway. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
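Two of the replacements listed above can be illustrated with plain JDK APIs; a sketch (assumption: the hand-rolled same-thread Executor stands in for Guava's MoreExecutors.sameThreadExecutor()/directExecutor(), as the issue suggests):

```scala
import java.util.Objects
import java.util.concurrent.Executor

// Guava Objects.hashCode(a, b, ...)  ->  java.util.Objects.hash(a, b, ...)
val h = Objects.hash("spark", Integer.valueOf(3))

// Guava MoreExecutors.sameThreadExecutor() / directExecutor()
// -> a trivial Executor that runs the task on the calling thread
val sameThread: Executor = (command: Runnable) => command.run()
sameThread.execute(() => println(s"hash = $h"))
```

Both replacements depend only on the JDK, which is the point of the change: Spark then compiles and runs against either Guava generation.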
[jira] [Commented] (SPARK-30272) Remove usage of Guava that breaks in Guava 27
[ https://issues.apache.org/jira/browse/SPARK-30272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070255#comment-17070255 ] Jorge Machado commented on SPARK-30272: --- So I was able to fix it. I built with the hadoop-3.2 profile, but after the build the hadoop-azure jar was missing, so I added it manually into my container and now it seems to load. I was trying to put in Guava 28 and remove Guava 14, but this is a lot of work... why do we use such an old Guava version? > Remove usage of Guava that breaks in Guava 27 > - > > Key: SPARK-30272 > URL: https://issues.apache.org/jira/browse/SPARK-30272 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Major > Fix For: 3.0.0 > > > Background: > https://issues.apache.org/jira/browse/SPARK-29250 > https://github.com/apache/spark/pull/25932 > Hadoop 3.2.1 will update Guava from 11 to 27. There are a number of methods > that changed between those releases, typically just a rename, but, means one > set of code can't work with both, while we want to work with Hadoop 2.x and > 3.x. Among them: > - Objects.toStringHelper was moved to MoreObjects; we can just use the > Commons Lang3 equivalent > - Objects.hashCode etc were renamed; use java.util.Objects equivalents > - MoreExecutors.sameThreadExecutor() became directExecutor(); for same-thread > execution we can use a dummy implementation of ExecutorService / Executor > - TypeToken.isAssignableFrom become isSupertypeOf; work around with reflection > There is probably more to the Guava issue than just this change, but it will > make Spark itself work with more versions and reduce our exposure to Guava > along the way anyway.
[jira] [Comment Edited] (SPARK-30272) Remove usage of Guava that breaks in Guava 27
[ https://issues.apache.org/jira/browse/SPARK-30272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069933#comment-17069933 ] Jorge Machado edited comment on SPARK-30272 at 3/28/20, 4:25 PM: - Hey Sean, This seems still to make problems for example: {code:java} > $ ./bin/run-example SparkPi 100 > > > [±master ●]> $ ./bin/run-example SparkPi 100 > > > [±master ●]20/03/28 17:21:13 WARN Utils: Your hostname, > Jorges-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using > 192.168.1.2 instead (on interface en0)20/03/28 17:21:13 WARN Utils: Set > SPARK_LOCAL_IP if you need to bind to another address20/03/28 17:21:14 WARN > NativeCodeLoader: Unable to load native-hadoop library for your platform... > using builtin-java classes where applicableUsing Spark's default log4j > profile: org/apache/spark/log4j-defaults.properties20/03/28 17:21:14 INFO > SparkContext: Running Spark version 3.1.0-SNAPSHOT20/03/28 17:21:14 INFO > ResourceUtils: > ==20/03/28 > 17:21:14 INFO ResourceUtils: No custom resources configured for > spark.driver.20/03/28 17:21:14 INFO ResourceUtils: > ==20/03/28 > 17:21:14 INFO SparkContext: Submitted application: Spark Pi20/03/28 17:21:14 > INFO ResourceProfile: Default ResourceProfile created, executor resources: > Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: > memory, amount: 1024, script: , vendor: ), task resources: Map(cpus -> name: > cpus, amount: 1.0)20/03/28 17:21:14 INFO ResourceProfile: Limiting resource > is cpu20/03/28 17:21:14 INFO ResourceProfileManager: Added ResourceProfile > id: 020/03/28 17:21:14 INFO SecurityManager: Changing view acls to: > jorge20/03/28 17:21:14 INFO SecurityManager: Changing modify acls to: > jorge20/03/28 17:21:14 INFO SecurityManager: Changing view acls groups > to:20/03/28 17:21:14 INFO SecurityManager: Changing modify acls groups > to:20/03/28 17:21:14 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with 
view permissions: Set(jorge); groups > with view permissions: Set(); users with modify permissions: Set(jorge); > groups with modify permissions: Set()20/03/28 17:21:14 INFO Utils: > Successfully started service 'sparkDriver' on port 58192.20/03/28 17:21:14 > INFO SparkEnv: Registering MapOutputTracker20/03/28 17:21:14 INFO SparkEnv: > Registering BlockManagerMaster20/03/28 17:21:14 INFO > BlockManagerMasterEndpoint: Using > org.apache.spark.storage.DefaultTopologyMapper for getting topology > information20/03/28 17:21:14 INFO BlockManagerMasterEndpoint: > BlockManagerMasterEndpoint up20/03/28 17:21:14 INFO SparkEnv: Registering > BlockManagerMasterHeartbeat20/03/28 17:21:14 INFO DiskBlockManager: Created > local directory at > /private/var/folders/0h/5b7dw9p11l58hyk0_s0d3cnhgn/T/blockmgr-d9e88815-075e-4c9b-9cc8-21c72e97c86920/03/28 > 17:21:14 INFO MemoryStore: MemoryStore started with capacity 366.3 > MiB20/03/28 17:21:14 INFO SparkEnv: Registering > OutputCommitCoordinator20/03/28 17:21:15 INFO Utils: Successfully started > service 'SparkUI' on port 4040.20/03/28 17:21:15 INFO SparkUI: Bound SparkUI > to 0.0.0.0, and started at http://192.168.1.2:404020/03/28 17:21:15 INFO > SparkContext: Added JAR > file:///Users/jorge/Downloads/spark/dist/examples/jars/spark-examples_2.12-3.1.0-SNAPSHOT.jar > at spark://192.168.1.2:58192/jars/spark-examples_2.12-3.1.0-SNAPSHOT.jar > with timestamp 158541247516620/03/28 17:21:15 INFO SparkContext: Added JAR > file:///Users/jorge/Downloads/spark/dist/examples/jars/scopt_2.12-3.7.1.jar > at spark://192.168.1.2:58192/jars/scopt_2.12-3.7.1.jar with timestamp > 158541247516620/03/28 17:21:15 INFO Executor: Starting executor ID driver on > host 192.168.1.220/03/28 17:21:15 INFO Utils: Successfully started service > 'org.apache.spark.network.netty.NettyBlockTransferService' on port > 58193.20/03/28 17:21:15 INFO NettyBlockTransferService: Server created on > 192.168.1.2:5819320/03/28 17:21:15 INFO BlockManager: Using > 
org.apache.spark.storage.RandomBlockReplicationPolicy for block replication > policyException in thread "main" java.lang.NoClassDefFoundError: > org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess > at java.lang.ClassLoader.defineClass1(Native Method) at > java.lang.ClassLoader.defineClass(ClassLoader.java:763) at
[jira] [Commented] (SPARK-30272) Remove usage of Guava that breaks in Guava 27
[ https://issues.apache.org/jira/browse/SPARK-30272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069933#comment-17069933 ] Jorge Machado commented on SPARK-30272: --- Hey Sean, This seems still to make problems for example: {code:java} java.lang.NoClassDefFoundError: com/google/common/util/concurrent/internal/InternalFutureFailureAccess java.lang.NoClassDefFoundError: com/google/common/util/concurrent/internal/InternalFutureFailureAccess at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:757) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:468) at java.net.URLClassLoader.access$100(URLClassLoader.java:74) at java.net.URLClassLoader$1.run(URLClassLoader.java:369) at java.net.URLClassLoader$1.run(URLClassLoader.java:363) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:362) at java.lang.ClassLoader.loadClass(ClassLoader.java:419) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:352) at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:757) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:468) at java.net.URLClassLoader.access$100(URLClassLoader.java:74) at java.net.URLClassLoader$1.run(URLClassLoader.java:369) at java.net.URLClassLoader$1.run(URLClassLoader.java:363) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:362) at java.lang.ClassLoader.loadClass(ClassLoader.java:419) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:352) at java.lang.ClassLoader.defineClass1(Native Method) at 
java.lang.ClassLoader.defineClass(ClassLoader.java:757) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:468) at java.net.URLClassLoader.access$100(URLClassLoader.java:74) at java.net.URLClassLoader$1.run(URLClassLoader.java:369) at java.net.URLClassLoader$1.run(URLClassLoader.java:363) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:362) at java.lang.ClassLoader.loadClass(ClassLoader.java:419) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:352) at com.google.common.cache.LocalCache$LoadingValueReference.(LocalCache.java:3472) at com.google.common.cache.LocalCache$LoadingValueReference.(LocalCache.java:3476) at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2134) at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2045) at com.google.common.cache.LocalCache.get(LocalCache.java:3951) at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3974) at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4958) at org.apache.hadoop.security.Groups.getGroups(Groups.java:228) at org.apache.hadoop.security.UserGroupInformation.getGroups(UserGroupInformation.java:1588) at org.apache.hadoop.security.UserGroupInformation.getPrimaryGroupName(UserGroupInformation.java:1453) at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.(AzureBlobFileSystemStore.java:147) at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:104) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320) at 
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:522) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:491) at org.apache.spark.SparkContext.$anonfun$newAPIHadoopFile$2(SparkContext.scala:1219) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.SparkContext.withScope(SparkContext.scala:757) at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:1207) at org.apache.spark.api.java.JavaSparkContext.newAPIHadoopFile(JavaSparkContext.scala:484) {code} I still see a lot of references to Guava 14 on master; is this normal? Sorry for the question... > Remove usage of Guava that
[jira] [Comment Edited] (SPARK-23897) Guava version
[ https://issues.apache.org/jira/browse/SPARK-23897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069352#comment-17069352 ] Jorge Machado edited comment on SPARK-23897 at 3/28/20, 9:50 AM: - I think that master is actually broken, at least at commit d025ddbaa7e7b9746d8e47aeed61ed39d2f09f0e. I built with: {code:java} ./build/mvn clean package -DskipTests -Phadoop-3.2 -Pkubernetes -Phadoop-cloud {code} I get: {code:java} jorge@Jorges-MacBook-Pro ~/Downloads/spark/dist/bin [10:43:24]jorge@Jorges-MacBook-Pro ~/Downloads/spark/dist/bin [10:43:24]> $ java -version [±master ✓]java version "1.8.0_211"Java(TM) SE Runtime Environment (build 1.8.0_211-b12)Java HotSpot(TM) 64-Bit Server VM (build 25.211-b12, mixed mode) jorge@Jorges-MacBook-Pro ~/Downloads/spark/dist/bin [10:43:27]> $ ./run-example SparkPi 100 [±master ✓]Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357) at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338) at org.apache.spark.deploy.SparkHadoopUtil$.org$apache$spark$deploy$SparkHadoopUtil$$appendS3AndSparkHadoopHiveConfigurations(SparkHadoopUtil.scala:456) at org.apache.spark.deploy.SparkHadoopUtil$.newConfiguration(SparkHadoopUtil.scala:427) at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$2(SparkSubmit.scala:342) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:342) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) at
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala){code} If I delete Guava 14 and add Guava 28, it works.
[jira] [Commented] (SPARK-23897) Guava version
[ https://issues.apache.org/jira/browse/SPARK-23897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069352#comment-17069352 ] Jorge Machado commented on SPARK-23897: --- I think that master is actually broken, at least at commit d025ddbaa7e7b9746d8e47aeed61ed39d2f09f0e. I built with: {code:java} ./build/mvn clean package -DskipTests -Phadoop-3.2 -Pkubernetes -Phadoop-cloud {code} I get: {code:java} jorge@Jorges-MacBook-Pro ~/Downloads/spark/dist/bin [10:43:24]jorge@Jorges-MacBook-Pro ~/Downloads/spark/dist/bin [10:43:24]> $ java -version [±master ✓]java version "1.8.0_211"Java(TM) SE Runtime Environment (build 1.8.0_211-b12)Java HotSpot(TM) 64-Bit Server VM (build 25.211-b12, mixed mode) jorge@Jorges-MacBook-Pro ~/Downloads/spark/dist/bin [10:43:27]> $ ./run-example SparkPi 100 [±master ✓]Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357) at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338) at org.apache.spark.deploy.SparkHadoopUtil$.org$apache$spark$deploy$SparkHadoopUtil$$appendS3AndSparkHadoopHiveConfigurations(SparkHadoopUtil.scala:456) at org.apache.spark.deploy.SparkHadoopUtil$.newConfiguration(SparkHadoopUtil.scala:427) at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$2(SparkSubmit.scala:342) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:342) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007) at
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} > Guava version > - > > Key: SPARK-23897 > URL: https://issues.apache.org/jira/browse/SPARK-23897 > Project: Spark > Issue Type: Dependency upgrade > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Sercan Karaoglu >Priority: Minor > > Guava dependency version 14 is pretty old, needs to be updated to at least > 16, google cloud storage connector uses newer one which causes pretty popular > error with guava; "java.lang.NoSuchMethodError: > com.google.common.base.Splitter.splitToList(Ljava/lang/CharSequence;)Ljava/util/List;" > and causes app to crash -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046958#comment-17046958 ] Jorge Machado commented on SPARK-26412: --- Thanks for the tip. It helps. > Allow Pandas UDF to take an iterator of pd.DataFrames > - > > Key: SPARK-26412 > URL: https://issues.apache.org/jira/browse/SPARK-26412 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Pandas UDF is the ideal connection between PySpark and DL model inference > workload. However, user needs to load the model file first to make > predictions. It is common to see models of size ~100MB or bigger. If the > Pandas UDF execution is limited to each batch, user needs to repeatedly load > the same model for every batch in the same python worker process, which is > inefficient. > We can provide users the iterator of batches in pd.DataFrame and let user > code handle it: > {code} > @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER) > def predict(batch_iter): > model = ... # load model > for batch in batch_iter: > yield model.predict(batch) > {code} > The type of each batch is: > * a pd.Series if UDF is called with a single non-struct-type column > * a tuple of pd.Series if UDF is called with more than one Spark DF columns > * a pd.DataFrame if UDF is called with a single StructType column > Examples: > {code} > @pandas_udf(...) > def evaluate(batch_iter): > model = ... # load model > for features, label in batch_iter: > pred = model.predict(features) > yield (pred - label).abs() > df.select(evaluate(col("features"), col("label")).alias("err")) > {code} > {code} > @pandas_udf(...) > def evaluate(pdf_iter): > model = ... 
# load model > for pdf in pdf_iter: > pred = model.predict(pdf['x']) > yield (pred - pdf['y']).abs() > df.select(evaluate(struct(col("features"), col("label"))).alias("err")) > {code} > If the UDF doesn't return the same number of records for the entire > partition, user should see an error. We don't restrict that every yield > should match the input batch size. > Another benefit is with iterator interface and asyncio from Python, it is > flexible for users to implement data pipelining. > cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]
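The load-once-per-partition pattern described in SPARK-26412 above can be sketched without Spark: a generator-based UDF body loads the model a single time and then yields one output batch per input batch. `FakeModel` and the batch data below are hypothetical stand-ins, not Spark APIs.

```python
# Sketch of the SCALAR_ITER pattern from the issue description, with plain
# Python generators standing in for Spark's stream of Arrow batches.

class FakeModel:
    loads = 0  # counts how often the (expensive) "model file" is read

    def __init__(self):
        FakeModel.loads += 1

    def predict(self, batch):
        return [x * 2 for x in batch]

def predict_udf(batch_iter):
    model = FakeModel()          # loaded once per partition, not per batch
    for batch in batch_iter:
        yield model.predict(batch)

batches = iter([[1, 2], [3, 4], [5]])
out = list(predict_udf(batches))  # one result batch per input batch
```

The point of the iterator interface is exactly that `FakeModel()` runs once, however many batches the partition contains.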
[jira] [Commented] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046373#comment-17046373 ] Jorge Machado commented on SPARK-26412: --- Well, I was thinking of something more, like passing a, b, c on to another object: {code:java} class SomeClass(): def __init__(self, a, b, c): pass def map_func(batch_iter): dataset = SomeClass(batch_iter[0], batch_iter[1], batch_iter[2]) # <- this does not work {code} Another thing: it would be great if we could just yield JSON, for example, instead of these fixed types. > Allow Pandas UDF to take an iterator of pd.DataFrames > - > > Key: SPARK-26412 > URL: https://issues.apache.org/jira/browse/SPARK-26412 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Pandas UDF is the ideal connection between PySpark and DL model inference > workload. However, user needs to load the model file first to make > predictions. It is common to see models of size ~100MB or bigger. If the > Pandas UDF execution is limited to each batch, user needs to repeatedly load > the same model for every batch in the same python worker process, which is > inefficient. > We can provide users the iterator of batches in pd.DataFrame and let user > code handle it: > {code} > @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER) > def predict(batch_iter): > model = ... # load model > for batch in batch_iter: > yield model.predict(batch) > {code} > The type of each batch is: > * a pd.Series if UDF is called with a single non-struct-type column > * a tuple of pd.Series if UDF is called with more than one Spark DF columns > * a pd.DataFrame if UDF is called with a single StructType column > Examples: > {code} > @pandas_udf(...) > def evaluate(batch_iter): > model = ... 
# load model > for features, label in batch_iter: > pred = model.predict(features) > yield (pred - label).abs() > df.select(evaluate(col("features"), col("label")).alias("err")) > {code} > {code} > @pandas_udf(...) > def evaluate(pdf_iter): > model = ... # load model > for pdf in pdf_iter: > pred = model.predict(pdf['x']) > yield (pred - pdf['y']).abs() > df.select(evaluate(struct(col("features"), col("label"))).alias("err")) > {code} > If the UDF doesn't return the same number of records for the entire > partition, user should see an error. We don't restrict that every yield > should match the input batch size. > Another benefit is with iterator interface and asyncio from Python, it is > flexible for users to implement data pipelining. > cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]
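A possible fix for the `SomeClass` snippet in the comment above: `batch_iter` yields one tuple per batch, so it cannot be indexed like `batch_iter[0]`; instead, unpack inside the loop and construct the object per batch. `SomeClass` and `map_func` mirror the names from the comment and are illustrative only (lists stand in for `pd.Series`).

```python
# The iterator yields one (a, b, c) tuple per batch; index the *batch*,
# not the iterator, or simply unpack in the for-loop header.

class SomeClass:
    def __init__(self, a, b, c):   # note: 'self' was missing in the original
        self.a, self.b, self.c = a, b, c

def map_func(batch_iter):
    for a, b, c in batch_iter:     # each element is a tuple of column batches
        dataset = SomeClass(a, b, c)
        yield [x + y + z for x, y, z in zip(dataset.a, dataset.b, dataset.c)]

batches = iter([([1], [2], [3]), ([4], [5], [6])])
result = list(map_func(batches))
```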
[jira] [Commented] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046265#comment-17046265 ] Jorge Machado commented on SPARK-26412: --- Hi, one question: when using "a tuple of pd.Series if UDF is called with more than one Spark DF columns", how can I get the Series into variables, like a, b, c = iterator? The map seems not to be a Python tuple... > Allow Pandas UDF to take an iterator of pd.DataFrames > - > > Key: SPARK-26412 > URL: https://issues.apache.org/jira/browse/SPARK-26412 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Pandas UDF is the ideal connection between PySpark and DL model inference > workload. However, user needs to load the model file first to make > predictions. It is common to see models of size ~100MB or bigger. If the > Pandas UDF execution is limited to each batch, user needs to repeatedly load > the same model for every batch in the same python worker process, which is > inefficient. > We can provide users the iterator of batches in pd.DataFrame and let user > code handle it: > {code} > @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER) > def predict(batch_iter): > model = ... # load model > for batch in batch_iter: > yield model.predict(batch) > {code} > The type of each batch is: > * a pd.Series if UDF is called with a single non-struct-type column > * a tuple of pd.Series if UDF is called with more than one Spark DF columns > * a pd.DataFrame if UDF is called with a single StructType column > Examples: > {code} > @pandas_udf(...) > def evaluate(batch_iter): > model = ... # load model > for features, label in batch_iter: > pred = model.predict(features) > yield (pred - label).abs() > df.select(evaluate(col("features"), col("label")).alias("err")) > {code} > {code} > @pandas_udf(...) > def evaluate(pdf_iter): > model = ... 
# load model > for pdf in pdf_iter: > pred = model.predict(pdf['x']) > yield (pred - pdf['y']).abs() > df.select(evaluate(struct(col("features"), col("label"))).alias("err")) > {code} > If the UDF doesn't return the same number of records for the entire > partition, user should see an error. We don't restrict that every yield > should match the input batch size. > Another benefit is with iterator interface and asyncio from Python, it is > flexible for users to implement data pipelining. > cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]
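To the `a, b, c = iterator` question above: unpacking the iterator itself cannot work, because the iterator yields one tuple *per batch*; the tuple-unpacking belongs inside the loop, exactly as in the `for features, label in batch_iter:` example from the description. A minimal sketch with plain lists standing in for `pd.Series`:

```python
# Each item from batch_iter is itself a tuple of per-column batches,
# so unpack per iteration rather than unpacking the iterator.

def batch_iter():
    yield ([1, 2], [10, 20])   # batch 1: two "columns"
    yield ([3], [30])          # batch 2

sums = []
for a, b in batch_iter():      # a, b hold this batch's column values
    sums.append([x + y for x, y in zip(a, b)])
```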
[jira] [Commented] (SPARK-24615) SPIP: Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034492#comment-17034492 ] Jorge Machado commented on SPARK-24615: --- Yeah, that was my question. Thanks for the response. I will look at rapid.ai and try to use it inside a partition or so... > SPIP: Accelerator-aware task scheduling for Spark > - > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Epic > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Thomas Graves >Priority: Major > Labels: Hydrogen, SPIP > Attachments: Accelerator-aware scheduling in Apache Spark 3.0.pdf, > SPIP_ Accelerator-aware scheduling.pdf > > > (The JIRA received a major update on 2019/02/28. Some comments were based on > an earlier version. Please ignore them. New comments start at > [#comment-16778026].) > h2. Background and Motivation > GPUs and other accelerators have been widely used for accelerating special > workloads, e.g., deep learning and signal processing. While users from the AI > community use GPUs heavily, they often need Apache Spark to load and process > large datasets and to handle complex data scenarios like streaming. YARN and > Kubernetes already support GPUs in their recent releases. Although Spark > supports those two cluster managers, Spark itself is not aware of GPUs > exposed by them and hence Spark cannot properly request GPUs and schedule > them for users. This leaves a critical gap to unify big data and AI workloads > and make life simpler for end users. > To make Spark be aware of GPUs, we shall make two major changes at high level: > * At cluster manager level, we update or upgrade cluster managers to include > GPU support. Then we expose user interfaces for Spark to request GPUs from > them. > * Within Spark, we update its scheduler to understand available GPUs > allocated to executors, user task requests, and assign GPUs to tasks properly. 
> Based on the work done in YARN and Kubernetes to support GPUs and some > offline prototypes, we could have necessary features implemented in the next > major release of Spark. You can find a detailed scoping doc here, where we > listed user stories and their priorities. > h2. Goals > * Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes. > * No regression on scheduler performance for normal jobs. > h2. Non-goals > * Fine-grained scheduling within one GPU card. > ** We treat one GPU card and its memory together as a non-divisible unit. > * Support TPU. > * Support Mesos. > * Support Windows. > h2. Target Personas > * Admins who need to configure clusters to run Spark with GPU nodes. > * Data scientists who need to build DL applications on Spark. > * Developers who need to integrate DL features on Spark.
[jira] [Commented] (SPARK-24615) SPIP: Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034277#comment-17034277 ] Jorge Machado commented on SPARK-24615: --- [~tgraves] thanks for the input. It would be great to have one or two examples on how to use the GPUs within a dataset. I tried to figure out the api but I did not find any useful docs. Any tip? > SPIP: Accelerator-aware task scheduling for Spark > - > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Epic > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Thomas Graves >Priority: Major > Labels: Hydrogen, SPIP > Attachments: Accelerator-aware scheduling in Apache Spark 3.0.pdf, > SPIP_ Accelerator-aware scheduling.pdf > > > (The JIRA received a major update on 2019/02/28. Some comments were based on > an earlier version. Please ignore them. New comments start at > [#comment-16778026].) > h2. Background and Motivation > GPUs and other accelerators have been widely used for accelerating special > workloads, e.g., deep learning and signal processing. While users from the AI > community use GPUs heavily, they often need Apache Spark to load and process > large datasets and to handle complex data scenarios like streaming. YARN and > Kubernetes already support GPUs in their recent releases. Although Spark > supports those two cluster managers, Spark itself is not aware of GPUs > exposed by them and hence Spark cannot properly request GPUs and schedule > them for users. This leaves a critical gap to unify big data and AI workloads > and make life simpler for end users. > To make Spark be aware of GPUs, we shall make two major changes at high level: > * At cluster manager level, we update or upgrade cluster managers to include > GPU support. Then we expose user interfaces for Spark to request GPUs from > them. 
> * Within Spark, we update its scheduler to understand available GPUs > allocated to executors, user task requests, and assign GPUs to tasks properly. > Based on the work done in YARN and Kubernetes to support GPUs and some > offline prototypes, we could have necessary features implemented in the next > major release of Spark. You can find a detailed scoping doc here, where we > listed user stories and their priorities. > h2. Goals > * Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes. > * No regression on scheduler performance for normal jobs. > h2. Non-goals > * Fine-grained scheduling within one GPU card. > ** We treat one GPU card and its memory together as a non-divisible unit. > * Support TPU. > * Support Mesos. > * Support Windows. > h2. Target Personas > * Admins who need to configure clusters to run Spark with GPU nodes. > * Data scientists who need to build DL applications on Spark. > * Developers who need to integrate DL features on Spark.
[jira] [Commented] (SPARK-30647) When creating a custom datasource File NotFoundExpection happens
[ https://issues.apache.org/jira/browse/SPARK-30647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029788#comment-17029788 ] Jorge Machado commented on SPARK-30647: --- 2.4.x has the same issue. > When creating a custom datasource File NotFoundExpection happens > > > Key: SPARK-30647 > URL: https://issues.apache.org/jira/browse/SPARK-30647 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: Jorge Machado >Priority: Major > > Hello, I'm creating a datasource based on FileFormat and DataSourceRegister. > When I pass a path or a file that has a white space it seems to fail with the > error: > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in > stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 > (TID 213, localhost, executor driver): java.io.FileNotFoundException: File > file:somePath/0019_leftImg8%20bit.png does not exist It is possible the > underlying files have been updated. You can explicitly invalidate the cache > in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating > the Dataset/DataFrame involved. 
at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) > at > org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125) > at > org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) > {code} > I'm happy to fix this if someone tells me where I need to look. > I think it is on org.apache.spark.rdd.InputFileBlockHolder : > {code:java} > inputBlock.set(new FileBlock(UTF8String.fromString(filePath), startOffset, > length)) > {code} >
[jira] [Commented] (SPARK-27990) Provide a way to recursively load data from datasource
[ https://issues.apache.org/jira/browse/SPARK-27990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028856#comment-17028856 ] Jorge Machado commented on SPARK-27990: --- [~nchammas]: Just pass these options, like: {code:java} .option("pathGlobFilter", ".*\\.png|.*\\.jpg|.*\\.jpeg|.*\\.PNG|.*\\.JPG|.*\\.JPEG") .option("recursiveFileLookup", "true") {code} > Provide a way to recursively load data from datasource > -- > > Key: SPARK-27990 > URL: https://issues.apache.org/jira/browse/SPARK-27990 > Project: Spark > Issue Type: New Feature > Components: ML, SQL >Affects Versions: 2.4.3 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Provide a way to recursively load data from datasource.
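For readers without a Spark shell at hand, the effect of the two options can be approximated in plain Python: `recursiveFileLookup` walks subdirectories and `pathGlobFilter` keeps only matching file names. Note that Spark documents `pathGlobFilter` as a glob pattern (e.g. `*.png`), while the snippet above passes regex-style alternatives; this sketch uses globs and is an illustration, not Spark's implementation.

```python
# Pure-Python approximation of recursiveFileLookup + pathGlobFilter.
import fnmatch
import os

def list_files(root, glob_pattern, recursive):
    """Collect files under root whose *names* match the glob pattern."""
    matches = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if fnmatch.fnmatch(name, glob_pattern):
                matches.append(os.path.join(dirpath, name))
        if not recursive:
            break              # os.walk yields the top directory first
    return sorted(matches)
```

With `recursive=False` only the top-level directory is scanned, mirroring Spark's default behavior before SPARK-27990.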
[jira] [Commented] (SPARK-27990) Provide a way to recursively load data from datasource
[ https://issues.apache.org/jira/browse/SPARK-27990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028854#comment-17028854 ] Jorge Machado commented on SPARK-27990: --- Can we backport this to 2.4.4? > Provide a way to recursively load data from datasource > -- > > Key: SPARK-27990 > URL: https://issues.apache.org/jira/browse/SPARK-27990 > Project: Spark > Issue Type: New Feature > Components: ML, SQL >Affects Versions: 2.4.3 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Provide a way to recursively load data from datasource.
[jira] [Commented] (SPARK-30647) When creating a custom datasource File NotFoundExpection happens
[ https://issues.apache.org/jira/browse/SPARK-30647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024242#comment-17024242 ] Jorge Machado commented on SPARK-30647: --- I found a way to overcome this. I just replace %20 with " ", like this: {code:java} (file: PartitionedFile) => { val origin = file.filePath.replace("%20", " ") }{code} > When creating a custom datasource File NotFoundExpection happens > > > Key: SPARK-30647 > URL: https://issues.apache.org/jira/browse/SPARK-30647 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: Jorge Machado >Priority: Major > > Hello, I'm creating a datasource based on FileFormat and DataSourceRegister. > When I pass a path or a file that has a white space it seems to fail with the > error: > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in > stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 > (TID 213, localhost, executor driver): java.io.FileNotFoundException: File > file:somePath/0019_leftImg8%20bit.png does not exist It is possible the > underlying files have been updated. You can explicitly invalidate the cache > in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating > the Dataset/DataFrame involved. 
at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) > at > org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125) > at > org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) > {code} > I'm happy to fix this if someone tells me where I need to look. > I think it is on org.apache.spark.rdd.InputFileBlockHolder : > {code:java} > inputBlock.set(new FileBlock(UTF8String.fromString(filePath), startOffset, > length)) > {code} >
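The `replace("%20", " ")` workaround above only handles spaces; a more general approach is to URL-decode the whole path, which also covers other percent-escapes such as `%25` or encoded umlauts. A Scala analogue might use `java.net.URLDecoder.decode`; the sketch below is plain Python for illustration, and whether Spark hands the path percent-encoded at all depends on the Spark version.

```python
# Decoding the percent-encoded file path instead of special-casing "%20".
from urllib.parse import unquote

encoded = "somePath/0019_leftImg8%20bit.png"
space_only = encoded.replace("%20", " ")   # the workaround from the comment
decoded = unquote(encoded)                 # handles %20, %25, %C3%A4, ...
```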
[jira] [Comment Edited] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces
[ https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023935#comment-17023935 ] Jorge Machado edited comment on SPARK-23148 at 1/27/20 7:22 AM: Hi [~hyukjin.kwon] and [~henryr], I have the same problem if I create a custom data source: ``` class ImageFileValidator extends FileFormat with DataSourceRegister with Serializable ``` So the problem must be somewhere else. Here is my trace. I created https://issues.apache.org/jira/browse/SPARK-30647 {code:java} org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 213, localhost, executor driver): java.io.FileNotFoundException: File file:somePath/0019_leftImg8%20bit.png does not exist It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. 
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) at org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125) at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221) at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) {code} > spark.read.csv with multiline=true gives FileNotFoundException if path > contains spaces > -- > > Key: SPARK-23148 > URL: https://issues.apache.org/jira/browse/SPARK-23148 > Project:
[jira] [Updated] (SPARK-30647) When creating a custom datasource a FileNotFoundException happens
[ https://issues.apache.org/jira/browse/SPARK-30647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Machado updated SPARK-30647: -- Issue Type: Bug (was: Improvement) > When creating a custom datasource a FileNotFoundException happens > > > Key: SPARK-30647 > URL: https://issues.apache.org/jira/browse/SPARK-30647 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: Jorge Machado >Priority: Major > > Hello, I'm creating a datasource based on FileFormat and DataSourceRegister. > When I pass a path or a file that contains a space, it seems to fail with the > error: > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in > stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 > (TID 213, localhost, executor driver): java.io.FileNotFoundException: File > file:somePath/0019_leftImg8%20bit.png does not exist It is possible the > underlying files have been updated. You can explicitly invalidate the cache > in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating > the Dataset/DataFrame involved. 
at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) > at > org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125) > at > org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) > {code} > I'm happy to fix this if someone tells me where I need to look. > I think it is on org.apache.spark.rdd.InputFileBlockHolder : > {code:java} > inputBlock.set(new FileBlock(UTF8String.fromString(filePath), startOffset, > length)) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30647) When creating a custom datasource a FileNotFoundException happens
Jorge Machado created SPARK-30647: - Summary: When creating a custom datasource a FileNotFoundException happens Key: SPARK-30647 URL: https://issues.apache.org/jira/browse/SPARK-30647 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.2 Reporter: Jorge Machado Hello, I'm creating a datasource based on FileFormat and DataSourceRegister. When I pass a path or a file that contains a space, it seems to fail with the error: {code:java} org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 213, localhost, executor driver): java.io.FileNotFoundException: File file:somePath/0019_leftImg8%20bit.png does not exist It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) at org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125) at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221) at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299) at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) {code} I'm happy to fix this if someone tells me where I need to look. I think it is on org.apache.spark.rdd.InputFileBlockHolder : {code:java} inputBlock.set(new FileBlock(UTF8String.fromString(filePath), startOffset, length)) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
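The `%20` in the failing filename suggests the local path is percent-encoded on its way through a `java.net.URI` and then handed to the filesystem without being decoded. A minimal, self-contained sketch of that round trip (illustrative only; these helpers are not Spark's actual internals):

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;

// Illustrative sketch: a local path containing a space, once stringified
// through java.net.URI, comes back with the space percent-encoded as %20.
// Handing that encoded string to the local filesystem as-is then fails
// with FileNotFoundException; decoding first restores the real path.
class PathEncoding {
    // How a path with a space looks after URI stringification
    static String toUriString(String localPath) throws URISyntaxException {
        return new URI("file", null, localPath, null).toString();
    }

    // Decode before touching the local filesystem
    static String decode(String encoded) throws UnsupportedEncodingException {
        return URLDecoder.decode(encoded, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        String raw = "/tmp/a b c/a.csv";
        String encoded = toUriString(raw);
        System.out.println(encoded);          // file:/tmp/a%20b%20c/a.csv
        System.out.println(decode(encoded));  // file:/tmp/a b c/a.csv
    }
}
```

If this is the cause, decoding the stringified URI (or passing around the `URI`/`Path` object rather than its string form) before opening the file would avoid the `FileNotFoundException`.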
[jira] [Commented] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces
[ https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023935#comment-17023935 ] Jorge Machado commented on SPARK-23148: --- I have the same problem if I create a custom data source: ``` class ImageFileValidator extends FileFormat with DataSourceRegister with Serializable ``` So the problem must be somewhere else. Here is my trace: {code:java} org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 213, localhost, executor driver): java.io.FileNotFoundException: File file:somePath/0019_leftImg8%20bit.png does not exist It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) at org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125) at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221) at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165) at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) {code} > spark.read.csv with multiline=true gives FileNotFoundException if path > contains spaces > -- > > Key: SPARK-23148 > URL: https://issues.apache.org/jira/browse/SPARK-23148 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bogdan Raducanu >Assignee: Henry Robinson >Priority: Major > Fix For: 2.3.0 > > > Repro code: > {code:java} > spark.range(10).write.csv("/tmp/a b c/a.csv") > spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count > 10 > spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count > java.io.FileNotFoundException: File > file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv > does not exist > {code} > Trying to manually escape fails in a different place: > {code} > spark.read.option("multiLine", true).csv("/tmp/a%20b%20c/a.csv").count > org.apache.spark.sql.AnalysisException: Path does not exist: > file:/tmp/a%20b%20c/a.csv; > at > org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:683) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29158) Expose SerializableConfiguration for DSv2
[ https://issues.apache.org/jira/browse/SPARK-29158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997947#comment-16997947 ] Jorge Machado commented on SPARK-29158: --- How can we get SerializableConfiguration with 2.4.4? Any alternative? > Expose SerializableConfiguration for DSv2 > - > > Key: SPARK-29158 > URL: https://issues.apache.org/jira/browse/SPARK-29158 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Major > Fix For: 3.0.0 > > > Since we use it frequently inside of our own DataSourceV2 implementations (13 times, per `grep -r broadcastedConf ./sql/core/src/ | grep val | wc -l`), we should expose the SerializableConfiguration for DSv2 dev work -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
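On 2.4.x, where `SerializableConfiguration` is not public, a common workaround is a small hand-rolled Serializable holder. A sketch of the pattern, with `java.util.Properties` standing in for Hadoop's `Configuration` to keep the example dependency-free (an assumption; with a real `Configuration` you would call `conf.write(out)` and `conf.readFields(in)` instead):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Properties;

// Sketch of the usual workaround: wrap the non-serializable configuration
// in your own Serializable holder and write/read the wrapped object by hand.
// Properties stands in for Hadoop's Configuration here (an assumption, to
// keep the example self-contained).
class SerializableConf implements Serializable {
    private transient Properties props;

    SerializableConf(Properties props) { this.props = props; }

    Properties get() { return props; }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        // Persist the transient field explicitly as a byte array
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        props.store(buf, null);
        out.writeObject(buf.toByteArray());
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        // Restore the transient field on deserialization
        props = new Properties();
        props.load(new ByteArrayInputStream((byte[]) in.readObject()));
    }
}
```

An instance of such a holder can then be broadcast or captured in task closures the way `SerializableConfiguration` is used inside Spark itself.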
[jira] [Commented] (SPARK-24615) SPIP: Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882686#comment-16882686 ] Jorge Machado commented on SPARK-24615: --- Hi guys, is there any progress here? I would like to help with the implementation step. I did not see any interfaces designed yet. > SPIP: Accelerator-aware task scheduling for Spark > - > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Epic > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Thomas Graves >Priority: Major > Labels: Hydrogen, SPIP > Attachments: Accelerator-aware scheduling in Apache Spark 3.0.pdf, > SPIP_ Accelerator-aware scheduling.pdf > > > (The JIRA received a major update on 2019/02/28. Some comments were based on > an earlier version. Please ignore them. New comments start at > [#comment-16778026].) > h2. Background and Motivation > GPUs and other accelerators have been widely used for accelerating special > workloads, e.g., deep learning and signal processing. While users from the AI > community use GPUs heavily, they often need Apache Spark to load and process > large datasets and to handle complex data scenarios like streaming. YARN and > Kubernetes already support GPUs in their recent releases. Although Spark > supports those two cluster managers, Spark itself is not aware of GPUs > exposed by them and hence Spark cannot properly request GPUs and schedule > them for users. This leaves a critical gap to unify big data and AI workloads > and make life simpler for end users. > To make Spark be aware of GPUs, we shall make two major changes at high level: > * At cluster manager level, we update or upgrade cluster managers to include > GPU support. Then we expose user interfaces for Spark to request GPUs from > them. 
> * Within Spark, we update its scheduler to understand available GPUs > allocated to executors, user task requests, and assign GPUs to tasks properly. > Based on the work done in YARN and Kubernetes to support GPUs and some > offline prototypes, we could have necessary features implemented in the next > major release of Spark. You can find a detailed scoping doc here, where we > listed user stories and their priorities. > h2. Goals > * Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes. > * No regression on scheduler performance for normal jobs. > h2. Non-goals > * Fine-grained scheduling within one GPU card. > ** We treat one GPU card and its memory together as a non-divisible unit. > * Support TPU. > * Support Mesos. > * Support Windows. > h2. Target Personas > * Admins who need to configure clusters to run Spark with GPU nodes. > * Data scientists who need to build DL applications on Spark. > * Developers who need to integrate DL features on Spark. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27208) RestSubmissionClient only supports http
[ https://issues.apache.org/jira/browse/SPARK-27208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16806495#comment-16806495 ] Jorge Machado commented on SPARK-27208: --- Any help here, please? I really think this is broken. > RestSubmissionClient only supports http > --- > > Key: SPARK-27208 > URL: https://issues.apache.org/jira/browse/SPARK-27208 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.4.0 >Reporter: Jorge Machado >Priority: Minor > > As of now the class RestSubmissionClient does not support https, which > fails, for example, if we run the mesos master with SSL and in cluster mode. > The spark-submit command fails with: Mesos cluster mode is only supported > through the REST submission API > > I created a PR for this which checks whether the given master endpoint can > speak SSL before submitting the command. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
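The fix the reporter describes, choosing https when the master is known to speak TLS, reduces to building the REST endpoint URL from the `mesos://` master string once the probe result is known. A hypothetical sketch (none of these names are Spark's actual API, and the TLS probe itself is elided as a boolean parameter since it is network-dependent):

```java
// Hypothetical helper, not Spark's actual API: derive the REST submission
// endpoint from a mesos:// master URL, using https when the endpoint is
// known to speak TLS (detecting that would require a real network probe).
class RestEndpoint {
    static String restUrl(String master, boolean supportsSsl) {
        String hostPort = master
            .replaceFirst("^mesos://", "")  // strip the mesos scheme
            .replaceFirst("/api$", "");     // drop a trailing /api path
        return (supportsSsl ? "https" : "http") + "://" + hostPort;
    }

    public static void main(String[] args) {
        System.out.println(restUrl("mesos://host:5050/api", true));  // https://host:5050
        System.out.println(restUrl("mesos://host:5050", false));     // http://host:5050
    }
}
```

Keeping the scheme decision in one place like this is what makes a "probe once, then submit" fix small; the probe result only has to flow into a single URL builder.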
[jira] [Commented] (SPARK-27208) RestSubmissionClient only supports http
[ https://issues.apache.org/jira/browse/SPARK-27208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796871#comment-16796871 ] Jorge Machado commented on SPARK-27208: --- Furthermore, I'm getting the following error: ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master mesos://host:5050/api --deploy-mode cluster --conf spark.master.rest.enabled=true --total-executor-cores 4 --jars /home/machjor/spark-2.4.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.4.0.jar /home/machjor/spark-2.4.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.4.0.jar 10
2019-03-18 20:39:31 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-03-18 20:39:31 INFO RestSubmissionClient:54 - Submitting a request to launch an application in mesos://host:5050/api.
2019-03-18 20:39:32 ERROR RestSubmissionClient:70 - Server responded with error: Some(Failed to validate master::Call: Expecting 'type' to be present)
2019-03-18 20:39:32 ERROR RestSubmissionClient:70 - Error: Server responded with message of unexpected type ErrorResponse.
2019-03-18 20:39:32 INFO ShutdownHookManager:54 - Shutdown hook called
2019-03-18 20:39:32 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-259c0e66-c2ab-43b4-90df-
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27208) RestSubmissionClient only supports http
[ https://issues.apache.org/jira/browse/SPARK-27208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796860#comment-16796860 ] Jorge Machado commented on SPARK-27208: --- I would like to take this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27208) RestSubmissionClient only supports http
[ https://issues.apache.org/jira/browse/SPARK-27208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Machado updated SPARK-27208: -- Shepherd: Sean Owen -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27208) RestSubmissionClient only supports http
Jorge Machado created SPARK-27208: - Summary: RestSubmissionClient only supports http Key: SPARK-27208 URL: https://issues.apache.org/jira/browse/SPARK-27208 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 2.4.0 Reporter: Jorge Machado As of now the class RestSubmissionClient does not support https, which fails, for example, if we run the mesos master with SSL and in cluster mode. The spark-submit command fails with: Mesos cluster mode is only supported through the REST submission API. I created a PR for this which checks whether the given master endpoint can speak SSL before submitting the command. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27208) RestSubmissionClient only supports http
[ https://issues.apache.org/jira/browse/SPARK-27208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796544#comment-16796544 ] Jorge Machado commented on SPARK-27208: --- Relates to this as the follow-up error. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26324) Spark submit does not work with Mesos over SSL [Missing docs]
[ https://issues.apache.org/jira/browse/SPARK-26324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16723761#comment-16723761 ] Jorge Machado commented on SPARK-26324: --- [~hyukjin.kwon] I created a PR for this docs > Spark submit does not work with messos over ssl [Missing docs] > -- > > Key: SPARK-26324 > URL: https://issues.apache.org/jira/browse/SPARK-26324 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.4.0 >Reporter: Jorge Machado >Priority: Major > > Hi guys, > I was trying to run the examples on a mesos cluster that uses https. I tried > with rest endpoint: > {code:java} > ./spark-submit --class org.apache.spark.examples.SparkPi --master > mesos://:5050 --conf spark.master.rest.enabled=true > --deploy-mode cluster --supervise --executor-memory 10G > --total-executor-cores 100 ../examples/jars/spark-examples_2.11-2.4.0.jar 1000 > {code} > The error that I get on the host where I started the spark-submit is: > {code:java} > 2018-12-10 15:08:39 WARN NativeCodeLoader:62 - Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > 2018-12-10 15:08:39 INFO RestSubmissionClient:54 - Submitting a request to > launch an application in mesos://:5050. > 2018-12-10 15:08:39 WARN RestSubmissionClient:66 - Unable to connect to > server mesos://:5050. 
> Exception in thread "main" > org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect > to server > at > org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:104) > at > org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:86) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) > at > org.apache.spark.deploy.rest.RestSubmissionClient.createSubmission(RestSubmissionClient.scala:86) > at > org.apache.spark.deploy.rest.RestSubmissionClientApp.run(RestSubmissionClient.scala:443) > at > org.apache.spark.deploy.rest.RestSubmissionClientApp.start(RestSubmissionClient.scala:455) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849) > at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) > at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable > to connect to server > at > org.apache.spark.deploy.rest.RestSubmissionClient.readResponse(RestSubmissionClient.scala:281) > at > org.apache.spark.deploy.rest.RestSubmissionClient.org$apache$spark$deploy$rest$RestSubmissionClient$$postJson(RestSubmissionClient.scala:225) > at > 
org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:90) > ... 15 more > Caused by: java.net.SocketException: Connection reset > {code} > I'm pretty sure this is because of the hardcoded http:// here: > > > {code:java} > RestSubmissionClient.scala > /** Return the base URL for communicating with the server, including the > protocol version. */ > private def getBaseUrl(master: String): String = { > var masterUrl = master > supportedMasterPrefixes.foreach { prefix => > if (master.startsWith(prefix)) { > masterUrl = master.stripPrefix(prefix) > } > } > masterUrl = masterUrl.stripSuffix("/") > s"http://$masterUrl/$PROTOCOL_VERSION/submissions; <--- hardcoded http > } > {code} > Then I tried without the _--deploy-mode cluster_ and I get: > {code:java} > ./spark-submit --class org.apache.spark.examples.SparkPi --master > mesos://:5050 --supervise --executor-memory 10G > --total-executor-cores 100 ../examples/jars/spark-examples_2.11-2.4.0.jar 1000 > {code} > On the spark console I get: > {code:java} > 2018-12-10 15:01:05 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started > at http://_host:4040 > 2018-12-10 15:01:05 INFO SparkContext:54 - Added JAR > file:/home//spark-2.4.0-bin-hadoop2.7/bin/../examples/jars/spark-examples_2.11-2.4.0.jar > at spark://_host:35719/jars/spark-examples_2.11-2.4.0.jar with timestamp > 1544450465799 > I1210 15:01:05.963078 37943
[jira] [Updated] (SPARK-26324) Spark submit does not work with Mesos over SSL [Missing docs]
[ https://issues.apache.org/jira/browse/SPARK-26324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Machado updated SPARK-26324: -- Summary: Spark submit does not work with messos over ssl [Missing docs] (was: Spark submit does not work with messos over ssl) > Spark submit does not work with messos over ssl [Missing docs] > -- > > Key: SPARK-26324 > URL: https://issues.apache.org/jira/browse/SPARK-26324 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.4.0 >Reporter: Jorge Machado >Priority: Major > > Hi guys, > I was trying to run the examples on a mesos cluster that uses https. I tried > with rest endpoint: > > {code:java} > ./spark-submit --class org.apache.spark.examples.SparkPi --master > mesos://:5050 --conf spark.master.rest.enabled=true > --deploy-mode cluster --supervise --executor-memory 10G > --total-executor-cores 100 ../examples/jars/spark-examples_2.11-2.4.0.jar 1000 > {code} > The error that I get on the host where I started the spark-submit is: > > > {code:java} > 2018-12-10 15:08:39 WARN NativeCodeLoader:62 - Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > 2018-12-10 15:08:39 INFO RestSubmissionClient:54 - Submitting a request to > launch an application in mesos://:5050. > 2018-12-10 15:08:39 WARN RestSubmissionClient:66 - Unable to connect to > server mesos://:5050. 
> Exception in thread "main" > org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect > to server > at > org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:104) > at > org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:86) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) > at > org.apache.spark.deploy.rest.RestSubmissionClient.createSubmission(RestSubmissionClient.scala:86) > at > org.apache.spark.deploy.rest.RestSubmissionClientApp.run(RestSubmissionClient.scala:443) > at > org.apache.spark.deploy.rest.RestSubmissionClientApp.start(RestSubmissionClient.scala:455) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849) > at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) > at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable > to connect to server > at > org.apache.spark.deploy.rest.RestSubmissionClient.readResponse(RestSubmissionClient.scala:281) > at > org.apache.spark.deploy.rest.RestSubmissionClient.org$apache$spark$deploy$rest$RestSubmissionClient$$postJson(RestSubmissionClient.scala:225) > at > 
org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:90) > ... 15 more > Caused by: java.net.SocketException: Connection reset > {code} > I'm pretty sure this is because of the hardcoded http:// here: > > > {code:java} > RestSubmissionClient.scala > /** Return the base URL for communicating with the server, including the > protocol version. */ > private def getBaseUrl(master: String): String = { > var masterUrl = master > supportedMasterPrefixes.foreach { prefix => > if (master.startsWith(prefix)) { > masterUrl = master.stripPrefix(prefix) > } > } > masterUrl = masterUrl.stripSuffix("/") > s"http://$masterUrl/$PROTOCOL_VERSION/submissions; <--- hardcoded http > } > {code} > > Then I tried without the _--deploy-mode cluster_ and I get: > > {code:java} > ./spark-submit --class org.apache.spark.examples.SparkPi --master > mesos://:5050 --supervise --executor-memory 10G > --total-executor-cores 100 ../examples/jars/spark-examples_2.11-2.4.0.jar 1000 > {code} > > On the spark console I get: > > {code:java} > 2018-12-10 15:01:05 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started > at http://_host:4040 > 2018-12-10 15:01:05 INFO SparkContext:54 - Added JAR > file:/home//spark-2.4.0-bin-hadoop2.7/bin/../examples/jars/spark-examples_2.11-2.4.0.jar > at
[jira] [Updated] (SPARK-26324) Spark submit does not work with Mesos over SSL [Missing docs]
[ https://issues.apache.org/jira/browse/SPARK-26324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Machado updated SPARK-26324: -- Description: Hi guys, I was trying to run the examples on a mesos cluster that uses https. I tried with rest endpoint: {code:java} ./spark-submit --class org.apache.spark.examples.SparkPi --master mesos://:5050 --conf spark.master.rest.enabled=true --deploy-mode cluster --supervise --executor-memory 10G --total-executor-cores 100 ../examples/jars/spark-examples_2.11-2.4.0.jar 1000 {code} The error that I get on the host where I started the spark-submit is: {code:java} 2018-12-10 15:08:39 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2018-12-10 15:08:39 INFO RestSubmissionClient:54 - Submitting a request to launch an application in mesos://:5050. 2018-12-10 15:08:39 WARN RestSubmissionClient:66 - Unable to connect to server mesos://:5050. Exception in thread "main" org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect to server at org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:104) at org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:86) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) at org.apache.spark.deploy.rest.RestSubmissionClient.createSubmission(RestSubmissionClient.scala:86) at org.apache.spark.deploy.rest.RestSubmissionClientApp.run(RestSubmissionClient.scala:443) at org.apache.spark.deploy.rest.RestSubmissionClientApp.start(RestSubmissionClient.scala:455) at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect to server at org.apache.spark.deploy.rest.RestSubmissionClient.readResponse(RestSubmissionClient.scala:281) at org.apache.spark.deploy.rest.RestSubmissionClient.org$apache$spark$deploy$rest$RestSubmissionClient$$postJson(RestSubmissionClient.scala:225) at org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:90) ... 15 more Caused by: java.net.SocketException: Connection reset {code} I'm pretty sure this is because of the hardcoded http:// here: {code:java} RestSubmissionClient.scala /** Return the base URL for communicating with the server, including the protocol version. 
*/ private def getBaseUrl(master: String): String = { var masterUrl = master supportedMasterPrefixes.foreach { prefix => if (master.startsWith(prefix)) { masterUrl = master.stripPrefix(prefix) } } masterUrl = masterUrl.stripSuffix("/") s"http://$masterUrl/$PROTOCOL_VERSION/submissions" <--- hardcoded http } {code} Then I tried without the _--deploy-mode cluster_ and I get: {code:java} ./spark-submit --class org.apache.spark.examples.SparkPi --master mesos://:5050 --supervise --executor-memory 10G --total-executor-cores 100 ../examples/jars/spark-examples_2.11-2.4.0.jar 1000 {code} On the spark console I get: {code:java} 2018-12-10 15:01:05 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://_host:4040 2018-12-10 15:01:05 INFO SparkContext:54 - Added JAR file:/home//spark-2.4.0-bin-hadoop2.7/bin/../examples/jars/spark-examples_2.11-2.4.0.jar at spark://_host:35719/jars/spark-examples_2.11-2.4.0.jar with timestamp 1544450465799 I1210 15:01:05.963078 37943 sched.cpp:232] Version: 1.3.2 I1210 15:01:05.966814 37911 sched.cpp:336] New master detected at master@53.54.195.251:5050 I1210 15:01:05.967010 37911 sched.cpp:352] No credentials provided. Attempting to register without authentication E1210 15:01:05.967347 37942 process.cpp:2455] Failed to shutdown socket with fd 307, address 53.54.195.251:45206: Transport endpoint is not connected E1210 15:01:05.968212 37942 process.cpp:2369] Failed to shutdown socket with fd 307, address 53.54.195.251:45212: Transport endpoint is not connected E1210 15:01:05.969405 37942 process.cpp:2455] Failed to shutdown socket with fd 307, address 53.54.195.251:45222: Transport endpoint is not
[jira] [Updated] (SPARK-26324) Spark submit does not work with Mesos over SSL
[ https://issues.apache.org/jira/browse/SPARK-26324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Machado updated SPARK-26324: -- Description: Hi guys, I was trying to run the examples on a mesos cluster that uses https. I tried with rest endpoint: {code:java} ./spark-submit --class org.apache.spark.examples.SparkPi --master mesos://:5050 --conf spark.master.rest.enabled=true --deploy-mode cluster --supervise --executor-memory 10G --total-executor-cores 100 ../examples/jars/spark-examples_2.11-2.4.0.jar 1000 {code} The error that I get on the host where I started the spark-submit is: {code:java} 2018-12-10 15:08:39 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2018-12-10 15:08:39 INFO RestSubmissionClient:54 - Submitting a request to launch an application in mesos://:5050. 2018-12-10 15:08:39 WARN RestSubmissionClient:66 - Unable to connect to server mesos://:5050. Exception in thread "main" org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect to server at org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:104) at org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:86) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) at org.apache.spark.deploy.rest.RestSubmissionClient.createSubmission(RestSubmissionClient.scala:86) at org.apache.spark.deploy.rest.RestSubmissionClientApp.run(RestSubmissionClient.scala:443) at org.apache.spark.deploy.rest.RestSubmissionClientApp.start(RestSubmissionClient.scala:455) at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect to server at org.apache.spark.deploy.rest.RestSubmissionClient.readResponse(RestSubmissionClient.scala:281) at org.apache.spark.deploy.rest.RestSubmissionClient.org$apache$spark$deploy$rest$RestSubmissionClient$$postJson(RestSubmissionClient.scala:225) at org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:90) ... 15 more Caused by: java.net.SocketException: Connection reset {code} I'm pretty sure this is because of the hardcoded http:// here: {code:java} RestSubmissionClient.scala /** Return the base URL for communicating with the server, including the protocol version. 
*/ private def getBaseUrl(master: String): String = { var masterUrl = master supportedMasterPrefixes.foreach { prefix => if (master.startsWith(prefix)) { masterUrl = master.stripPrefix(prefix) } } masterUrl = masterUrl.stripSuffix("/") s"http://$masterUrl/$PROTOCOL_VERSION/submissions; <--- hardcoded http } {code} Then I tried without the _--deploy-mode cluster_ and I get: {code:java} ./spark-submit --class org.apache.spark.examples.SparkPi --master mesos://:5050 --supervise --executor-memory 10G --total-executor-cores 100 ../examples/jars/spark-examples_2.11-2.4.0.jar 1000 {code} On the spark console I get: {code:java} 2018-12-10 15:01:05 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://_host:4040 2018-12-10 15:01:05 INFO SparkContext:54 - Added JAR file:/home//spark-2.4.0-bin-hadoop2.7/bin/../examples/jars/spark-examples_2.11-2.4.0.jar at spark://_host:35719/jars/spark-examples_2.11-2.4.0.jar with timestamp 1544450465799 I1210 15:01:05.963078 37943 sched.cpp:232] Version: 1.3.2 I1210 15:01:05.966814 37911 sched.cpp:336] New master detected at master@53.54.195.251:5050 I1210 15:01:05.967010 37911 sched.cpp:352] No credentials provided. Attempting to register without authentication E1210 15:01:05.967347 37942 process.cpp:2455] Failed to shutdown socket with fd 307, address 53.54.195.251:45206: Transport endpoint is not connected E1210 15:01:05.968212 37942 process.cpp:2369] Failed to shutdown socket with fd 307, address 53.54.195.251:45212: Transport endpoint is not connected E1210 15:01:05.969405 37942 process.cpp:2455] Failed to shutdown socket with fd 307, address 53.54.195.251:45222:
[jira] [Created] (SPARK-26324) Spark submit does not work with Mesos over SSL
Jorge Machado created SPARK-26324: - Summary: Spark submit does not work with Mesos over SSL Key: SPARK-26324 URL: https://issues.apache.org/jira/browse/SPARK-26324 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.4.0 Reporter: Jorge Machado Hi guys, I was trying to run the examples on a Mesos cluster that uses https. I tried with the REST endpoint: {code:java} ./spark-submit --class org.apache.spark.examples.SparkPi --master mesos://:5050 --conf spark.master.rest.enabled=true --deploy-mode cluster --supervise --executor-memory 10G --total-executor-cores 100 ../examples/jars/spark-examples_2.11-2.4.0.jar 1000 {code} The error that I get on the host where I started spark-submit is: {code:java} 2018-12-10 15:08:39 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2018-12-10 15:08:39 INFO RestSubmissionClient:54 - Submitting a request to launch an application in mesos://:5050. 2018-12-10 15:08:39 WARN RestSubmissionClient:66 - Unable to connect to server mesos://:5050. 
Exception in thread "main" org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect to server at org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:104) at org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:86) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) at org.apache.spark.deploy.rest.RestSubmissionClient.createSubmission(RestSubmissionClient.scala:86) at org.apache.spark.deploy.rest.RestSubmissionClientApp.run(RestSubmissionClient.scala:443) at org.apache.spark.deploy.rest.RestSubmissionClientApp.start(RestSubmissionClient.scala:455) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect to server at org.apache.spark.deploy.rest.RestSubmissionClient.readResponse(RestSubmissionClient.scala:281) at org.apache.spark.deploy.rest.RestSubmissionClient.org$apache$spark$deploy$rest$RestSubmissionClient$$postJson(RestSubmissionClient.scala:225) at org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:90) ... 
15 more Caused by: java.net.SocketException: Connection reset {code} I'm pretty sure this is because of the hardcoded http:// here: {code:java} RestSubmissionClient.scala /** Return the base URL for communicating with the server, including the protocol version. */ private def getBaseUrl(master: String): String = { var masterUrl = master supportedMasterPrefixes.foreach { prefix => if (master.startsWith(prefix)) { masterUrl = master.stripPrefix(prefix) } } masterUrl = masterUrl.stripSuffix("/") s"http://$masterUrl/$PROTOCOL_VERSION/submissions" <--- hardcoded http } {code} Then I tried without the _--deploy-mode cluster_ and I get: {code:java} ./spark-submit --class org.apache.spark.examples.SparkPi --master mesos://:5050 --supervise --executor-memory 10G --total-executor-cores 100 ../examples/jars/spark-examples_2.11-2.4.0.jar 1000 {code} On the spark console I get: {code:java} 2018-12-10 15:01:05 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://_host:4040 2018-12-10 15:01:05 INFO SparkContext:54 - Added JAR file:/home//spark-2.4.0-bin-hadoop2.7/bin/../examples/jars/spark-examples_2.11-2.4.0.jar at spark://_host:35719/jars/spark-examples_2.11-2.4.0.jar with timestamp 1544450465799 I1210 15:01:05.963078 37943 sched.cpp:232] Version: 1.3.2 I1210 15:01:05.966814 37911 sched.cpp:336] New master detected at master@53.54.195.251:5050 I1210 15:01:05.967010 37911 sched.cpp:352] No credentials provided. Attempting to register without authentication E1210 15:01:05.967347 37942 process.cpp:2455] Failed to shutdown socket with fd 307, address 53.54.195.251:45206: Transport endpoint is not connected E1210 15:01:05.968212 37942 process.cpp:2369] Failed to shutdown socket with fd 307,
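For illustration only, here is a minimal sketch of the kind of fix the report suggests: derive the scheme from configuration instead of hardcoding `http://`. This is a hypothetical Java rendering; the real `getBaseUrl` lives in RestSubmissionClient.scala and is written in Scala, and the `PROTOCOL_VERSION` value `"v1"` below is my own assumption:

```java
public class BaseUrl {
    static final String[] SUPPORTED_MASTER_PREFIXES = { "spark://", "mesos://" };
    static final String PROTOCOL_VERSION = "v1"; // assumed value for illustration

    // Mirrors getBaseUrl from the issue, but takes the scheme as a parameter
    // instead of hardcoding "http".
    public static String getBaseUrl(String master, boolean useSsl) {
        String masterUrl = master;
        for (String prefix : SUPPORTED_MASTER_PREFIXES) {
            if (master.startsWith(prefix)) {
                masterUrl = master.substring(prefix.length());
            }
        }
        if (masterUrl.endsWith("/")) {
            masterUrl = masterUrl.substring(0, masterUrl.length() - 1);
        }
        String scheme = useSsl ? "https" : "http";
        return scheme + "://" + masterUrl + "/" + PROTOCOL_VERSION + "/submissions";
    }
}
```

For example, `getBaseUrl("mesos://host:5050", true)` would yield `https://host:5050/v1/submissions`, while `useSsl = false` keeps today's plain-http behavior.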
[jira] [Commented] (SPARK-19320) Allow guaranteed amount of GPU to be used when launching jobs
[ https://issues.apache.org/jira/browse/SPARK-19320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714697#comment-16714697 ] Jorge Machado commented on SPARK-19320: --- Is this still active? And what about Kubernetes? > Allow guaranteed amount of GPU to be used when launching jobs > - > > Key: SPARK-19320 > URL: https://issues.apache.org/jira/browse/SPARK-19320 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Timothy Chen >Priority: Major > > Currently the only configuration for using GPUs with Mesos is setting the > maximum amount of GPUs a job will take from an offer, but it doesn't guarantee > exactly how many. > We should have a configuration that sets this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20055) Documentation for CSV datasets in SQL programming guide
[ https://issues.apache.org/jira/browse/SPARK-20055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16659253#comment-16659253 ] Jorge Machado commented on SPARK-20055: --- This was handled in another ticket. This can be closed. > Documentation for CSV datasets in SQL programming guide > --- > > Key: SPARK-20055 > URL: https://issues.apache.org/jira/browse/SPARK-20055 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > I guess things commonly used and important are documented there rather than > documenting everything and every option in the programming guide - > http://spark.apache.org/docs/latest/sql-programming-guide.html. > It seems JSON datasets > http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets > are documented whereas CSV datasets are not. > Nowadays, they are pretty similar in APIs and options. Some options are > notable for both, In particular, ones such as {{wholeFile}}. Moreover, > several options such as {{inferSchema}} and {{header}} are important in CSV > that affect the type/column name of data. > In that sense, I think we might better document CSV datasets with some > examples too because I believe reading CSV is pretty much common use cases. > Also, I think we could also leave some pointers for options of API > documentations for both (rather than duplicating the documentation). > So, my suggestion is, > - Add CSV Datasets section. > - Add links for options for both JSON and CSV that point each API > documentation > - Fix trivial minor fixes together in both sections. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24401) Aggregate on Decimal Types does not work
[ https://issues.apache.org/jira/browse/SPARK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Machado updated SPARK-24401: -- Attachment: (was: testDF.parquet) > Aggregate on Decimal Types does not work > --- > > Key: SPARK-24401 > URL: https://issues.apache.org/jira/browse/SPARK-24401 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.0 > Reporter: Jorge Machado > Priority: Major > Fix For: 2.3.0 > > > Hi, > I think I found a really ugly bug in Spark when performing aggregations with > Decimals. > To reproduce: > > {code:java} > val df = spark.read.parquet("attached file") > val first_agg = df.groupBy("id1", "id2", "start_date").agg(mean("projection_factor").alias("projection_factor")) > first_agg.show > val second_agg = first_agg.groupBy("id1", "id2").agg(max("projection_factor").alias("maxf"), min("projection_factor").alias("minf")) > second_agg.show > {code} > The first aggregation works fine; the second aggregation seems to be summing > instead of taking the max value. I tried with Spark 2.2.0 and 2.3.0, same problem. > The dataset has circa 800 rows and projection_factor has values from 0 > to 100. The result should not be bigger than 5, but we get > 265820543091454 back.
> > > The code below is not 100% the same, but I think there is really a bug here: > > {code:java} > BigDecimal[] objects = new BigDecimal[]{ > new BigDecimal(3.5714285714D), > new BigDecimal(3.5714285714D), > new BigDecimal(3.5714285714D), > new BigDecimal(3.5714285714D)}; > Row dataRow = new GenericRow(objects); > Row dataRow2 = new GenericRow(objects); > StructType structType = new StructType() > .add("id1", DataTypes.createDecimalType(38, 10), true) > .add("id2", DataTypes.createDecimalType(38, 10), true) > .add("id3", DataTypes.createDecimalType(38, 10), true) > .add("id4", DataTypes.createDecimalType(38, 10), true); > final Dataset<Row> dataFrame = sparkSession.createDataFrame(Arrays.asList(dataRow, dataRow2), structType); > System.out.println(dataFrame.schema()); > dataFrame.show(); > final Dataset<Row> df1 = dataFrame.groupBy("id1", "id2") > .agg(mean("id3").alias("projection_factor")); > df1.show(); > final Dataset<Row> df2 = df1 > .groupBy("id1") > .agg(max("projection_factor")); > df2.show(); > {code} > > df2 should have: > {code:java} > +------------+----------------------+ > | id1|max(projection_factor)| > +------------+----------------------+ > |3.5714285714| 3.5714285714| > +------------+----------------------+ > {code} > instead it returns: > {code:java} > +------------+----------------------+ > | id1|max(projection_factor)| > +------------+----------------------+ > |3.5714285714| 0.00035714285714| > +------------+----------------------+ > {code}
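The corrupted result in the tables above is exactly the correct value divided by 10^4 (0.00035714285714 vs. 3.5714285714), which points at a decimal scale-bookkeeping error rather than a genuinely wrong max. The following is a hypothetical sketch, not Spark's actual code path: it shows how keeping a BigDecimal's unscaled value while applying the scale widening twice reproduces precisely the reported number. The class name and the double-widening step are assumptions for illustration.

```java
import java.math.BigDecimal;

public class DecimalScaleSketch {
    public static void main(String[] args) {
        // The projection_factor value from the report, with scale 10
        // as in DecimalType(38,10).
        BigDecimal v = new BigDecimal("3.5714285714");

        // Spark's mean() over DecimalType(38,10) widens the result scale
        // by 4 (to DecimalType(38,14)); the unscaled value grows by 10^4.
        BigDecimal mean = v.setScale(14);
        System.out.println(mean.unscaledValue()); // 357142857140000

        // Hypothetical bug: a downstream operator keeps that unscaled value
        // but applies the +4 widening to the scale a second time (14 + 4 = 18),
        // silently shrinking the value by a factor of 10^4.
        BigDecimal corrupted = new BigDecimal(mean.unscaledValue(), 18);
        System.out.println(corrupted.toPlainString()); // 0.000357142857140000
    }
}
```

Whether the real defect sat at exactly this point is not verified here; per the rest of the thread, the behavior is gone in Spark 2.3.0.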
[jira] [Resolved] (SPARK-24401) Aggregate on Decimal Types does not work
[ https://issues.apache.org/jira/browse/SPARK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Machado resolved SPARK-24401. --- Resolution: Fixed. Was already fixed in 2.3.0.
[jira] [Updated] (SPARK-24401) Aggregate on Decimal Types does not work
[ https://issues.apache.org/jira/browse/SPARK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Machado updated SPARK-24401: -- Affects Version/s: (was: 2.3.0) Fix Version/s: 2.3.0
[jira] [Commented] (SPARK-24401) Aggregate on Decimal Types does not work
[ https://issues.apache.org/jira/browse/SPARK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493118#comment-16493118 ] Jorge Machado commented on SPARK-24401: --- Yes, I just tried my code and it works on Spark 2.3.0.
[jira] [Updated] (SPARK-24401) Aggregate on Decimal Types does not work
[ https://issues.apache.org/jira/browse/SPARK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Machado updated SPARK-24401: -- Description: (edited)
[jira] [Updated] (SPARK-24401) Aggregate on Decimal Types does not work
[ https://issues.apache.org/jira/browse/SPARK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Machado updated SPARK-24401: -- Description: (edited)
[jira] [Updated] (SPARK-24401) Aggregate on Decimal Types does not work
[ https://issues.apache.org/jira/browse/SPARK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Machado updated SPARK-24401: -- Description: (edited)
[jira] [Updated] (SPARK-24401) Aggregate on Decimal Types does not work
[ https://issues.apache.org/jira/browse/SPARK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Machado updated SPARK-24401: -- Description: (edited)
[jira] [Updated] (SPARK-24401) Aggregate on Decimal Types does not work
[ https://issues.apache.org/jira/browse/SPARK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Machado updated SPARK-24401: -- Attachment: testDF.parquet
[jira] [Updated] (SPARK-24401) Aggregate on Decimal Types does not work
[ https://issues.apache.org/jira/browse/SPARK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Machado updated SPARK-24401: -- Attachment: (was: testDF.parquet)
[jira] [Commented] (SPARK-24401) Aggregate on Decimal Types does not work
[ https://issues.apache.org/jira/browse/SPARK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492326#comment-16492326 ] Jorge Machado commented on SPARK-24401: --- Probably worth mentioning that my dataset comes from an Oracle DB.
[jira] [Created] (SPARK-24401) Aggregate on Decimal Types does not work
Jorge Machado created SPARK-24401: - Summary: Aggregate on Decimal Types does not work Key: SPARK-24401 URL: https://issues.apache.org/jira/browse/SPARK-24401 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0, 2.2.0 Reporter: Jorge Machado Attachments: testDF.parquet Hi, I think I found a really ugly bug in Spark when performing aggregations with Decimals. To reproduce: {code:java} val df = spark.read.parquet("attached file") val first_agg = df.groupBy("id1", "id2", "start_date").agg(mean("projection_factor").alias("projection_factor")) first_agg.show val second_agg = first_agg.groupBy("id1", "id2").agg(max("projection_factor").alias("maxf"), min("projection_factor").alias("minf")) second_agg.show {code} The first aggregation works fine; the second aggregation seems to be summing instead of taking the max value. I tried with Spark 2.2.0 and 2.3.0, same problem.
[jira] [Updated] (SPARK-24401) Aggregate on Decimal Types does not work
[ https://issues.apache.org/jira/browse/SPARK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Machado updated SPARK-24401: -- Attachment: testDF.parquet
[jira] [Issue Comment Deleted] (SPARK-17592) SQL: CAST string as INT inconsistent with Hive
[ https://issues.apache.org/jira/browse/SPARK-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Machado updated SPARK-17592: -- Comment: was deleted (was: I'm hitting the same issue I'm afraid but in slightly another way. When I have a dataframe (that comes from oracle DB ) as parquet I can see in the logs that a field is beeing saved as integer : { "type" : "struct", "fields" : [ \{ "name" : "project_id", "type" : "integer", "nullable" : true, "metadata" : { } },... on hue (which reads from hive) I see : !image-2018-05-24-17-10-24-515.png!) > SQL: CAST string as INT inconsistent with Hive > -- > > Key: SPARK-17592 > URL: https://issues.apache.org/jira/browse/SPARK-17592 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Furcy Pin >Priority: Major > Attachments: image-2018-05-24-17-10-24-515.png > > > Hello, > there seem to be an inconsistency between Spark and Hive when casting a > string into an Int. > With Hive: > {code} > select cast("0.4" as INT) ; > > 0 > select cast("0.5" as INT) ; > > 0 > select cast("0.6" as INT) ; > > 0 > {code} > With Spark-SQL: > {code} > select cast("0.4" as INT) ; > > 0 > select cast("0.5" as INT) ; > > 1 > select cast("0.6" as INT) ; > > 1 > {code} > Hive seems to perform a floor(string.toDouble), while Spark seems to perform > a round(string.toDouble) > I'm not sure there is any ISO standard for this, mysql has the same behavior > than Hive, while postgresql performs a string.toInt and throws an > NumberFormatException > Personnally I think Hive is right, hence my posting this here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17592) SQL: CAST string as INT inconsistent with Hive
[ https://issues.apache.org/jira/browse/SPARK-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489185#comment-16489185 ] Jorge Machado edited comment on SPARK-17592 at 5/24/18 3:11 PM: I'm hitting the same issue I'm afraid but in slightly another way. When I have a dataframe (that comes from oracle DB ) as parquet I can see in the logs that a field is beeing saved as integer : { "type" : "struct", "fields" : [ \{ "name" : "project_id", "type" : "integer", "nullable" : true, "metadata" : { } },... on hue (which reads from hive) I see : !image-2018-05-24-17-10-24-515.png! was (Author: jomach): I'm hitting the same issue I'm afraid but in slightly another way. When I have a dataframe as parquet I can see in the logs that a field is beeing saved as integer : { "type" : "struct", "fields" : [ \{ "name" : "project_id", "type" : "integer", "nullable" : true, "metadata" : { } },... on hue (which reads from hive) I see : !image-2018-05-24-17-10-24-515.png! > SQL: CAST string as INT inconsistent with Hive > -- > > Key: SPARK-17592 > URL: https://issues.apache.org/jira/browse/SPARK-17592 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Furcy Pin >Priority: Major > Attachments: image-2018-05-24-17-10-24-515.png > > > Hello, > there seem to be an inconsistency between Spark and Hive when casting a > string into an Int. 
> With Hive: > {code} > select cast("0.4" as INT) ; > > 0 > select cast("0.5" as INT) ; > > 0 > select cast("0.6" as INT) ; > > 0 > {code} > With Spark-SQL: > {code} > select cast("0.4" as INT) ; > > 0 > select cast("0.5" as INT) ; > > 1 > select cast("0.6" as INT) ; > > 1 > {code} > Hive seems to perform a floor(string.toDouble), while Spark seems to perform > a round(string.toDouble). > I'm not sure there is any ISO standard for this; mysql has the same behavior > as Hive, while postgresql performs a string.toInt and throws a > NumberFormatException. > Personally, I think Hive is right, hence my posting this here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17592) SQL: CAST string as INT inconsistent with Hive
[ https://issues.apache.org/jira/browse/SPARK-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Machado updated SPARK-17592: -- Attachment: image-2018-05-24-17-10-24-515.png > SQL: CAST string as INT inconsistent with Hive > -- > > Key: SPARK-17592 > URL: https://issues.apache.org/jira/browse/SPARK-17592 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Furcy Pin >Priority: Major > Attachments: image-2018-05-24-17-10-24-515.png > > > Hello, > there seem to be an inconsistency between Spark and Hive when casting a > string into an Int. > With Hive: > {code} > select cast("0.4" as INT) ; > > 0 > select cast("0.5" as INT) ; > > 0 > select cast("0.6" as INT) ; > > 0 > {code} > With Spark-SQL: > {code} > select cast("0.4" as INT) ; > > 0 > select cast("0.5" as INT) ; > > 1 > select cast("0.6" as INT) ; > > 1 > {code} > Hive seems to perform a floor(string.toDouble), while Spark seems to perform > a round(string.toDouble) > I'm not sure there is any ISO standard for this, mysql has the same behavior > than Hive, while postgresql performs a string.toInt and throws an > NumberFormatException > Personnally I think Hive is right, hence my posting this here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17592) SQL: CAST string as INT inconsistent with Hive
[ https://issues.apache.org/jira/browse/SPARK-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489185#comment-16489185 ] Jorge Machado commented on SPARK-17592: --- I'm hitting the same issue, I'm afraid, but in a slightly different way. When I save a dataframe as parquet I can see in the logs that a field is being saved as integer: { "type" : "struct", "fields" : [ \{ "name" : "project_id", "type" : "integer", "nullable" : true, "metadata" : { } },... whereas on Hue (which reads from Hive) I see: !image-2018-05-24-17-10-24-515.png! > SQL: CAST string as INT inconsistent with Hive > -- > > Key: SPARK-17592 > URL: https://issues.apache.org/jira/browse/SPARK-17592 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Furcy Pin >Priority: Major > Attachments: image-2018-05-24-17-10-24-515.png > > > Hello, > there seems to be an inconsistency between Spark and Hive when casting a > string into an Int. > With Hive: > {code} > select cast("0.4" as INT) ; > > 0 > select cast("0.5" as INT) ; > > 0 > select cast("0.6" as INT) ; > > 0 > {code} > With Spark-SQL: > {code} > select cast("0.4" as INT) ; > > 0 > select cast("0.5" as INT) ; > > 1 > select cast("0.6" as INT) ; > > 1 > {code} > Hive seems to perform a floor(string.toDouble), while Spark seems to perform > a round(string.toDouble). > I'm not sure there is any ISO standard for this; mysql has the same behavior > as Hive, while postgresql performs a string.toInt and throws a > NumberFormatException. > Personally, I think Hive is right, hence my posting this here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
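To make the difference concrete outside of SQL, here is a small plain-Python sketch of the two behaviors as the ticket describes them (floor for Hive/MySQL, round-half-up for Spark 2.0). The function names are made up for illustration:

```python
import math

def hive_cast_int(s: str) -> int:
    # Hive / MySQL behavior as described in the ticket: floor of the double value
    return math.floor(float(s))

def spark2_cast_int(s: str) -> int:
    # Spark 2.0 behavior as described: round half up
    return math.floor(float(s) + 0.5)

for s in ("0.4", "0.5", "0.6"):
    print(s, hive_cast_int(s), spark2_cast_int(s))
# "0.4" -> 0 / 0, "0.5" -> 0 / 1, "0.6" -> 0 / 1
```

The two functions only disagree once the fractional part reaches .5, which is exactly the divergence shown in the {code} blocks above.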
[jira] [Commented] (SPARK-20055) Documentation for CSV datasets in SQL programming guide
[ https://issues.apache.org/jira/browse/SPARK-20055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194129#comment-16194129 ] Jorge Machado commented on SPARK-20055: --- [~aash] Should I copy paste that options ? And there is some docs already {code:java} def csv(paths: String*): DataFrame Permalink Loads CSV files and returns the result as a DataFrame. This function will go through the input once to determine the input schema if inferSchema is enabled. To avoid going through the entire data once, disable inferSchema option or specify the schema explicitly using schema. You can set the following CSV-specific options to deal with CSV files: sep (default ,): sets the single character as a separator for each field and value. encoding (default UTF-8): decodes the CSV files by the given encoding type. quote (default "): sets the single character used for escaping quoted values where the separator can be part of the value. If you would like to turn off quotations, you need to set not null but an empty string. This behaviour is different from com.databricks.spark.csv. escape (default \): sets the single character used for escaping quotes inside an already quoted value. comment (default empty string): sets the single character used for skipping lines beginning with this character. By default, it is disabled. header (default false): uses the first line as names of columns. inferSchema (default false): infers the input schema automatically from data. It requires one extra pass over the data. ignoreLeadingWhiteSpace (default false): a flag indicating whether or not leading whitespaces from values being read should be skipped. ignoreTrailingWhiteSpace (default false): a flag indicating whether or not trailing whitespaces from values being read should be skipped. nullValue (default empty string): sets the string representation of a null value. Since 2.0.1, this applies to all supported types including the string type. 
nanValue (default NaN): sets the string representation of a non-number" value. positiveInf (default Inf): sets the string representation of a positive infinity value. negativeInf (default -Inf): sets the string representation of a negative infinity value. dateFormat (default -MM-dd): sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type. timestampFormat (default -MM-dd'T'HH:mm:ss.SSSXXX): sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type. maxColumns (default 20480): defines a hard limit of how many columns a record can have. maxCharsPerColumn (default -1): defines the maximum number of characters allowed for any given value being read. By default, it is -1 meaning unlimited length mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing. It supports the following case-insensitive modes. PERMISSIVE : sets other fields to null when it meets a corrupted record, and puts the malformed string into a field configured by columnNameOfCorruptRecord. To keep corrupt records, an user can set a string type field named columnNameOfCorruptRecord in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When a length of parsed CSV tokens is shorter than an expected length of a schema, it sets null for extra fields. DROPMALFORMED : ignores the whole corrupted records. FAILFAST : throws an exception when it meets corrupted records. columnNameOfCorruptRecord (default is the value specified in spark.sql.columnNameOfCorruptRecord): allows renaming the new field having malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord. multiLine (default false): parse one record, which may span multiple lines. 
{code} > Documentation for CSV datasets in SQL programming guide > --- > > Key: SPARK-20055 > URL: https://issues.apache.org/jira/browse/SPARK-20055 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon > > I guess things commonly used and important are documented there rather than > documenting everything and every option in the programming guide - > http://spark.apache.org/docs/latest/sql-programming-guide.html. > It seems JSON datasets > http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets > are documented whereas CSV datasets are not. > Nowadays, they are pretty similar in APIs and options. Some options are > notable for both, in particular ones such as {{wholeFile}}. Moreover, > several options such as {{inferSchema}} and {{header}} are important in CSV > that affect the type/column name of data. > In that sense, I think we might better document CSV datasets with some > examples too because I believe reading CSV is a pretty common use case. > Also, I think we could leave some pointers to the options in the API > documentation for both (rather than duplicating the documentation). > So, my suggestion is: > - Add a CSV Datasets section. > - Add links for options for both JSON and CSV that point to each API > documentation > - Fix trivial minor fixes together in both sections. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
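The three parse modes quoted above (PERMISSIVE, DROPMALFORMED, FAILFAST) can be illustrated with a toy parser. This is a simplified plain-Python sketch, not Spark's actual CSV reader, and it ignores quoting and escaping entirely:

```python
def parse_csv_lines(lines, n_cols, mode="PERMISSIVE", sep=","):
    # Toy illustration of the corrupt-record modes described above.
    rows = []
    for line in lines:
        tokens = line.split(sep)
        if len(tokens) == n_cols:
            rows.append(tokens)
        elif mode == "PERMISSIVE":
            # keep the record, setting missing trailing fields to None
            rows.append((tokens + [None] * n_cols)[:n_cols])
        elif mode == "DROPMALFORMED":
            continue  # silently skip the corrupted record
        elif mode == "FAILFAST":
            raise ValueError(f"malformed record: {line!r}")
    return rows

print(parse_csv_lines(["a,b,c", "x,y"], 3))  # [['a', 'b', 'c'], ['x', 'y', None]]
```

Real Spark additionally stashes the raw malformed line into the columnNameOfCorruptRecord field in PERMISSIVE mode; the sketch just pads with None.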
[jira] [Commented] (SPARK-20055) Documentation for CSV datasets in SQL programming guide
[ https://issues.apache.org/jira/browse/SPARK-20055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16191232#comment-16191232 ] Jorge Machado commented on SPARK-20055: --- [~hyukjin.kwon] just added some docs. > Documentation for CSV datasets in SQL programming guide > --- > > Key: SPARK-20055 > URL: https://issues.apache.org/jira/browse/SPARK-20055 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon > > I guess things commonly used and important are documented there rather than > documenting everything and every option in the programming guide - > http://spark.apache.org/docs/latest/sql-programming-guide.html. > It seems JSON datasets > http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets > are documented whereas CSV datasets are not. > Nowadays, they are pretty similar in APIs and options. Some options are > notable for both, In particular, ones such as {{wholeFile}}. Moreover, > several options such as {{inferSchema}} and {{header}} are important in CSV > that affect the type/column name of data. > In that sense, I think we might better document CSV datasets with some > examples too because I believe reading CSV is pretty much common use cases. > Also, I think we could also leave some pointers for options of API > documentations for both (rather than duplicating the documentation). > So, my suggestion is, > - Add CSV Datasets section. > - Add links for options for both JSON and CSV that point each API > documentation > - Fix trivial minor fixes together in both sections. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5236) java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt
[ https://issues.apache.org/jira/browse/SPARK-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936222#comment-15936222 ] Jorge Machado edited comment on SPARK-5236 at 3/22/17 12:52 PM: [~marmbrus] Hi Michael, so I'm experience the same issue. I'm building a datasource for Hbase with some custom schema. I'm on 1.6.3 I traced down to GeneratePredicates.scala (r: InternalRow) => p.eval(r) and on my example it tries to do instanceOf[MutableLong] from a String which fails. I have a filter on the dataframe and a groupby count {noformat} /** * * @param schema this is how the row has to look like. The returned value from the next must match this schema * @param hBaseRelation * @param repositoryHistory * @param timeZoneId * @param tablePartitionInfo * @param from * @param to */ class TagValueSparkIterator(val hBaseRelation: HBaseRelation, val schema: StructType, val repositoryHistory: DeviceHistoryRepository, val timeZoneId: String, val tablePartitionInfo: TablePartitionInfo, val from: Long, val to: Long) extends Iterator[InternalRow] { private val internalItr: ClosableIterator[TagValue[Double]]= repositoryHistory.scanTagValues(from, to, tablePartitionInfo) override def hasNext: Boolean = internalItr.hasNext override def next(): InternalRow = { val tagValue = internalItr.next() val instant = ZonedDateTime.ofInstant(Instant.ofEpochSecond(tagValue.getTimestamp), ZoneId.of(timeZoneId)).toInstant val timestamp = Timestamp.from(instant) InternalRow.fromSeq(Array(tagValue.getTimestamp,tagValue.getGuid,tagValue.getGuid,tagValue.getValue)) val mutableRow = new SpecificMutableRow(schema.fields.map(f=> f.dataType)) for (i <- schema.fields.indices){ updateMutableRow(i,tagValue,mutableRow, schema(i) ) } mutableRow } def updateMutableRow(i: Int, tagValue: TagValue[Double], row: SpecificMutableRow, field:StructField): Unit = { //#TODO this is ugly. 
field.name match { case "Date" => row.setLong(i,tagValue.getTimestamp.toLong) case "Device" => row.update(i,UTF8String.fromString(tagValue.getGuid)) case "Tag" => row.update(i,UTF8String.fromString(tagValue.getTagName)) case "TagValue" => row.setDouble(i,tagValue.getValue) } } override def toString():String ={ "Iterator for Region Name "+tablePartitionInfo.getRegionName+" Range:"+from+ "until" + "to" } } {noformat} Then I get : {noformat} Driver stacktrace: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableLong at org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.getLong(SpecificMutableRow.scala:301) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:68) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:68) at org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$2.apply(basicOperators.scala:74) at org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$2.apply(basicOperators.scala:72) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) {noformat} was (Author: jomach): [~marmbrus] Hi Michael, so I'm experience the same issue. I'm building a datasource for Hbase with some custom schema. I'm on 1.6.3 I traced down to GeneratePredicates.scala (r: InternalRow) => p.eval(r) {noformat} /** * * @param schema this is how the row has to look like. 
The returned value from the next must match this schema * @param hBaseRelation * @param repositoryHistory * @param timeZoneId * @param tablePartitionInfo * @param from * @param to */ class TagValueSparkIterator(val hBaseRelation: HBaseRelation,
[jira] [Comment Edited] (SPARK-5236) java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt
[ https://issues.apache.org/jira/browse/SPARK-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936222#comment-15936222 ] Jorge Machado edited comment on SPARK-5236 at 3/22/17 12:47 PM: [~marmbrus] Hi Michael, so I'm experience the same issue. I'm building a datasource for Hbase with some custom schema. I'm on 1.6.3 I traced down to GeneratePredicates.scala (r: InternalRow) => p.eval(r) {noformat} /** * * @param schema this is how the row has to look like. The returned value from the next must match this schema * @param hBaseRelation * @param repositoryHistory * @param timeZoneId * @param tablePartitionInfo * @param from * @param to */ class TagValueSparkIterator(val hBaseRelation: HBaseRelation, val schema: StructType, val repositoryHistory: DeviceHistoryRepository, val timeZoneId: String, val tablePartitionInfo: TablePartitionInfo, val from: Long, val to: Long) extends Iterator[InternalRow] { private val internalItr: ClosableIterator[TagValue[Double]]= repositoryHistory.scanTagValues(from, to, tablePartitionInfo) override def hasNext: Boolean = internalItr.hasNext override def next(): InternalRow = { val tagValue = internalItr.next() val instant = ZonedDateTime.ofInstant(Instant.ofEpochSecond(tagValue.getTimestamp), ZoneId.of(timeZoneId)).toInstant val timestamp = Timestamp.from(instant) InternalRow.fromSeq(Array(tagValue.getTimestamp,tagValue.getGuid,tagValue.getGuid,tagValue.getValue)) val mutableRow = new SpecificMutableRow(schema.fields.map(f=> f.dataType)) for (i <- schema.fields.indices){ updateMutableRow(i,tagValue,mutableRow, schema(i) ) } mutableRow } def updateMutableRow(i: Int, tagValue: TagValue[Double], row: SpecificMutableRow, field:StructField): Unit = { //#TODO this is ugly. 
field.name match { case "Date" => row.setLong(i,tagValue.getTimestamp.toLong) case "Device" => row.update(i,UTF8String.fromString(tagValue.getGuid)) case "Tag" => row.update(i,UTF8String.fromString(tagValue.getTagName)) case "TagValue" => row.setDouble(i,tagValue.getValue) } } override def toString():String ={ "Iterator for Region Name "+tablePartitionInfo.getRegionName+" Range:"+from+ "until" + "to" } } {noformat} Then I get : {noformat} Driver stacktrace: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableLong at org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.getLong(SpecificMutableRow.scala:301) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:68) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:68) at org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$2.apply(basicOperators.scala:74) at org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$2.apply(basicOperators.scala:72) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) {noformat} was (Author: jomach): [~marmbrus] Hi Michael, so I'm experience the same issue. I'm building a datasource for Hbase with some custom schema. {noformat} /** * * @param schema this is how the row has to look like. 
The returned value from the next must match this schema * @param hBaseRelation * @param repositoryHistory * @param timeZoneId * @param tablePartitionInfo * @param from * @param to */ class TagValueSparkIterator(val hBaseRelation: HBaseRelation, val schema: StructType, val repositoryHistory: DeviceHistoryRepository, val timeZoneId: String,
[jira] [Commented] (SPARK-5236) java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt
[ https://issues.apache.org/jira/browse/SPARK-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936222#comment-15936222 ] Jorge Machado commented on SPARK-5236: -- [~marmbrus] Hi Michael, so I'm experience the same issue. I'm building a datasource for Hbase with some custom schema. {noformat} /** * * @param schema this is how the row has to look like. The returned value from the next must match this schema * @param hBaseRelation * @param repositoryHistory * @param timeZoneId * @param tablePartitionInfo * @param from * @param to */ class TagValueSparkIterator(val hBaseRelation: HBaseRelation, val schema: StructType, val repositoryHistory: DeviceHistoryRepository, val timeZoneId: String, val tablePartitionInfo: TablePartitionInfo, val from: Long, val to: Long) extends Iterator[InternalRow] { private val internalItr: ClosableIterator[TagValue[Double]]= repositoryHistory.scanTagValues(from, to, tablePartitionInfo) override def hasNext: Boolean = internalItr.hasNext override def next(): InternalRow = { val tagValue = internalItr.next() val instant = ZonedDateTime.ofInstant(Instant.ofEpochSecond(tagValue.getTimestamp), ZoneId.of(timeZoneId)).toInstant val timestamp = Timestamp.from(instant) InternalRow.fromSeq(Array(tagValue.getTimestamp,tagValue.getGuid,tagValue.getGuid,tagValue.getValue)) val mutableRow = new SpecificMutableRow(schema.fields.map(f=> f.dataType)) for (i <- schema.fields.indices){ updateMutableRow(i,tagValue,mutableRow, schema(i) ) } mutableRow } def updateMutableRow(i: Int, tagValue: TagValue[Double], row: SpecificMutableRow, field:StructField): Unit = { //#TODO this is ugly. 
field.name match { case "Date" => row.setLong(i,tagValue.getTimestamp.toLong) case "Device" => row.update(i,UTF8String.fromString(tagValue.getGuid)) case "Tag" => row.update(i,UTF8String.fromString(tagValue.getTagName)) case "TagValue" => row.setDouble(i,tagValue.getValue) } } override def toString():String ={ "Iterator for Region Name "+tablePartitionInfo.getRegionName+" Range:"+from+ "until" + "to" } } {noformat} Then I get : {noformat} Driver stacktrace: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableLong at org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.getLong(SpecificMutableRow.scala:301) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:68) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:68) at org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$2.apply(basicOperators.scala:74) at org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$2.apply(basicOperators.scala:72) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) {noformat} > java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to > org.apache.spark.sql.catalyst.expressions.MutableInt > - > > Key: SPARK-5236 > URL: https://issues.apache.org/jira/browse/SPARK-5236 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Alex Baretta > > {code} > 15/01/14 05:39:27 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 18.0 > (TID 28, 
localhost): parquet.io.ParquetDecodingException: Can not read value > at 0 in block 0 in file gs://pa-truven/20141205/parquet/P/part-r-1.parquet > at >
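The ClassCastException in the stack traces above comes from writing a value into a SpecificMutableRow slot whose declared type doesn't match what the generated predicate later reads back. A loose plain-Python analogue of that typed-slot behavior (TypedRow is an invented name, not a Spark class):

```python
class TypedRow:
    """Toy analogue of Catalyst's SpecificMutableRow: each slot only accepts its declared type."""

    def __init__(self, types):
        self.types = types
        self.values = [None] * len(types)

    def set(self, i, value):
        if not isinstance(value, self.types[i]):
            # mirrors "MutableAny cannot be cast to MutableLong" at read/write time
            raise TypeError(f"slot {i} expects {self.types[i].__name__}, got {type(value).__name__}")
        self.values[i] = value

# Schema mirroring the iterator above: Date (long), Device, Tag, TagValue (double)
row = TypedRow([int, str, str, float])
row.set(0, 1520000000)           # ok: the "Date" slot is declared as a long
try:
    row.set(0, "2018-03-22")     # wrong type for the slot -> fails, like the stack trace
except TypeError as e:
    print(e)
```

In the reported code, InternalRow.fromSeq builds an untyped row while the schema promises typed columns, so the generated getLong finds a MutableAny instead of a MutableLong.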
[jira] [Commented] (SPARK-15505) Explode nested Array in DF Column into Multiple Columns
[ https://issues.apache.org/jira/browse/SPARK-15505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15840227#comment-15840227 ] Jorge Machado commented on SPARK-15505: --- Sometimes you read from a database that has an array in it; at least that was my case. But feel free to reject it. > Explode nested Array in DF Column into Multiple Columns > > > Key: SPARK-15505 > URL: https://issues.apache.org/jira/browse/SPARK-15505 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Jorge Machado >Priority: Minor > > At the moment if we have a DF like this : > {noformat} > +--+-+ > | Col1 | Col2| > +--+-+ > | 1 |[2, 3, 4]| > | 1 |[2, 3, 4]| > +--+-+ > {noformat} > There is no way to directly transform it into : > {noformat} > +--+--+--+--+ > | Col1 | Col2 | Col3 | Col4 | > +--+--+--+--+ > | 1 | 2 | 3 | 4 | > | 1 | 2 | 3 | 4 | > +--+--+--+--+ > {noformat} > I think this should be easy to implement. > More info here : > http://stackoverflow.com/questions/37391241/explode-spark-columns/37392793#37392793 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15505) Explode nested Array in DF Column into Multiple Columns
[ https://issues.apache.org/jira/browse/SPARK-15505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839365#comment-15839365 ] Jorge Machado commented on SPARK-15505: --- [~hyukjin.kwon] you don't really always know whether you have two or five elements in the second column... > Explode nested Array in DF Column into Multiple Columns > > > Key: SPARK-15505 > URL: https://issues.apache.org/jira/browse/SPARK-15505 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Jorge Machado >Priority: Minor > > At the moment if we have a DF like this : > {noformat} > +--+-+ > | Col1 | Col2| > +--+-+ > | 1 |[2, 3, 4]| > | 1 |[2, 3, 4]| > +--+-+ > {noformat} > There is no way to directly transform it into : > {noformat} > +--+--+--+--+ > | Col1 | Col2 | Col3 | Col4 | > +--+--+--+--+ > | 1 | 2 | 3 | 4 | > | 1 | 2 | 3 | 4 | > +--+--+--+--+ > {noformat} > I think this should be easy to implement. > More info here : > http://stackoverflow.com/questions/37391241/explode-spark-columns/37392793#37392793 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18858) reduceByKey not available on Dataset
Jorge Machado created SPARK-18858: - Summary: reduceByKey not available on Dataset Key: SPARK-18858 URL: https://issues.apache.org/jira/browse/SPARK-18858 Project: Spark Issue Type: Bug Affects Versions: 2.0.2 Reporter: Jorge Machado Priority: Minor Hi, I don't really know if this is a bug or not, but given a Dataset it should be possible to do reduceByKey, shouldn't it? At the moment I have worked around it with ds.rdd.reduceByKey; reduce does not have the same signature. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
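For reference, reduceByKey's semantics are just "group by key, then fold the values with the given function"; a minimal plain-Python sketch (reduce_by_key is an invented helper, not a Spark API):

```python
from functools import reduce
from collections import defaultdict

def reduce_by_key(pairs, f):
    # Minimal stand-in for RDD.reduceByKey: group by key, then fold values with f
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return {k: reduce(f, vs) for k, vs in groups.items()}

print(reduce_by_key([("a", 1), ("b", 2), ("a", 3)], lambda x, y: x + y))
# {'a': 4, 'b': 2}
```

On the Dataset API itself, ds.groupByKey(...).reduceGroups(...) should give roughly the same effect in Spark 2.x, though its signature indeed differs from RDD.reduceByKey.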
[jira] [Created] (SPARK-15505) Explode nested Array in DF Column into Multiple Columns
Jorge Machado created SPARK-15505: - Summary: Explode nested Array in DF Column into Multiple Columns Key: SPARK-15505 URL: https://issues.apache.org/jira/browse/SPARK-15505 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.6.1 Reporter: Jorge Machado Priority: Minor At the moment if we have a DF like this : +--+-+ | Col1 | Col2| +--+-+ | 1 |[2, 3, 4]| | 1 |[2, 3, 4]| +--+-+ There is no way to directly transform it into : +--+--+--+--+ | Col1 | Col2 | Col3 | Col4 | +--+--+--+--+ | 1 | 2 | 3 | 4 | | 1 | 2 | 3 | 4 | +--+--+--+--+ I think this should be easy to implement More infos here : http://stackoverflow.com/questions/37391241/explode-spark-columns/37392793#37392793 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
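In plain Python (not Spark) the requested reshaping looks like the following sketch, padding with None when arrays differ in length, since the number of elements isn't always known up front:

```python
rows = [(1, [2, 3, 4]), (1, [2, 3, 4])]

# One output column per array element; pad with None when an array is shorter
width = max(len(arr) for _, arr in rows)
flattened = [tuple([col1] + arr + [None] * (width - len(arr))) for col1, arr in rows]
print(flattened)  # [(1, 2, 3, 4), (1, 2, 3, 4)]
```

A DataFrame version would need a pass over the data (or a fixed schema) to learn `width` before the columns can be named, which is presumably why Spark never shipped this as a single built-in.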
[jira] [Commented] (SPARK-12988) Can't drop columns that contain dots
[ https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15270350#comment-15270350 ] Jorge Machado commented on SPARK-12988: --- +1 > Can't drop columns that contain dots > > > Key: SPARK-12988 > URL: https://issues.apache.org/jira/browse/SPARK-12988 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michael Armbrust > > Neither of these works: > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("a.c").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("`a.c`").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > Given that you can't use drop to drop subfields, it seems to me that we > should treat the column name literally (i.e. as though it is wrapped in back > ticks). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10286) Add @since annotation to pyspark.ml.param and pyspark.ml.*
[ https://issues.apache.org/jira/browse/SPARK-10286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14805180#comment-14805180 ] Jorge Machado commented on SPARK-10286: --- Hi, I'm new here; first post. I think this issue was already solved in SPARK-10281. > Add @since annotation to pyspark.ml.param and pyspark.ml.* > -- > > Key: SPARK-10286 > URL: https://issues.apache.org/jira/browse/SPARK-10286 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Xiangrui Meng >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org