[jira] [Commented] (SPARK-15615) Support for creating a dataframe from JSON in Dataset[String]
[ https://issues.apache.org/jira/browse/SPARK-15615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884558#comment-15884558 ] Apache Spark commented on SPARK-15615: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/17071 > Support for creating a dataframe from JSON in Dataset[String] > -- > > Key: SPARK-15615 > URL: https://issues.apache.org/jira/browse/SPARK-15615 > Project: Spark > Issue Type: Bug >Reporter: PJ Fanning >Assignee: PJ Fanning > Fix For: 2.2.0 > > > We should deprecate DataFrameReader.scala json(rdd: RDD[String]) and support > json(ds: Dataset[String]) instead -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
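A minimal sketch of how the proposed {{json(ds: Dataset[String])}} overload might be used, assuming it mirrors the existing {{json(rdd: RDD[String])}} behaviour with one JSON document per element:
{code}
import spark.implicits._

// Each element is a self-contained JSON record.
val jsonLines: Dataset[String] = Seq(
  """{"name": "Alice", "age": 30}""",
  """{"name": "Bob", "age": 25}"""
).toDS()

// Proposed API shape: infer the schema and build a DataFrame directly
// from a Dataset[String] instead of an RDD[String].
val df = spark.read.json(jsonLines)
df.printSchema()
{code}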
[jira] [Updated] (SPARK-14772) Python ML Params.copy treats uid, paramMaps differently than Scala
[ https://issues.apache.org/jira/browse/SPARK-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14772: -- Fix Version/s: 2.2.0 > Python ML Params.copy treats uid, paramMaps differently than Scala > -- > > Key: SPARK-14772 > URL: https://issues.apache.org/jira/browse/SPARK-14772 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.1.0 >Reporter: Joseph K. Bradley >Assignee: Bryan Cutler > Fix For: 2.1.1, 2.2.0 > > > In PySpark, {{ml.param.Params.copy}} does not quite match the Scala > implementation: > * It does not copy the UID > * It does not respect the difference between defaultParamMap and paramMap. > This is an issue with {{_copyValues}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14772) Python ML Params.copy treats uid, paramMaps differently than Scala
[ https://issues.apache.org/jira/browse/SPARK-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-14772. --- Resolution: Fixed Fix Version/s: (was: 2.2.0) 2.1.1 Issue resolved by pull request 17048 [https://github.com/apache/spark/pull/17048] > Python ML Params.copy treats uid, paramMaps differently than Scala > -- > > Key: SPARK-14772 > URL: https://issues.apache.org/jira/browse/SPARK-14772 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.1.0 >Reporter: Joseph K. Bradley >Assignee: Bryan Cutler > Fix For: 2.1.1 > > > In PySpark, {{ml.param.Params.copy}} does not quite match the Scala > implementation: > * It does not copy the UID > * It does not respect the difference between defaultParamMap and paramMap. > This is an issue with {{_copyValues}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19740) Spark executor always runs as root when running on mesos
[ https://issues.apache.org/jira/browse/SPARK-19740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884487#comment-15884487 ] Ji Yan commented on SPARK-19740: the problem is that when running Spark on Mesos, there is no way to run Spark executor as non-root user > Spark executor always runs as root when running on mesos > > > Key: SPARK-19740 > URL: https://issues.apache.org/jira/browse/SPARK-19740 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Ji Yan > > When running Spark on Mesos with docker containerizer, the spark executors > are always launched with 'docker run' command without specifying --user > option, which always results in spark executors running as root. Mesos has a > way to support arbitrary parameters. Spark could use that to expose setting > user > background on mesos with arbitrary parameters support: > https://issues.apache.org/jira/browse/MESOS-1816 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
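If Spark exposed Mesos' arbitrary docker parameters as suggested, passing {{--user}} through to 'docker run' might look roughly like the sketch below; the {{spark.mesos.executor.docker.parameters}}-style key is the kind of passthrough being discussed and is assumed here, not a confirmed option for the affected version:
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.mesos.executor.docker.image", "my-org/spark-executor:latest")
  // Assumed passthrough: each key=value pair would be appended to the
  // generated 'docker run' command line, e.g. as "--user 1000".
  .set("spark.mesos.executor.docker.parameters", "user=1000")
{code}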
[jira] [Commented] (SPARK-14393) values generated by non-deterministic functions shouldn't change after coalesce or union
[ https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884486#comment-15884486 ] Everett Anderson commented on SPARK-14393: -- Appears this also happens in 2.0.2. Thanks for fixing this! I disagree that monotonically_increasing_id function should ever be allowed to create duplicate values in a table given its documentation is "The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive." We were certainly relying on it to produce unique values! > values generated by non-deterministic functions shouldn't change after > coalesce or union > > > Key: SPARK-14393 > URL: https://issues.apache.org/jira/browse/SPARK-14393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0, 2.0.1 >Reporter: Jason Piper >Assignee: Xiangrui Meng >Priority: Blocker > Labels: correctness, releasenotes > Fix For: 2.1.0 > > > When utilising monotonicallyIncreasingId with a coalesce, it appears that > every partition uses the same offset (0) leading to non-monotonically > increasing IDs. > See examples below > {code} > >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show() > +---+ > |monotonicallyincreasingid()| > +---+ > |25769803776| > |51539607552| > |77309411328| > | 103079215104| > | 128849018880| > | 163208757248| > | 188978561024| > | 214748364800| > | 240518168576| > | 266287972352| > +---+ > >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show() > +---+ > |monotonicallyincreasingid()| > +---+ > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > +---+ > >>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show() > +---+ > |monotonicallyincreasingid()| > +---+ > | 0| > | 1| > | 0| > | 0| > | 1| > | 2| > | 3| > | 0| > | 1| > | 2| > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
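The spacing of the values in the first example follows from the function's documented layout (partition ID in the upper 31 bits, record number within the partition in the lower 33 bits), which is also why uniqueness depends on the partitioning not being re-planned by coalesce or union. A small sketch of that decomposition, assuming the documented layout:
{code}
// Partition ID in the upper 31 bits, per-partition record number in the
// lower 33 bits.
def expectedId(partitionId: Long, recordInPartition: Long): Long =
  (partitionId << 33) | recordInPartition

expectedId(3L, 0L)  // 25769803776, the first value in the example above
expectedId(6L, 0L)  // 51539607552, the second value
{code}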
[jira] [Commented] (SPARK-19602) Unable to query using the fully qualified column name of the form ( ..)
[ https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884447#comment-15884447 ] Sunitha Kambhampati commented on SPARK-19602: - I have made the changes to support three part column name. In order to aid in the review and to reduce the diff, the test scenarios are separated out into the above PR: 17067. > Unable to query using the fully qualified column name of the form ( > ..) > -- > > Key: SPARK-19602 > URL: https://issues.apache.org/jira/browse/SPARK-19602 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Sunitha Kambhampati > Attachments: Design_ColResolution_JIRA19602.pdf > > > 1) Spark SQL fails to analyze this query: select db1.t1.i1 from db1.t1, > db2.t1 > Most of the other database systems support this ( e.g DB2, Oracle, MySQL). > Note: In DB2, Oracle, the notion is of .. > 2) Another scenario where this fully qualified name is useful is as follows: > // current database is db1. > select t1.i1 from t1, db2.t1 > If the i1 column exists in both tables: db1.t1 and db2.t1, this will throw an > error during column resolution in the analyzer, as it is ambiguous. > Lets say the user intended to retrieve i1 from db1.t1 but in the example > db2.t1 only has i1 column. The query would still succeed instead of throwing > an error. > One way to avoid confusion would be to explicitly specify using the fully > qualified name db1.t1.i1 > For e.g: select db1.t1.i1 from t1, db2.t1 > Workarounds: > There is a workaround for these issues, which is to use an alias. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
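The alias workaround mentioned at the end of the description can be sketched as follows (db1, db2, t1 and i1 are the example names from the report):
{code}
// Workaround until fully qualified database.table.column references are
// supported: alias each table so the column reference is unambiguous.
spark.sql("SELECT a.i1 FROM db1.t1 AS a, db2.t1 AS b").show()
{code}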
[jira] [Assigned] (SPARK-19709) CSV datasource fails to read empty file
[ https://issues.apache.org/jira/browse/SPARK-19709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19709: Assignee: Apache Spark > CSV datasource fails to read empty file > --- > > Key: SPARK-19709 > URL: https://issues.apache.org/jira/browse/SPARK-19709 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > I just {{touch a}} and then ran the codes below: > {code} > scala> spark.read.csv("a") > java.util.NoSuchElementException: next on empty iterator > at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) > at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) > at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike. > {code} > It seems we should produce an empty dataframe consistently with > `spark.read.json("a")`. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19709) CSV datasource fails to read empty file
[ https://issues.apache.org/jira/browse/SPARK-19709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19709: Assignee: (was: Apache Spark) > CSV datasource fails to read empty file > --- > > Key: SPARK-19709 > URL: https://issues.apache.org/jira/browse/SPARK-19709 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Priority: Minor > > I just {{touch a}} and then ran the codes below: > {code} > scala> spark.read.csv("a") > java.util.NoSuchElementException: next on empty iterator > at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) > at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) > at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike. > {code} > It seems we should produce an empty dataframe consistently with > `spark.read.json("a")`. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19709) CSV datasource fails to read empty file
[ https://issues.apache.org/jira/browse/SPARK-19709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884446#comment-15884446 ] Apache Spark commented on SPARK-19709: -- User 'wojtek-szymanski' has created a pull request for this issue: https://github.com/apache/spark/pull/17068 > CSV datasource fails to read empty file > --- > > Key: SPARK-19709 > URL: https://issues.apache.org/jira/browse/SPARK-19709 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Priority: Minor > > I just {{touch a}} and then ran the codes below: > {code} > scala> spark.read.csv("a") > java.util.NoSuchElementException: next on empty iterator > at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) > at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) > at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike. > {code} > It seems we should produce an empty dataframe consistently with > `spark.read.json("a")`. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19602) Unable to query using the fully qualified column name of the form ( ..)
[ https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19602: Assignee: (was: Apache Spark) > Unable to query using the fully qualified column name of the form ( > ..) > -- > > Key: SPARK-19602 > URL: https://issues.apache.org/jira/browse/SPARK-19602 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Sunitha Kambhampati > Attachments: Design_ColResolution_JIRA19602.pdf > > > 1) Spark SQL fails to analyze this query: select db1.t1.i1 from db1.t1, > db2.t1 > Most of the other database systems support this ( e.g DB2, Oracle, MySQL). > Note: In DB2, Oracle, the notion is of .. > 2) Another scenario where this fully qualified name is useful is as follows: > // current database is db1. > select t1.i1 from t1, db2.t1 > If the i1 column exists in both tables: db1.t1 and db2.t1, this will throw an > error during column resolution in the analyzer, as it is ambiguous. > Lets say the user intended to retrieve i1 from db1.t1 but in the example > db2.t1 only has i1 column. The query would still succeed instead of throwing > an error. > One way to avoid confusion would be to explicitly specify using the fully > qualified name db1.t1.i1 > For e.g: select db1.t1.i1 from t1, db2.t1 > Workarounds: > There is a workaround for these issues, which is to use an alias. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19602) Unable to query using the fully qualified column name of the form ( ..)
[ https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884443#comment-15884443 ] Apache Spark commented on SPARK-19602: -- User 'skambha' has created a pull request for this issue: https://github.com/apache/spark/pull/17067 > Unable to query using the fully qualified column name of the form ( > ..) > -- > > Key: SPARK-19602 > URL: https://issues.apache.org/jira/browse/SPARK-19602 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Sunitha Kambhampati > Attachments: Design_ColResolution_JIRA19602.pdf > > > 1) Spark SQL fails to analyze this query: select db1.t1.i1 from db1.t1, > db2.t1 > Most of the other database systems support this ( e.g DB2, Oracle, MySQL). > Note: In DB2, Oracle, the notion is of .. > 2) Another scenario where this fully qualified name is useful is as follows: > // current database is db1. > select t1.i1 from t1, db2.t1 > If the i1 column exists in both tables: db1.t1 and db2.t1, this will throw an > error during column resolution in the analyzer, as it is ambiguous. > Lets say the user intended to retrieve i1 from db1.t1 but in the example > db2.t1 only has i1 column. The query would still succeed instead of throwing > an error. > One way to avoid confusion would be to explicitly specify using the fully > qualified name db1.t1.i1 > For e.g: select db1.t1.i1 from t1, db2.t1 > Workarounds: > There is a workaround for these issues, which is to use an alias. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19602) Unable to query using the fully qualified column name of the form ( ..)
[ https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19602: Assignee: Apache Spark > Unable to query using the fully qualified column name of the form ( > ..) > -- > > Key: SPARK-19602 > URL: https://issues.apache.org/jira/browse/SPARK-19602 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Sunitha Kambhampati >Assignee: Apache Spark > Attachments: Design_ColResolution_JIRA19602.pdf > > > 1) Spark SQL fails to analyze this query: select db1.t1.i1 from db1.t1, > db2.t1 > Most of the other database systems support this ( e.g DB2, Oracle, MySQL). > Note: In DB2, Oracle, the notion is of .. > 2) Another scenario where this fully qualified name is useful is as follows: > // current database is db1. > select t1.i1 from t1, db2.t1 > If the i1 column exists in both tables: db1.t1 and db2.t1, this will throw an > error during column resolution in the analyzer, as it is ambiguous. > Lets say the user intended to retrieve i1 from db1.t1 but in the example > db2.t1 only has i1 column. The query would still succeed instead of throwing > an error. > One way to avoid confusion would be to explicitly specify using the fully > qualified name db1.t1.i1 > For e.g: select db1.t1.i1 from t1, db2.t1 > Workarounds: > There is a workaround for these issues, which is to use an alias. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19602) Unable to query using the fully qualified column name of the form ( ..)
[ https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunitha Kambhampati updated SPARK-19602: Attachment: (was: Design_ColResolution_JIRA19602.docx) > Unable to query using the fully qualified column name of the form ( > ..) > -- > > Key: SPARK-19602 > URL: https://issues.apache.org/jira/browse/SPARK-19602 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Sunitha Kambhampati > Attachments: Design_ColResolution_JIRA19602.pdf > > > 1) Spark SQL fails to analyze this query: select db1.t1.i1 from db1.t1, > db2.t1 > Most of the other database systems support this ( e.g DB2, Oracle, MySQL). > Note: In DB2, Oracle, the notion is of .. > 2) Another scenario where this fully qualified name is useful is as follows: > // current database is db1. > select t1.i1 from t1, db2.t1 > If the i1 column exists in both tables: db1.t1 and db2.t1, this will throw an > error during column resolution in the analyzer, as it is ambiguous. > Lets say the user intended to retrieve i1 from db1.t1 but in the example > db2.t1 only has i1 column. The query would still succeed instead of throwing > an error. > One way to avoid confusion would be to explicitly specify using the fully > qualified name db1.t1.i1 > For e.g: select db1.t1.i1 from t1, db2.t1 > Workarounds: > There is a workaround for these issues, which is to use an alias. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19602) Unable to query using the fully qualified column name of the form ( ..)
[ https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunitha Kambhampati updated SPARK-19602: Attachment: Design_ColResolution_JIRA19602.pdf > Unable to query using the fully qualified column name of the form ( > ..) > -- > > Key: SPARK-19602 > URL: https://issues.apache.org/jira/browse/SPARK-19602 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Sunitha Kambhampati > Attachments: Design_ColResolution_JIRA19602.docx, > Design_ColResolution_JIRA19602.pdf > > > 1) Spark SQL fails to analyze this query: select db1.t1.i1 from db1.t1, > db2.t1 > Most of the other database systems support this ( e.g DB2, Oracle, MySQL). > Note: In DB2, Oracle, the notion is of .. > 2) Another scenario where this fully qualified name is useful is as follows: > // current database is db1. > select t1.i1 from t1, db2.t1 > If the i1 column exists in both tables: db1.t1 and db2.t1, this will throw an > error during column resolution in the analyzer, as it is ambiguous. > Lets say the user intended to retrieve i1 from db1.t1 but in the example > db2.t1 only has i1 column. The query would still succeed instead of throwing > an error. > One way to avoid confusion would be to explicitly specify using the fully > qualified name db1.t1.i1 > For e.g: select db1.t1.i1 from t1, db2.t1 > Workarounds: > There is a workaround for these issues, which is to use an alias. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19602) Unable to query using the fully qualified column name of the form ( ..)
[ https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunitha Kambhampati updated SPARK-19602: Attachment: Design_ColResolution_JIRA19602.docx > Unable to query using the fully qualified column name of the form ( > ..) > -- > > Key: SPARK-19602 > URL: https://issues.apache.org/jira/browse/SPARK-19602 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Sunitha Kambhampati > Attachments: Design_ColResolution_JIRA19602.docx > > > 1) Spark SQL fails to analyze this query: select db1.t1.i1 from db1.t1, > db2.t1 > Most of the other database systems support this ( e.g DB2, Oracle, MySQL). > Note: In DB2, Oracle, the notion is of .. > 2) Another scenario where this fully qualified name is useful is as follows: > // current database is db1. > select t1.i1 from t1, db2.t1 > If the i1 column exists in both tables: db1.t1 and db2.t1, this will throw an > error during column resolution in the analyzer, as it is ambiguous. > Lets say the user intended to retrieve i1 from db1.t1 but in the example > db2.t1 only has i1 column. The query would still succeed instead of throwing > an error. > One way to avoid confusion would be to explicitly specify using the fully > qualified name db1.t1.i1 > For e.g: select db1.t1.i1 from t1, db2.t1 > Workarounds: > There is a workaround for these issues, which is to use an alias. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19602) Unable to query using the fully qualified column name of the form ( ..)
[ https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunitha Kambhampati updated SPARK-19602: Attachment: (was: Design_ColResolution_JIRA19602.docx) > Unable to query using the fully qualified column name of the form ( > ..) > -- > > Key: SPARK-19602 > URL: https://issues.apache.org/jira/browse/SPARK-19602 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Sunitha Kambhampati > > 1) Spark SQL fails to analyze this query: select db1.t1.i1 from db1.t1, > db2.t1 > Most of the other database systems support this ( e.g DB2, Oracle, MySQL). > Note: In DB2, Oracle, the notion is of .. > 2) Another scenario where this fully qualified name is useful is as follows: > // current database is db1. > select t1.i1 from t1, db2.t1 > If the i1 column exists in both tables: db1.t1 and db2.t1, this will throw an > error during column resolution in the analyzer, as it is ambiguous. > Lets say the user intended to retrieve i1 from db1.t1 but in the example > db2.t1 only has i1 column. The query would still succeed instead of throwing > an error. > One way to avoid confusion would be to explicitly specify using the fully > qualified name db1.t1.i1 > For e.g: select db1.t1.i1 from t1, db2.t1 > Workarounds: > There is a workaround for these issues, which is to use an alias. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19740) Spark executor always runs as root when running on mesos
[ https://issues.apache.org/jira/browse/SPARK-19740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884432#comment-15884432 ] Sean Owen commented on SPARK-19740: --- What is the bug or Spark problem here? (We use pull requests, not links to branches) > Spark executor always runs as root when running on mesos > > > Key: SPARK-19740 > URL: https://issues.apache.org/jira/browse/SPARK-19740 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Ji Yan > > When running Spark on Mesos with docker containerizer, the spark executors > are always launched with 'docker run' command without specifying --user > option, which always results in spark executors running as root. Mesos has a > way to support arbitrary parameters. Spark could use that to expose setting > user > background on mesos with arbitrary parameters support: > https://issues.apache.org/jira/browse/MESOS-1816 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19740) Spark executor always runs as root when running on mesos
[ https://issues.apache.org/jira/browse/SPARK-19740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884426#comment-15884426 ] Ji Yan commented on SPARK-19740: proposed change: https://github.com/yanji84/spark/commit/4f8368ea727e5689e96794884b8d1baf3eccb5d5 > Spark executor always runs as root when running on mesos > > > Key: SPARK-19740 > URL: https://issues.apache.org/jira/browse/SPARK-19740 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Ji Yan > > When running Spark on Mesos with docker containerizer, the spark executors > are always launched with 'docker run' command without specifying --user > option, which always results in spark executors running as root. Mesos has a > way to support arbitrary parameters. Spark could use that to expose setting > user > background on mesos with arbitrary parameters support: > https://issues.apache.org/jira/browse/MESOS-1816 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19740) Spark executor always runs as root when running on mesos
[ https://issues.apache.org/jira/browse/SPARK-19740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ji Yan updated SPARK-19740: --- Description: When running Spark on Mesos with docker containerizer, the spark executors are always launched with 'docker run' command without specifying --user option, which always results in spark executors running as root. Mesos has a way to support arbitrary parameters. Spark could use that to expose setting user background on mesos with arbitrary parameters support: https://issues.apache.org/jira/browse/MESOS-1816 was:When running Spark on Mesos with docker containerizer, the spark executors are always launched with 'docker run' command without specifying --user option, which always results in spark executors running as root. Mesos has a way to support arbitrary parameters. Spark could use that to expose setting user > Spark executor always runs as root when running on mesos > > > Key: SPARK-19740 > URL: https://issues.apache.org/jira/browse/SPARK-19740 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Ji Yan > > When running Spark on Mesos with docker containerizer, the spark executors > are always launched with 'docker run' command without specifying --user > option, which always results in spark executors running as root. Mesos has a > way to support arbitrary parameters. Spark could use that to expose setting > user > background on mesos with arbitrary parameters support: > https://issues.apache.org/jira/browse/MESOS-1816 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19740) Spark executor always runs as root when running on mesos
Ji Yan created SPARK-19740: -- Summary: Spark executor always runs as root when running on mesos Key: SPARK-19740 URL: https://issues.apache.org/jira/browse/SPARK-19740 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 2.1.0 Reporter: Ji Yan When running Spark on Mesos with docker containerizer, the spark executors are always launched with 'docker run' command without specifying --user option, which always results in spark executors running as root. Mesos has a way to support arbitrary parameters. Spark could use that to expose setting user -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15288) Mesos dispatcher should handle gracefully when any thread gets UncaughtException
[ https://issues.apache.org/jira/browse/SPARK-15288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15288. --- Resolution: Fixed Assignee: Devaraj K Fix Version/s: 2.2.0 Resolved by https://github.com/apache/spark/pull/13072 > Mesos dispatcher should handle gracefully when any thread gets > UncaughtException > > > Key: SPARK-15288 > URL: https://issues.apache.org/jira/browse/SPARK-15288 > Project: Spark > Issue Type: Improvement > Components: Deploy, Mesos >Reporter: Devaraj K >Assignee: Devaraj K >Priority: Minor > Fix For: 2.2.0 > > > If any thread of the Mesos dispatcher gets an uncaught exception, the thread > terminates and the dispatcher process keeps running without functioning > properly. > I think we need to handle the UncaughtException and shut down the Mesos > dispatcher. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
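The general technique here is the JVM's default uncaught-exception handler; a minimal sketch of the idea (not the code from the linked pull request):
{code}
// Install a process-wide handler so an uncaught exception in any thread
// shuts the daemon down instead of leaving it running in a broken state.
Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
  override def uncaughtException(thread: Thread, throwable: Throwable): Unit = {
    System.err.println(s"Uncaught exception in thread ${thread.getName}: $throwable")
    sys.exit(1)
  }
})
{code}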
[jira] [Resolved] (SPARK-19673) ThriftServer default app name is changed wrong
[ https://issues.apache.org/jira/browse/SPARK-19673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19673. --- Resolution: Fixed Fix Version/s: 2.2.0 Resolved by https://github.com/apache/spark/pull/17010 > ThriftServer default app name is changed wrong > -- > > Key: SPARK-19673 > URL: https://issues.apache.org/jira/browse/SPARK-19673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: LvDongrong >Assignee: LvDongrong >Priority: Trivial > Fix For: 2.2.0 > > > In Spark 1.x, the name of ThriftServer is SparkSQL:localHostName, while the > ThriftServer default name is changed to the className of HiveThriftServer2 > (org.apache.spark.sql.hive.thriftserver.HiveThriftServer2), which is not > appropriate. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19673) ThriftServer default app name is changed wrong
[ https://issues.apache.org/jira/browse/SPARK-19673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-19673: - Assignee: LvDongrong Priority: Trivial (was: Major) > ThriftServer default app name is changed wrong > -- > > Key: SPARK-19673 > URL: https://issues.apache.org/jira/browse/SPARK-19673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: LvDongrong >Assignee: LvDongrong >Priority: Trivial > Fix For: 2.2.0 > > > In Spark 1.x, the name of ThriftServer is SparkSQL:localHostName, while the > ThriftServer default name is changed to the className of HiveThriftServer2 > (org.apache.spark.sql.hive.thriftserver.HiveThriftServer2), which is not > appropriate. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19739) SparkHadoopUtil.appendS3AndSparkHadoopConfigurations to propagate full set of AWS env vars
Steve Loughran created SPARK-19739: -- Summary: SparkHadoopUtil.appendS3AndSparkHadoopConfigurations to propagate full set of AWS env vars Key: SPARK-19739 URL: https://issues.apache.org/jira/browse/SPARK-19739 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.1.0 Reporter: Steve Loughran Priority: Minor {{SparkHadoopUtil.appendS3AndSparkHadoopConfigurations()}} propagates the AWS user and secret key to s3n and s3a config options, so getting secrets from the user to the cluster, if set. AWS also supports session authentication (env var {{AWS_SESSION_TOKEN}}) and region endpoints {{AWS_DEFAULT_REGION}}, the latter being critical if you want to address V4-auth-only endpoints like Frankfurt and Seoul. These env vars should be picked up and passed down to S3a too. 4+ lines of code, though impossible to test unless the existing code is refactored to take the env var Map[String, String], so allowing a test suite to set the values in its own map. Side issue: what if only half the env vars are set and users are trying to understand why auth is failing? It may be good to build up a string identifying which env vars had their values propagated, and log that @ debug, while not logging the values, obviously. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
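A rough sketch of the refactoring idea, not the actual SparkHadoopUtil code: take the environment as a {{Map[String, String]}} so a test suite can inject its own values, and map the extra env vars onto the s3a options for session tokens and endpoints (the region-to-endpoint translation is simplified here):
{code}
import org.apache.hadoop.conf.Configuration

def appendAwsSessionEnv(env: Map[String, String], hadoopConf: Configuration): Unit = {
  // Session credentials: propagate the token alongside the access/secret keys.
  env.get("AWS_SESSION_TOKEN").foreach(hadoopConf.set("fs.s3a.session.token", _))
  // AWS_DEFAULT_REGION holds a region name; deriving the matching V4 endpoint
  // is reduced here to the common s3.<region>.amazonaws.com form.
  env.get("AWS_DEFAULT_REGION").foreach { region =>
    hadoopConf.set("fs.s3a.endpoint", s"s3.$region.amazonaws.com")
  }
}
{code}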
[jira] [Commented] (SPARK-19725) different parquet dependency in spark2.0.x and Hive2.x cause failure of HoS when using parquet file format
[ https://issues.apache.org/jira/browse/SPARK-19725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884294#comment-15884294 ] Sean Owen commented on SPARK-19725: --- We wouldn't update this dependency in a maintenance release of 2.0.x or 2.1.x. > different parquet dependency in spark2.0.x and Hive2.x cause failure of HoS > when using parquet file format > -- > > Key: SPARK-19725 > URL: https://issues.apache.org/jira/browse/SPARK-19725 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.2 > Environment: spark2.0.2 > hive2.2 > hadoop2.7.1 >Reporter: KaiXu > > the parquet version in hive2.x is 1.8.1 while in spark2.0.x is 1.7.0, so when > run HoS queries using parquet file format would encounter some jars conflict > problems: > Starting Spark Job = d1f6825c-48ea-45b8-9614-4266f2d1f0bd > Job failed with java.lang.NoSuchMethodError: > org.apache.parquet.schema.Types$PrimitiveBuilder.length(I)Lorg/apache/parquet/schema/Types$BasePrimitiveBuilder; > FAILED: Execution Error, return code 3 from > org.apache.hadoop.hive.ql.exec.spark.SparkTask. > java.util.concurrent.ExecutionException: Exception thrown by job > at > org.apache.spark.JavaFutureActionWrapper.getImpl(FutureAction.scala:272) > at > org.apache.spark.JavaFutureActionWrapper.get(FutureAction.scala:277) > at > org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:362) > at > org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in > stage 1.0 (TID 9, hsx-node7): java.lang.RuntimeException: Error processing > row: java.lang.NoSuchMethodError: > org.apache.parquet.schema.Types$PrimitiveBuilder.length(I)Lorg/apache/parquet/schema/Types$BasePrimitiveBuilder; > at > org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:149) > at > org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48) > at > org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27) > at > org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85) > at > scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127) > at > org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127) > at > org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1976) > at > org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1976) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.NoSuchMethodError: > org.apache.parquet.schema.Types$PrimitiveBuilder.length(I)Lorg/apache/parquet/schema/Types$BasePrimitiveBuilder; > at > org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convertType(HiveSchemaConverter.java:100) > at > org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convertType(HiveSchemaConverter.java:56) > at > org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convertTypes(HiveSchemaConverter.java:50) > at > org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convert(HiveSchemaConverter.java:39) > at > org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:115) > at > org.apache.hadoop.hive.ql.io.HiveFileForma
[jira] [Updated] (SPARK-19733) ALS performs unnecessary casting on item and user ids
[ https://issues.apache.org/jira/browse/SPARK-19733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vasilis Vryniotis updated SPARK-19733: -- Affects Version/s: (was: 1.6.3) > ALS performs unnecessary casting on item and user ids > - > > Key: SPARK-19733 > URL: https://issues.apache.org/jira/browse/SPARK-19733 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Vasilis Vryniotis > > The ALS is performing unnecessary casting to the user and item ids (to > double). I believe this is because the protected checkedCast() method > requires a double input. This can be avoided by refactroing the code of > checkedCast method. > Issue resolved by pull-request 17059: > https://github.com/apache/spark/pull/17059 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14222) Cross-publish jackson-module-scala for Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-14222. --- Resolution: Done jackson-module-scala is cross-published for 2.12 from version 2.7.9 and 2.8.+ onwards, which I think may be sufficient to call this part done. > Cross-publish jackson-module-scala for Scala 2.12 > - > > Key: SPARK-14222 > URL: https://issues.apache.org/jira/browse/SPARK-14222 > Project: Spark > Issue Type: Sub-task > Components: Build >Reporter: Josh Rosen >Assignee: Josh Rosen > > In order to build Spark against Scala 2.12, we need to either remove our > jackson-module-scala dependency or cross-publish Jackson for Scala 2.12. > Personally, I'd prefer to remove it because I don't think we make extensive > use of it and because I'm not a huge fan of the implicit mapping between case > classes and JSON wire formats (the extra verbosity required by other > approaches is a feature, IMO, rather than a bug because it makes it much > harder to accidentally break wire compatibility). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14221) Cross-publish Chill for Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-14221. --- Resolution: Done Chill 0.8.2 and onwards are cross-published for 2.12, so I think that resolves this, as we're on 0.8.0 already. > Cross-publish Chill for Scala 2.12 > -- > > Key: SPARK-14221 > URL: https://issues.apache.org/jira/browse/SPARK-14221 > Project: Spark > Issue Type: Sub-task > Components: Build >Reporter: Josh Rosen >Assignee: Josh Rosen > > We need to cross-publish Chill in order to build against Scala 2.12. > Upstream issue: https://github.com/twitter/chill/issues/252 > I tried building and testing {{chill-scala}} against 2.12.0-M3 and ran into > multiple failed tests due to issues with Java8 lambda serialization (similar > to https://github.com/EsotericSoftware/kryo/issues/215), so this task will be > slightly more involved then just bumping the dependencies in the Chill build. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19738) Consider adding error handler to DataStreamWriter
Jayesh lalwani created SPARK-19738: -- Summary: Consider adding error handler to DataStreamWriter Key: SPARK-19738 URL: https://issues.apache.org/jira/browse/SPARK-19738 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Jayesh lalwani For Structured Streaming implementations, it is important that the applications stay always on. However, right now, errors stop the driver. In some cases, this is not desirable behavior. For example, I have the following application:
{code}
import org.apache.spark.sql.types._
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark.readStream.schema(userSchema).csv("s3://bucket/jayesh/streamingerror/")
csvDF.writeStream.format("console").start()
{code}
I send the following input to it:
{quote}
1,Iron man
2,SUperman
{quote}
Obviously, the data is bad. This causes the executor to throw an exception that propagates to the driver, which promptly shuts down. The driver is running in supervised mode, and it gets restarted. The application reads the same bad input and shuts down again. This goes ad infinitum. This behavior is desirable in cases where the error is recoverable. For example, if the executor cannot talk to the database, we want the application to keep trying the same input again and again till the database recovers. However, in some cases, this behavior is undesirable. We do not want this to happen when the input is bad. We want to put the bad record in some sort of dead letter queue. Or maybe we want to kill the driver only when the number of errors has crossed a certain threshold. Or maybe we want to email someone. Proposal: Add an error handler to the data stream. When the executor fails, it should call the error handler and pass the Exception to the error handler. The error handler could eat the exception, or transform it, or update counts in an accumulator, etc.
{code}
import org.apache.spark.sql.types._
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark.readStream.schema(userSchema).csv("s3://bucket/jayesh/streamingerror/")
csvDF.writeStream.format("console").errorhandler("com.jayesh.ErrorHandler").start()
{code}
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19715) Option to Strip Paths in FileSource
[ https://issues.apache.org/jira/browse/SPARK-19715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884230#comment-15884230 ] Steve Loughran edited comment on SPARK-19715 at 2/25/17 1:24 PM: - OK. I'd recommend going with {{Path.getURI.getPath()}} to get the full path, though there's the always the risk of >1 s3a bucket referring to the same objects Some filesystems (HDFS) have checksums you can ask for, though S3a doesn't, yet: HADOOP-13282 has discussed serving up etags, primarily to aid distcp updates. If added, you could use that as the differentiator, or at least to identify changed files. Patches welcome, [with tests|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/testing.md] To be ruthless, it may have been simpler for the user just to edit the fs.s3n.impl binding to point to S3AFileSystem.class & then left the URLs the same was (Author: ste...@apache.org): OK. I'd recommend going twith Path.getURI.getPath() to get the full path, though there's the always the risk of >1 s3a bucket referring to the same objects Some filesystems (HDFS, file:) have checksums you can ask for, though S3a doesn't, yet: HADOOP-13282 has discussed serving up etags, primarily to aid distcp updates. If added, you could use that as the differentiator, or at least to identify changed files To be ruthless, it may have been simpler for the user just to edit the fs.s3n.impl binding to point to S3AFileSystem.class & then left the URLs the same > Option to Strip Paths in FileSource > --- > > Key: SPARK-19715 > URL: https://issues.apache.org/jira/browse/SPARK-19715 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Michael Armbrust > > Today, we compare the whole path when deciding if a file is new in the > FileSource for structured streaming. However, this cause cause false > negatives in the case where the path has changed in a cosmetic way (i.e. > changing s3n to s3a). We should add an option {{fileNameOnly}} that causes > the new file check to be based only on the filename (but still store the > whole path in the log). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19715) Option to Strip Paths in FileSource
[ https://issues.apache.org/jira/browse/SPARK-19715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884230#comment-15884230 ] Steve Loughran commented on SPARK-19715: OK. I'd recommend going with Path.getURI.getPath() to get the full path, though there's always the risk of >1 s3a bucket referring to the same objects. Some filesystems (HDFS, file:) have checksums you can ask for, though S3a doesn't, yet: HADOOP-13282 has discussed serving up etags, primarily to aid distcp updates. If added, you could use that as the differentiator, or at least to identify changed files. To be ruthless, it may have been simpler for the user just to edit the fs.s3n.impl binding to point to S3AFileSystem.class & then leave the URLs the same > Option to Strip Paths in FileSource > --- > > Key: SPARK-19715 > URL: https://issues.apache.org/jira/browse/SPARK-19715 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Michael Armbrust > > Today, we compare the whole path when deciding if a file is new in the > FileSource for structured streaming. However, this can cause false > negatives in the case where the path has changed in a cosmetic way (i.e. > changing s3n to s3a). We should add an option {{fileNameOnly}} that causes > the new file check to be based only on the filename (but still store the > whole path in the log). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
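A small sketch of the scheme-stripping idea, using {{Path.toUri}} (the Hadoop method behind the getURI/getPath suggestion above); note the bucket, being the URI authority, is dropped as well, which is the >1-bucket risk mentioned:
{code}
import org.apache.hadoop.fs.Path

// Compare paths without the scheme, so s3n://bucket/data/part-0000 and
// s3a://bucket/data/part-0000 are treated as the same file.
val p = new Path("s3a://bucket/data/part-0000")
val schemeless = p.toUri.getPath  // "/data/part-0000" -- the bucket is gone too
{code}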
[jira] [Updated] (SPARK-19725) different parquet dependency in spark2.0.x and Hive2.x cause failure of HoS when using parquet file format
[ https://issues.apache.org/jira/browse/SPARK-19725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXu updated SPARK-19725: -- Description: the parquet version in hive2.x is 1.8.1 while in spark2.0.x is 1.7.0, so when run HoS queries using parquet file format would encounter some jars conflict problems: Starting Spark Job = d1f6825c-48ea-45b8-9614-4266f2d1f0bd Job failed with java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$PrimitiveBuilder.length(I)Lorg/apache/parquet/schema/Types$BasePrimitiveBuilder; FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. java.util.concurrent.ExecutionException: Exception thrown by job at org.apache.spark.JavaFutureActionWrapper.getImpl(FutureAction.scala:272) at org.apache.spark.JavaFutureActionWrapper.get(FutureAction.scala:277) at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:362) at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0 (TID 9, hsx-node7): java.lang.RuntimeException: Error processing row: java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$PrimitiveBuilder.length(I)Lorg/apache/parquet/schema/Types$BasePrimitiveBuilder; at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:149) at org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48) at org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27) at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85) at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127) at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127) at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1976) at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1976) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$PrimitiveBuilder.length(I)Lorg/apache/parquet/schema/Types$BasePrimitiveBuilder; at org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convertType(HiveSchemaConverter.java:100) at org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convertType(HiveSchemaConverter.java:56) at 
org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convertTypes(HiveSchemaConverter.java:50) at org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convert(HiveSchemaConverter.java:39) at org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:115) at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:286) at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:271) at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketForFileIdx(FileSinkOperator.java:609) at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:553) at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:664) at org.apache.hadoop.hive.ql.exec.vector.VectorFileSinkOperator.process(VectorFileSinkOperator.java:101) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897) at org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:137) was: the parquet version in hive2.x is
[jira] [Commented] (SPARK-19725) different parquet dependency in spark2.x and Hive2.x cause failure of HoS when using parquet file format
[ https://issues.apache.org/jira/browse/SPARK-19725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884208#comment-15884208 ] KaiXu commented on SPARK-19725: --- Hive supports spark2.x through HIVE-14029, spark supports Hive 2(SPARK-13446) is on the way. Yes, maybe I should change the tittle to spark2.0.x. > different parquet dependency in spark2.x and Hive2.x cause failure of HoS > when using parquet file format > > > Key: SPARK-19725 > URL: https://issues.apache.org/jira/browse/SPARK-19725 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.2 > Environment: spark2.0.2 > hive2.2 > hadoop2.7.1 >Reporter: KaiXu > > the parquet version in hive2.x is 1.8.1 while in spark2.x is 1.7.0, so when > run HoS queries using parquet file format would encounter some jars conflict > problems: > Starting Spark Job = d1f6825c-48ea-45b8-9614-4266f2d1f0bd > Job failed with java.lang.NoSuchMethodError: > org.apache.parquet.schema.Types$PrimitiveBuilder.length(I)Lorg/apache/parquet/schema/Types$BasePrimitiveBuilder; > FAILED: Execution Error, return code 3 from > org.apache.hadoop.hive.ql.exec.spark.SparkTask. > java.util.concurrent.ExecutionException: Exception thrown by job > at > org.apache.spark.JavaFutureActionWrapper.getImpl(FutureAction.scala:272) > at > org.apache.spark.JavaFutureActionWrapper.get(FutureAction.scala:277) > at > org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:362) > at > org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in > stage 1.0 (TID 9, hsx-node7): java.lang.RuntimeException: Error processing > row: java.lang.NoSuchMethodError: > org.apache.parquet.schema.Types$PrimitiveBuilder.length(I)Lorg/apache/parquet/schema/Types$BasePrimitiveBuilder; > at > org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:149) > at > org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48) > at > org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27) > at > org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85) > at > scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127) > at > org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127) > at > org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1976) > at > org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1976) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.NoSuchMethodError: > org.apache.parquet.schema.Types$PrimitiveBuilder.length(I)Lorg/apache/parquet/schema/Types$BasePrimitiveBuilder; > at > org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convertType(HiveSchemaConverter.java:100) > at > org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convertType(HiveSchemaConverter.java:56) > at > org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convertTypes(HiveSchemaConverter.java:50) > at > org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convert(HiveSchemaConverter.java:39) > at > org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:115) >