[jira] [Resolved] (SPARK-21238) allow nested SQL execution
[ https://issues.apache.org/jira/browse/SPARK-21238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21238. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18450 [https://github.com/apache/spark/pull/18450] > allow nested SQL execution > -- > > Key: SPARK-21238 > URL: https://issues.apache.org/jira/browse/SPARK-21238 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.3.0 > Reporter: Wenchen Fan > Assignee: Wenchen Fan > Fix For: 2.3.0
[jira] [Assigned] (SPARK-3577) Add task metric to report spill time
[ https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-3577: -- Assignee: Sital Kedia > Add task metric to report spill time > > > Key: SPARK-3577 > URL: https://issues.apache.org/jira/browse/SPARK-3577 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core > Affects Versions: 1.1.0 > Reporter: Kay Ousterhout > Assignee: Sital Kedia > Priority: Minor > Fix For: 2.3.0 > > Attachments: spill_size.jpg > > > The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} into {{ExternalSorter}}. The write time recorded in those metrics is never used. We should probably add task metrics to report this spill time, since for shuffles, this would have previously been reported as part of shuffle write time (with the original hash-based sorter).
[jira] [Resolved] (SPARK-3577) Add task metric to report spill time
[ https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-3577. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 17471 [https://github.com/apache/spark/pull/17471] > Add task metric to report spill time > > > Key: SPARK-3577 > URL: https://issues.apache.org/jira/browse/SPARK-3577 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core > Affects Versions: 1.1.0 > Reporter: Kay Ousterhout > Priority: Minor > Fix For: 2.3.0 > > Attachments: spill_size.jpg
[jira] [Created] (SPARK-21249) Is it possible to use File Sink with mapGroupsWithState in Structured Streaming?
Amit Baghel created SPARK-21249: --- Summary: Is it possible to use File Sink with mapGroupsWithState in Structured Streaming? Key: SPARK-21249 URL: https://issues.apache.org/jira/browse/SPARK-21249 Project: Spark Issue Type: Question Components: Structured Streaming Affects Versions: 2.2.0 Reporter: Amit Baghel Priority: Minor I am working with 2.2.0-SNAPSHOT and Structured Streaming. Is it possible to use a File Sink with mapGroupsWithState? With append output mode I get the exception below. Exception in thread "main" org.apache.spark.sql.AnalysisException: mapGroupsWithState is not supported with Append output mode on a streaming DataFrame/Dataset;;
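For context, here is a minimal sketch (socket source, hypothetical paths and port) that reproduces the reported analysis error: in Spark 2.2, mapGroupsWithState only supports Update output mode, while the file sink requires Append, so the combination fails at query analysis time.
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

object MapGroupsWithStateFileSink {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[2]").appName("mgws-demo").getOrCreate()
    import spark.implicits._

    val events = spark.readStream
      .format("socket").option("host", "localhost").option("port", 9999)
      .load().as[String]

    // Running count per key, kept in GroupState across micro-batches.
    val counts = events
      .groupByKey(identity)
      .mapGroupsWithState(GroupStateTimeout.NoTimeout) {
        (key: String, values: Iterator[String], state: GroupState[Long]) =>
          val newCount = state.getOption.getOrElse(0L) + values.size
          state.update(newCount)
          (key, newCount)
      }

    // The file sink only supports Append mode, but mapGroupsWithState only
    // supports Update mode, so this start() throws the AnalysisException
    // quoted in the report.
    counts.writeStream
      .format("parquet")
      .option("path", "/tmp/mgws-out")                // hypothetical path
      .option("checkpointLocation", "/tmp/mgws-ckpt") // hypothetical path
      .outputMode(OutputMode.Append())
      .start()
      .awaitTermination()
  }
}
{code}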
[jira] [Commented] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR
[ https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067679#comment-16067679 ] Apache Spark commented on SPARK-21093: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/18463 > Multiple gapply execution occasionally failed in SparkR > > > Key: SPARK-21093 > URL: https://issues.apache.org/jira/browse/SPARK-21093 > Project: Spark > Issue Type: Bug > Components: SparkR > Affects Versions: 2.1.1, 2.2.0 > Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3 > Reporter: Hyukjin Kwon > Assignee: Hyukjin Kwon > Priority: Critical > Fix For: 2.3.0 > > > On CentOS 7.2.1511 with R 3.4.0/3.3.0, multiple executions of {{gapply}} occasionally fail as below: > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT > /_/ > SparkSession available as 'spark'. > > df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d")) > > collect(gapply(df, "a", function(key, x) { x }, schema(df))) > 17/06/14 18:21:01 WARN Utils: Truncated the string representation of a plan > since it was too large. This behavior can be adjusted by setting > 'spark.debug.maxToStringFields' in SparkEnv.conf. > a b c d > 1 1 1 1 0.1 > > collect(gapply(df, "a", function(key, x) { x }, schema(df))) > a b c d > 1 1 1 1 0.1 > > collect(gapply(df, "a", function(key, x) { x }, schema(df))) > a b c d > 1 1 1 1 0.1 > > collect(gapply(df, "a", function(key, x) { x }, schema(df))) > a b c d > 1 1 1 1 0.1 > > collect(gapply(df, "a", function(key, x) { x }, schema(df))) > a b c d > 1 1 1 1 0.1 > > collect(gapply(df, "a", function(key, x) { x }, schema(df))) > a b c d > 1 1 1 1 0.1 > > collect(gapply(df, "a", function(key, x) { x }, schema(df))) > Error in handleErrors(returnStatus, conn) : > org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 > in stage 14.0 failed 1 times, most recent failure: Lost task 98.0 in stage > 14.0 (TID 1305, localhost, executor driver): org.apache.spark.SparkException: > R computation failed with > at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108) > at > org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:432) > at > org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:414) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.a > ...
> *** buffer overflow detected ***: /usr/lib64/R/bin/exec/R terminated > === Backtrace: = > /lib64/libc.so.6(__fortify_fail+0x37)[0x7fe699b3f597] > /lib64/libc.so.6(+0x10c750)[0x7fe699b3d750] > /lib64/libc.so.6(+0x10e507)[0x7fe699b3f507] > /usr/lib64/R/modules//internet.so(+0x6015)[0x7fe689bb7015] > /usr/lib64/R/modules//internet.so(+0xe81e)[0x7fe689bbf81e] > /usr/lib64/R/lib/libR.so(+0xbd1b6)[0x7fe69c54a1b6] > /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0] > /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138] > /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af] > /usr/lib64/R/lib/libR.so(Rf_eval+0x354)[0x7fe69c5ad2f4] > /usr/lib64/R/lib/libR.so(+0x123f8e)[0x7fe69c5b0f8e] > /usr/lib64/R/lib/libR.so(Rf_eval+0x589)[0x7fe69c5ad529] > /usr/lib64/R/lib/libR.so(+0x1254ce)[0x7fe69c5b24ce] > /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0] > /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138] > /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af] > /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101] > /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138] > /usr/lib64/R/lib/libR.so(+0x120a7e)[0x7fe69c5ada7e] > /usr/lib64/R/lib/libR.so(Rf_eval+0x817)[0x7fe69c5ad7b7] > /usr/lib64/R/lib/libR.so(+0x1256d1)[0x7fe69c5b26d1] > /usr/lib64/R/lib/libR.so(+0x1552e9)[0x7fe69c5e22e9] > /usr/lib64/R/lib/libR.so(+0x11062a)[0x7fe69c59d62a] > /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138] > /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af] > /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101] > /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138] > /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af] > /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101] > /u
[jira] [Updated] (SPARK-21223) Thread-safety issue in FsHistoryProvider
[ https://issues.apache.org/jira/browse/SPARK-21223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zenglinxi updated SPARK-21223: -- Attachment: historyserver_jstack.txt BTW, this causes an infinite-loop problem when we restart the history server and replay the event logs of Spark apps. > Thread-safety issue in FsHistoryProvider > - > > Key: SPARK-21223 > URL: https://issues.apache.org/jira/browse/SPARK-21223 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.1.1 > Reporter: zenglinxi > Attachments: historyserver_jstack.txt > > > Currently, the Spark HistoryServer uses a HashMap named fileToAppInfo in class FsHistoryProvider to map each event log path to its attempt info. > When a thread pool is used to replay the log files in the list and merge the list of old applications with new ones, multiple threads may update fileToAppInfo at the same time, which may cause thread-safety issues. > {code:java}
> for (file <- logInfos) {
>    tasks += replayExecutor.submit(new Runnable {
>      override def run(): Unit = mergeApplicationListing(file)
>    })
> }
> {code}
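One straightforward remedy (a sketch of the general technique, not necessarily the patch that eventually landed) is to replace the plain HashMap with a java.util.concurrent.ConcurrentHashMap, so the per-file updates issued by the replay thread pool no longer race; the AttemptInfo stand-in type below is hypothetical.
{code:scala}
import java.util.concurrent.ConcurrentHashMap

// Hypothetical stand-in for the attempt metadata stored per event log path.
case class AttemptInfo(appId: String, attemptId: Option[String])

class ListingStore {
  // ConcurrentHashMap gives atomic per-key updates, so concurrent
  // mergeApplicationListing(file) tasks can record results safely.
  private val fileToAppInfo = new ConcurrentHashMap[String, AttemptInfo]()

  // Called concurrently from the replay executor's tasks.
  def record(logPath: String, info: AttemptInfo): Unit =
    fileToAppInfo.put(logPath, info)

  def lookup(logPath: String): Option[AttemptInfo] =
    Option(fileToAppInfo.get(logPath))
}
{code}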
[jira] [Resolved] (SPARK-21237) Invalidate stats once table data is changed
[ https://issues.apache.org/jira/browse/SPARK-21237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21237. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18449 [https://github.com/apache/spark/pull/18449] > Invalidate stats once table data is changed > --- > > Key: SPARK-21237 > URL: https://issues.apache.org/jira/browse/SPARK-21237 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 2.3.0 > Reporter: Zhenhua Wang > Fix For: 2.3.0
[jira] [Assigned] (SPARK-21237) Invalidate stats once table data is changed
[ https://issues.apache.org/jira/browse/SPARK-21237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21237: --- Assignee: Zhenhua Wang > Invalidate stats once table data is changed > --- > > Key: SPARK-21237 > URL: https://issues.apache.org/jira/browse/SPARK-21237 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 2.3.0 > Reporter: Zhenhua Wang > Assignee: Zhenhua Wang > Fix For: 2.3.0
[jira] [Resolved] (SPARK-21229) remove QueryPlan.preCanonicalized
[ https://issues.apache.org/jira/browse/SPARK-21229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21229. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18440 [https://github.com/apache/spark/pull/18440] > remove QueryPlan.preCanonicalized > - > > Key: SPARK-21229 > URL: https://issues.apache.org/jira/browse/SPARK-21229 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.3.0 > Reporter: Wenchen Fan > Assignee: Wenchen Fan > Priority: Minor > Fix For: 2.3.0
[jira] [Issue Comment Deleted] (SPARK-21208) Ability to "setLocalProperty" from sc, in sparkR
[ https://issues.apache.org/jira/browse/SPARK-21208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-21208: - Comment: was deleted (was: User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/18431) > Ability to "setLocalProperty" from sc, in sparkR > > > Key: SPARK-21208 > URL: https://issues.apache.org/jira/browse/SPARK-21208 > Project: Spark > Issue Type: Bug > Components: SparkR > Affects Versions: 2.1.1 > Reporter: Karuppayya > > Checked the API [documentation|https://spark.apache.org/docs/latest/api/R/index.html] for SparkR. > Was not able to find a way to *setLocalProperty* on sc. > Need the ability to *setLocalProperty* on the SparkContext (similar to what is available for PySpark and Scala)
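For reference, this is the Scala API the report asks to mirror in SparkR: SparkContext.setLocalProperty attaches a key/value pair to the calling thread, commonly used to route jobs to a fair-scheduler pool. A minimal local-mode sketch:
{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

object LocalPropertyDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("local-property-demo").setMaster("local[2]"))

    // Thread-local: only jobs submitted from this thread are affected.
    sc.setLocalProperty("spark.scheduler.pool", "production")
    sc.parallelize(1 to 10).count()

    // Passing null clears the property for this thread.
    sc.setLocalProperty("spark.scheduler.pool", null)
    sc.stop()
  }
}
{code}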
[jira] [Assigned] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR
[ https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21093: Assignee: Apache Spark (was: Hyukjin Kwon) > Multiple gapply execution occasionally failed in SparkR > > > Key: SPARK-21093 > URL: https://issues.apache.org/jira/browse/SPARK-21093 > Project: Spark > Issue Type: Bug > Components: SparkR > Affects Versions: 2.1.1, 2.2.0 > Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3 > Reporter: Hyukjin Kwon > Assignee: Apache Spark > Priority: Critical > Fix For: 2.3.0
[jira] [Assigned] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR
[ https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21093: Assignee: Hyukjin Kwon (was: Apache Spark) > Multiple gapply execution occasionally failed in SparkR > > > Key: SPARK-21093 > URL: https://issues.apache.org/jira/browse/SPARK-21093 > Project: Spark > Issue Type: Bug > Components: SparkR > Affects Versions: 2.1.1, 2.2.0 > Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3 > Reporter: Hyukjin Kwon > Assignee: Hyukjin Kwon > Priority: Critical > Fix For: 2.3.0
[jira] [Comment Edited] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR
[ https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067637#comment-16067637 ] Felix Cheung edited comment on SPARK-21093 at 6/29/17 3:12 AM: --- this was reverted. we seem to be getting random test terminations with error code -10 after this was merged. was (Author: felixcheung): this was reverted. > Multiple gapply execution occasionally failed in SparkR > > > Key: SPARK-21093 > URL: https://issues.apache.org/jira/browse/SPARK-21093 > Project: Spark > Issue Type: Bug > Components: SparkR > Affects Versions: 2.1.1, 2.2.0 > Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3 > Reporter: Hyukjin Kwon > Assignee: Hyukjin Kwon > Priority: Critical > Fix For: 2.3.0
[jira] [Reopened] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR
[ https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung reopened SPARK-21093: -- this was reverted. > Multiple gapply execution occasionally failed in SparkR > > > Key: SPARK-21093 > URL: https://issues.apache.org/jira/browse/SPARK-21093 > Project: Spark > Issue Type: Bug > Components: SparkR > Affects Versions: 2.1.1, 2.2.0 > Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3 > Reporter: Hyukjin Kwon > Assignee: Hyukjin Kwon > Priority: Critical > Fix For: 2.3.0
[jira] [Commented] (SPARK-21246) Unexpected Data Type conversion from LONG to BIGINT
[ https://issues.apache.org/jira/browse/SPARK-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067620#comment-16067620 ] Yuming Wang commented on SPARK-21246: - {{Seq(3)}} should be {{Seq(3L)}}. This works for me:
{code:java}
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val schemaString = "name"
val lstVals = Seq(3L)
val rowRdd = sc.parallelize(lstVals).map(x => Row(x))
rowRdd.collect()

// Generate the schema based on the string of schema
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, LongType, nullable = true))
val schema = StructType(fields)
print(schema)

val peopleDF = spark.createDataFrame(rowRdd, schema)
peopleDF.show()
{code}
> Unexpected Data Type conversion from LONG to BIGINT > --- > > Key: SPARK-21246 > URL: https://issues.apache.org/jira/browse/SPARK-21246 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 1.6.1 > Environment: Using Zeppelin Notebook or Spark Shell > Reporter: Monica Raj > > The unexpected conversion occurred when creating a data frame out of an existing data collection. The following code can be run in a Zeppelin notebook to reproduce the bug: > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > val schemaString = "name" > val lstVals = Seq(3) > val rowRdd = sc.parallelize(lstVals).map(x => Row( x )) > rowRdd.collect() > // Generate the schema based on the string of schema > val fields = schemaString.split(" ") > .map(fieldName => StructField(fieldName, LongType, nullable = true)) > val schema = StructType(fields) > print(schema) > val peopleDF = sqlContext.createDataFrame(rowRdd, schema)
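A brief aside on why the literal matters (a sketch; the runtime failure mode stated here is an assumption): Row(3) boxes a java.lang.Integer, while the declared schema promises LongType, so Spark typically fails with a cast error only when it materializes the row; Row(3L) boxes a java.lang.Long and matches the schema.
{code:scala}
import org.apache.spark.sql.Row

object WhyTheLongLiteralMatters {
  def main(args: Array[String]): Unit = {
    val intRow  = Row(3)   // stores a boxed java.lang.Integer
    val longRow = Row(3L)  // stores a boxed java.lang.Long

    // Rows are untyped containers; a mismatch with LongType only
    // surfaces when Spark reads the value back using the schema.
    println(intRow.get(0).getClass)   // class java.lang.Integer
    println(longRow.get(0).getClass)  // class java.lang.Long
  }
}
{code}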
[jira] [Assigned] (SPARK-21224) Support a DDL-formatted string as schema in reading for R
[ https://issues.apache.org/jira/browse/SPARK-21224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung reassigned SPARK-21224: Assignee: Hyukjin Kwon > Support a DDL-formatted string as schema in reading for R > - > > Key: SPARK-21224 > URL: https://issues.apache.org/jira/browse/SPARK-21224 > Project: Spark > Issue Type: Improvement > Components: SparkR > Affects Versions: 2.2.0 > Reporter: Hyukjin Kwon > Assignee: Hyukjin Kwon > Priority: Minor > > This might have to be a followup for SPARK-20431, but I decided to make this separate for R specifically as many PRs might be confusing. > Please refer to the discussion in the PR and SPARK-20431. > In a simple view, this JIRA describes the support for a DDL-formatted string as schema as below: > {code}
> mockLines <- c("{\"name\":\"Michael\"}",
>                "{\"name\":\"Andy\", \"age\":30}",
>                "{\"name\":\"Justin\", \"age\":19}")
> jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
> writeLines(mockLines, jsonPath)
> df <- read.df(jsonPath, "json", "name STRING, age DOUBLE")
> collect(df)
> {code}
[jira] [Commented] (SPARK-21224) Support a DDL-formatted string as schema in reading for R
[ https://issues.apache.org/jira/browse/SPARK-21224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067608#comment-16067608 ] Felix Cheung commented on SPARK-21224: -- let's add this to from_json, gapply, dapply too, as discussed? feel free to open a new JIRA if you think that's better. > Support a DDL-formatted string as schema in reading for R > - > > Key: SPARK-21224 > URL: https://issues.apache.org/jira/browse/SPARK-21224 > Project: Spark > Issue Type: Improvement > Components: SparkR > Affects Versions: 2.2.0 > Reporter: Hyukjin Kwon > Priority: Minor
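For context, a sketch of the Scala-side counterpart this work mirrors, DataFrameReader.schema accepting a DDL-formatted string (added along the SPARK-20431 line of work; the JSON path below is hypothetical):
{code:scala}
import org.apache.spark.sql.SparkSession

object DdlSchemaRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[2]").appName("ddl-schema").getOrCreate()

    // A DDL-formatted string in place of an explicit StructType.
    val df = spark.read
      .schema("name STRING, age DOUBLE")
      .json("/tmp/people.json") // hypothetical input path

    df.printSchema() // name: string, age: double
    spark.stop()
  }
}
{code}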
[jira] [Updated] (SPARK-21225) decrease the Mem using for variable 'tasks' in function resourceOffers
[ https://issues.apache.org/jira/browse/SPARK-21225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiang Xingbo updated SPARK-21225: - Issue Type: Bug (was: Improvement) > decrease the Mem using for variable 'tasks' in function resourceOffers > -- > > Key: SPARK-21225 > URL: https://issues.apache.org/jira/browse/SPARK-21225 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.1.0, 2.1.1 > Reporter: yangZhiguo > Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > The function 'resourceOffers' declares a variable 'tasks' to store the tasks that have been allocated an executor. It is declared like this: > *{color:#d04437}val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores)){color}* > This code only considers the situation of one task per core. If the user configures "spark.task.cpus" as 2 or 3, it doesn't need that much space (see the sizing check after this message). It can be modified as follows: > {color:#14892c}*val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](Math.ceil(o.cores*1.0/CPUS_PER_TASK).toInt))*{color}
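A quick check of the proposed sizing with made-up numbers (12 cores per offer and spark.task.cpus = 3): the current code reserves room for 12 TaskDescriptions, while at most ceil(12 / 3) = 4 tasks can actually be scheduled on that offer.
{code:scala}
object BufferSizingCheck {
  def main(args: Array[String]): Unit = {
    val cores = 12       // hypothetical cores in one WorkerOffer
    val cpusPerTask = 3  // hypothetical spark.task.cpus

    val currentCapacity  = cores // one slot per core
    val proposedCapacity = math.ceil(cores * 1.0 / cpusPerTask).toInt

    // prints: current = 12, proposed = 4
    println(s"current = $currentCapacity, proposed = $proposedCapacity")
  }
}
{code}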
[jira] [Resolved] (SPARK-14657) RFormula output wrong features when formula w/o intercept
[ https://issues.apache.org/jira/browse/SPARK-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-14657. - Resolution: Fixed Fix Version/s: 2.3.0 > RFormula output wrong features when formula w/o intercept > - > > Key: SPARK-14657 > URL: https://issues.apache.org/jira/browse/SPARK-14657 > Project: Spark > Issue Type: Bug > Components: ML > Affects Versions: 2.2.0 > Reporter: Yanbo Liang > Assignee: Yanbo Liang > Fix For: 2.3.0 > > > SparkR::glm outputs different features compared with R's glm when fitting w/o an intercept on string/category features. In the following example, SparkR outputs three features compared with four for native R. > SparkR::glm > {quote}
> training <- suppressWarnings(createDataFrame(sqlContext, iris))
> model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
> summary(model)
> Coefficients:
>                     Estimate  Std. Error  t value  Pr(>|t|)
> Sepal_Length        0.67468   0.0093013   72.536   0
> Species_versicolor  -1.2349   0.07269     -16.989  0
> Species_virginica   -1.4708   0.077397    -19.003  0
> {quote}
> stats::glm > {quote}
> summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
> Coefficients:
>                    Estimate  Std. Error  t value  Pr(>|t|)
> Sepal.Length       0.3499    0.0463      7.557    4.19e-12 ***
> Speciessetosa      1.6765    0.2354      7.123    4.46e-11 ***
> Speciesversicolor  0.6931    0.2779      2.494    0.0137 *
> Speciesvirginica   0.6690    0.3078      2.174    0.0313 *
> {quote}
> The encoder for string/category features is different: R does not drop any category, but SparkR drops the last one. > I searched online and tested some other cases, and found that when we fit an R glm model (or other models powered by R formula) w/o an intercept on a dataset including string/category features, one of the categories of the first category feature is used as the reference category, and no category is dropped for that feature. > I think we should keep consistent semantics between Spark RFormula and R formula. > cc [~mengxr]
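On the Spark ML side, the behavior under discussion can be observed with ml.feature.RFormula and a no-intercept formula ("- 1"); a toy sketch (made-up data, not the iris dataset) for inspecting how the string column is encoded, which per this issue's description should keep all categories as R does:
{code:scala}
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.SparkSession

object RFormulaNoIntercept {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[2]").appName("rformula-demo").getOrCreate()

    val df = spark.createDataFrame(Seq(
      (3.5, 5.1, "setosa"),
      (3.2, 6.4, "versicolor"),
      (3.0, 5.9, "virginica")
    )).toDF("width", "length", "species")

    // "- 1" removes the intercept; the categorical column's encoding in the
    // resulting feature vectors shows how many categories are kept.
    val formula = new RFormula().setFormula("width ~ length + species - 1")
    formula.fit(df).transform(df).select("features", "label").show(false)
    spark.stop()
  }
}
{code}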
[jira] [Resolved] (SPARK-21222) Move elimination of Distinct clause from analyzer to optimizer
[ https://issues.apache.org/jira/browse/SPARK-21222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21222. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18429 [https://github.com/apache/spark/pull/18429] > Move elimination of Distinct clause from analyzer to optimizer > -- > > Key: SPARK-21222 > URL: https://issues.apache.org/jira/browse/SPARK-21222 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.2.0 > Reporter: Gengliang Wang > Priority: Minor > Fix For: 2.3.0 > > > Move elimination of the Distinct clause from the analyzer to the optimizer. > A Distinct clause is useless under a MAX/MIN clause. For example, > "SELECT MAX(DISTINCT a) FROM src" > is equivalent to > "SELECT MAX(a) FROM src" > However, this optimization is implemented in the analyzer. It should be in the optimizer.
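A small local-mode sketch (made-up table) illustrating why the DISTINCT is redundant under MAX, and hence why the rewrite is safe: duplicates cannot change a maximum, so both queries return the same value.
{code:scala}
import org.apache.spark.sql.SparkSession

object DistinctUnderMax {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[2]").appName("distinct-max").getOrCreate()
    import spark.implicits._

    Seq(1, 2, 2, 3).toDF("a").createOrReplaceTempView("src")

    // Both queries print 3, with or without DISTINCT.
    spark.sql("SELECT MAX(DISTINCT a) FROM src").show()
    spark.sql("SELECT MAX(a) FROM src").show()
    spark.stop()
  }
}
{code}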
[jira] [Assigned] (SPARK-21222) Move elimination of Distinct clause from analyzer to optimizer
[ https://issues.apache.org/jira/browse/SPARK-21222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21222: --- Assignee: Gengliang Wang > Move elimination of Distinct clause from analyzer to optimizer > -- > > Key: SPARK-21222 > URL: https://issues.apache.org/jira/browse/SPARK-21222 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.2.0 > Reporter: Gengliang Wang > Assignee: Gengliang Wang > Priority: Minor > Fix For: 2.3.0
[jira] [Commented] (SPARK-18441) Add Smote in spark mlib and ml
[ https://issues.apache.org/jira/browse/SPARK-18441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067494#comment-16067494 ] yuhao yang commented on SPARK-18441: Moved the Smote code to https://gist.github.com/hhbyyh/346467373014943a7f20df208caeb19b > Add Smote in spark mlib and ml > -- > > Key: SPARK-18441 > URL: https://issues.apache.org/jira/browse/SPARK-18441 > Project: Spark > Issue Type: Wish > Components: ML, MLlib > Affects Versions: 2.0.1 > Reporter: lichenglin > > Please add SMOTE to Spark MLlib and ML to handle imbalanced training data for classification.
[jira] [Commented] (SPARK-21248) Flaky test: o.a.s.sql.kafka010.KafkaSourceSuite.assign from specific offsets (failOnDataLoss: true)
[ https://issues.apache.org/jira/browse/SPARK-21248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067425#comment-16067425 ] Apache Spark commented on SPARK-21248: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/18461 > Flaky test: o.a.s.sql.kafka010.KafkaSourceSuite.assign from specific offsets (failOnDataLoss: true) > --- > > Key: SPARK-21248 > URL: https://issues.apache.org/jira/browse/SPARK-21248 > Project: Spark > Issue Type: Bug > Components: Structured Streaming > Affects Versions: 2.3.0 > Reporter: Shixiong Zhu > > See > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/3173/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/assign_from_specific_offsets__failOnDataLoss__true_/
[jira] [Assigned] (SPARK-21248) Flaky test: o.a.s.sql.kafka010.KafkaSourceSuite.assign from specific offsets (failOnDataLoss: true)
[ https://issues.apache.org/jira/browse/SPARK-21248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21248: Assignee: Apache Spark > Flaky test: o.a.s.sql.kafka010.KafkaSourceSuite.assign from specific offsets (failOnDataLoss: true) > --- > > Key: SPARK-21248 > URL: https://issues.apache.org/jira/browse/SPARK-21248 > Project: Spark > Issue Type: Bug > Components: Structured Streaming > Affects Versions: 2.3.0 > Reporter: Shixiong Zhu > Assignee: Apache Spark
[jira] [Assigned] (SPARK-21248) Flaky test: o.a.s.sql.kafka010.KafkaSourceSuite.assign from specific offsets (failOnDataLoss: true)
[ https://issues.apache.org/jira/browse/SPARK-21248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21248: Assignee: (was: Apache Spark) > Flaky test: o.a.s.sql.kafka010.KafkaSourceSuite.assign from specific offsets (failOnDataLoss: true) > --- > > Key: SPARK-21248 > URL: https://issues.apache.org/jira/browse/SPARK-21248 > Project: Spark > Issue Type: Bug > Components: Structured Streaming > Affects Versions: 2.3.0 > Reporter: Shixiong Zhu
[jira] [Created] (SPARK-21248) Flaky test: o.a.s.sql.kafka010.KafkaSourceSuite.assign from specific offsets (failOnDataLoss: true)
Shixiong Zhu created SPARK-21248: Summary: Flaky test: o.a.s.sql.kafka010.KafkaSourceSuite.assign from specific offsets (failOnDataLoss: true) Key: SPARK-21248 URL: https://issues.apache.org/jira/browse/SPARK-21248 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Shixiong Zhu {code} org.scalatest.exceptions.TestFailedException: Stream Thread Died: null java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201) org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:335) == Progress ==AssertOnQuery(, )CheckAnswer: [-20],[-21],[-22],[0],[1],[2],[11],[12],[22]StopStream StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@6c63901,Map()) CheckAnswer: [-20],[-21],[-22],[0],[1],[2],[11],[12],[22] AddKafkaData(topics = Set(topic-7), data = WrappedArray(30, 31, 32, 33, 34), message = )CheckAnswer: [-20],[-21],[-22],[0],[1],[2],[11],[12],[22],[30],[31],[32],[33],[34] StopStream == Stream == Output Mode: Append Stream state: not started Thread state: dead java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) at org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) at org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:335) at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:375) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:211) == Sink == 0: [-20] [-21] [-22] [22] [11] [12] [0] [1] [2] 1: [30] 2: [33] [31] [32] [34] == Plan == {code} See https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/3173/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/assign_from_specific_offsets__failOnDataLoss__true_/
[jira] [Assigned] (SPARK-21247) Allow case-insensitive type equality in Set operation
[ https://issues.apache.org/jira/browse/SPARK-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21247: Assignee: (was: Apache Spark) > Allow case-insensitive type equality in Set operation > - > > Key: SPARK-21247 > URL: https://issues.apache.org/jira/browse/SPARK-21247 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Dongjoon Hyun > > Spark supports case-sensitivity in columns. Especially, for Struct types, > with the case-sensitive option, the following is supported. > {code} > scala> sql("select named_struct('a', 1, 'A', 2).a").show > +--+ > |named_struct(a, 1, A, 2).a| > +--+ > | 1| > +--+ > scala> sql("select named_struct('a', 1, 'A', 2).A").show > +--+ > |named_struct(a, 1, A, 2).A| > +--+ > | 2| > +--+ > {code} > And vice versa, with case sensitivity `false`, the following is supported. > {code} > scala> sql("select named_struct('a', 1).A, named_struct('A', 1).a").show > +++ > |named_struct(a, 1).A|named_struct(A, 1).a| > +++ > | 1| 1| > +++ > {code} > This issue aims to support case-insensitive type comparisons in Set > operation. Currently, SET operations fail due to a case-sensitive type > comparison failure. > {code} > scala> sql("SELECT struct(1 a) UNION ALL (SELECT struct(2 A))").show > org.apache.spark.sql.AnalysisException: Union can only be performed on tables > with the compatible column types. struct<a:int> <> struct<A:int> at the first > column of the second table;; > 'Union > :- Project [named_struct(a, 1) AS named_struct(a, 1 AS `a`)#2] > : +- OneRowRelation$ > +- Project [named_struct(A, 2) AS named_struct(A, 2 AS `A`)#3] >+- OneRowRelation$ > {code} > Please note that this issue does not aim to change all type comparison > semantics. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21247) Allow case-insensitive type equality in Set operation
[ https://issues.apache.org/jira/browse/SPARK-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067379#comment-16067379 ] Apache Spark commented on SPARK-21247: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/18460 > Allow case-insensitive type equality in Set operation > - > > Key: SPARK-21247 > URL: https://issues.apache.org/jira/browse/SPARK-21247 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Dongjoon Hyun > > Spark supports case-sensitivity in columns. Especially, for Struct types, > with the case-sensitive option, the following is supported. > {code} > scala> sql("select named_struct('a', 1, 'A', 2).a").show > +--+ > |named_struct(a, 1, A, 2).a| > +--+ > | 1| > +--+ > scala> sql("select named_struct('a', 1, 'A', 2).A").show > +--+ > |named_struct(a, 1, A, 2).A| > +--+ > | 2| > +--+ > {code} > And vice versa, with case sensitivity `false`, the following is supported. > {code} > scala> sql("select named_struct('a', 1).A, named_struct('A', 1).a").show > +++ > |named_struct(a, 1).A|named_struct(A, 1).a| > +++ > | 1| 1| > +++ > {code} > This issue aims to support case-insensitive type comparisons in Set > operation. Currently, SET operations fail due to a case-sensitive type > comparison failure. > {code} > scala> sql("SELECT struct(1 a) UNION ALL (SELECT struct(2 A))").show > org.apache.spark.sql.AnalysisException: Union can only be performed on tables > with the compatible column types. struct<a:int> <> struct<A:int> at the first > column of the second table;; > 'Union > :- Project [named_struct(a, 1) AS named_struct(a, 1 AS `a`)#2] > : +- OneRowRelation$ > +- Project [named_struct(A, 2) AS named_struct(A, 2 AS `A`)#3] >+- OneRowRelation$ > {code} > Please note that this issue does not aim to change all type comparison > semantics. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21247) Allow case-insensitive type equality in Set operation
[ https://issues.apache.org/jira/browse/SPARK-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21247: Assignee: Apache Spark > Allow case-insensitive type equality in Set operation > - > > Key: SPARK-21247 > URL: https://issues.apache.org/jira/browse/SPARK-21247 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Dongjoon Hyun >Assignee: Apache Spark > > Spark supports case-sensitivity in columns. Especially, for Struct types, > with the case-sensitive option, the following is supported. > {code} > scala> sql("select named_struct('a', 1, 'A', 2).a").show > +--+ > |named_struct(a, 1, A, 2).a| > +--+ > | 1| > +--+ > scala> sql("select named_struct('a', 1, 'A', 2).A").show > +--+ > |named_struct(a, 1, A, 2).A| > +--+ > | 2| > +--+ > {code} > And vice versa, with case sensitivity `false`, the following is supported. > {code} > scala> sql("select named_struct('a', 1).A, named_struct('A', 1).a").show > +++ > |named_struct(a, 1).A|named_struct(A, 1).a| > +++ > | 1| 1| > +++ > {code} > This issue aims to support case-insensitive type comparisons in Set > operation. Currently, SET operations fail due to a case-sensitive type > comparison failure. > {code} > scala> sql("SELECT struct(1 a) UNION ALL (SELECT struct(2 A))").show > org.apache.spark.sql.AnalysisException: Union can only be performed on tables > with the compatible column types. struct<a:int> <> struct<A:int> at the first > column of the second table;; > 'Union > :- Project [named_struct(a, 1) AS named_struct(a, 1 AS `a`)#2] > : +- OneRowRelation$ > +- Project [named_struct(A, 2) AS named_struct(A, 2 AS `A`)#3] >+- OneRowRelation$ > {code} > Please note that this issue does not aim to change all type comparison > semantics. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21247) Allow case-insensitive type equality in Set operation
[ https://issues.apache.org/jira/browse/SPARK-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-21247: -- Summary: Allow case-insensitive type equality in Set operation (was: Allow case-insensitive type comparisions in Set operation) > Allow case-insensitive type equality in Set operation > - > > Key: SPARK-21247 > URL: https://issues.apache.org/jira/browse/SPARK-21247 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Dongjoon Hyun > > Spark supports case-sensitivity in columns. Especially, for Struct types, > with the case-sensitive option, the following is supported. > {code} > scala> sql("select named_struct('a', 1, 'A', 2).a").show > +--+ > |named_struct(a, 1, A, 2).a| > +--+ > | 1| > +--+ > scala> sql("select named_struct('a', 1, 'A', 2).A").show > +--+ > |named_struct(a, 1, A, 2).A| > +--+ > | 2| > +--+ > {code} > And vice versa, with case sensitivity `false`, the following is supported. > {code} > scala> sql("select named_struct('a', 1).A, named_struct('A', 1).a").show > +++ > |named_struct(a, 1).A|named_struct(A, 1).a| > +++ > | 1| 1| > +++ > {code} > This issue aims to support case-insensitive type comparisons in Set > operation. Currently, SET operations fail due to a case-sensitive type > comparison failure. > {code} > scala> sql("SELECT struct(1 a) UNION ALL (SELECT struct(2 A))").show > org.apache.spark.sql.AnalysisException: Union can only be performed on tables > with the compatible column types. struct<a:int> <> struct<A:int> at the first > column of the second table;; > 'Union > :- Project [named_struct(a, 1) AS named_struct(a, 1 AS `a`)#2] > : +- OneRowRelation$ > +- Project [named_struct(A, 2) AS named_struct(A, 2 AS `A`)#3] >+- OneRowRelation$ > {code} > Please note that this issue does not aim to change all type comparison > semantics. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21247) Allow case-insensitive type comparisions in Set operation
Dongjoon Hyun created SPARK-21247: - Summary: Allow case-insensitive type comparisions in Set operation Key: SPARK-21247 URL: https://issues.apache.org/jira/browse/SPARK-21247 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.1 Reporter: Dongjoon Hyun Spark supports case-sensitivity in columns. Especially, for Struct types, with the case-sensitive option, the following is supported. {code} scala> sql("select named_struct('a', 1, 'A', 2).a").show +--+ |named_struct(a, 1, A, 2).a| +--+ | 1| +--+ scala> sql("select named_struct('a', 1, 'A', 2).A").show +--+ |named_struct(a, 1, A, 2).A| +--+ | 2| +--+ {code} And vice versa, with case sensitivity `false`, the following is supported. {code} scala> sql("select named_struct('a', 1).A, named_struct('A', 1).a").show +++ |named_struct(a, 1).A|named_struct(A, 1).a| +++ | 1| 1| +++ {code} This issue aims to support case-insensitive type comparisons in Set operation. Currently, SET operations fail due to a case-sensitive type comparison failure. {code} scala> sql("SELECT struct(1 a) UNION ALL (SELECT struct(2 A))").show org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<a:int> <> struct<A:int> at the first column of the second table;; 'Union :- Project [named_struct(a, 1) AS named_struct(a, 1 AS `a`)#2] : +- OneRowRelation$ +- Project [named_struct(A, 2) AS named_struct(A, 2 AS `A`)#3] +- OneRowRelation$ {code} Please note that this issue does not aim to change all type comparison semantics. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
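Until the type comparison is relaxed, one workaround is to alias the struct field with the same case on both sides, so the branch types already match exactly; a minimal spark-shell sketch:
{code}
// Both branches now produce the exact type struct<a:int>, so the Union resolves.
scala> sql("SELECT struct(1 a) UNION ALL (SELECT struct(2 a))").show
{code}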
[jira] [Commented] (SPARK-21242) Allow spark executors to function in mesos w/ container networking enabled
[ https://issues.apache.org/jira/browse/SPARK-21242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067257#comment-16067257 ] John Leach commented on SPARK-21242: [~mgummelt] We are using this for our service and it seems to be working, integrated with Calico. Thanks for SPARK-18232, which gave us a bit of guidance. > Allow spark executors to function in mesos w/ container networking enabled > -- > > Key: SPARK-21242 > URL: https://issues.apache.org/jira/browse/SPARK-21242 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.1.1 >Reporter: Tara Gildersleeve > Attachments: patch_1.patch > > > Allow spark executors to function in mesos w/ container networking enabled -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21184) QuantileSummaries implementation is wrong and QuantileSummariesSuite fails with larger n
[ https://issues.apache.org/jira/browse/SPARK-21184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066833#comment-16067167 ] Andrew Ray commented on SPARK-21184: Also the lookup queries are just wrong {code} scala> Seq(1, 2).toDF("a").selectExpr("percentile_approx(a, 0.001)").head res9: org.apache.spark.sql.Row = [2.0] {code} > QuantileSummaries implementation is wrong and QuantileSummariesSuite fails > with larger n > > > Key: SPARK-21184 > URL: https://issues.apache.org/jira/browse/SPARK-21184 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: Andrew Ray > > 1. QuantileSummaries implementation does not match the paper it is supposed > to be based on. > 1a. The compress method > (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L240) > merges neighboring buckets, but that's not what the paper says to do. The > paper > (http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf) > describes an implicit tree structure and the compress method deletes selected > subtrees. > 1b. The paper does not discuss merging these summary data structures at all. > The following comment is in the merge method of QuantileSummaries: > {quote} // The GK algorithm is a bit unclear about it, but it seems > there is no need to adjust the > // statistics during the merging: the invariants are still respected > after the merge.{quote} > Unless I'm missing something, that needs substantiation; it's not clear > that the invariants hold. > 2. QuantileSummariesSuite fails with n = 1 (and other non-trivial values) > https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/QuantileSummariesSuite.scala#L27 > One possible solution if these issues can't be resolved would be to move to > an algorithm that explicitly supports merging and is well tested, like > https://github.com/tdunning/t-digest -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
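For comparison, the exact aggregate makes the expected answer concrete; a quick sanity check (a sketch, assuming the built-in exact {{percentile}} function as ground truth):
{code}
// Exact percentile: the 0.001 quantile of {1, 2} interpolates to ~1.001,
// so percentile_approx should return a value near 1, not 2.
scala> Seq(1, 2).toDF("a").selectExpr("percentile(a, 0.001)").head
{code}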
[jira] [Created] (SPARK-21246) Unexpected Data Type conversion from LONG to BIGINT
Monica Raj created SPARK-21246: -- Summary: Unexpected Data Type conversion from LONG to BIGINT Key: SPARK-21246 URL: https://issues.apache.org/jira/browse/SPARK-21246 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.1 Environment: Using Zeppelin Notebook or Spark Shell Reporter: Monica Raj The unexpected conversion occurred when creating a data frame out of an existing data collection. The following code can be run in a Zeppelin notebook to reproduce the bug:

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val schemaString = "name"
val lstVals = Seq(3)
val rowRdd = sc.parallelize(lstVals).map(x => Row(x))
rowRdd.collect()

// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, LongType, nullable = true))
val schema = StructType(fields)
print(schema)
val peopleDF = sqlContext.createDataFrame(rowRdd, schema)

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
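For context, the LONG/BIGINT pair here is likely just two names for the same type: Spark renders {{LongType}} under its SQL name {{bigint}} in schema strings. A quick spark-shell check:
{code}
scala> import org.apache.spark.sql.types._
scala> LongType.simpleString   // the SQL name used when schemas are printed
res0: String = bigint
{code}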
[jira] [Created] (SPARK-21245) Resolve code duplication for classification/regression summarizers
Seth Hendrickson created SPARK-21245: Summary: Resolve code duplication for classification/regression summarizers Key: SPARK-21245 URL: https://issues.apache.org/jira/browse/SPARK-21245 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.2.1 Reporter: Seth Hendrickson In several places (LogReg, LinReg, SVC) in Spark ML, we collect summary information about training data using {{MultivariateOnlineSummarizer}} and {{MulticlassSummarizer}}. We have the same code appearing in several places (including test suites). We can eliminate this by creating a common implementation somewhere. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
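A minimal sketch of the kind of shared helper this could become (the object and method names are hypothetical, not the eventual API):
{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
import org.apache.spark.rdd.RDD

// Hypothetical shared utility: one aggregation pass reused by LogReg/LinReg/SVC.
object SummarizerUtils {
  def summarizeFeatures(instances: RDD[Vector]): MultivariateOnlineSummarizer =
    instances.treeAggregate(new MultivariateOnlineSummarizer)(
      seqOp = (summary, features) => summary.add(features),
      combOp = (s1, s2) => s1.merge(s2))
}
{code}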
[jira] [Commented] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas
[ https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066938#comment-16066938 ] Apache Spark commented on SPARK-13534: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/18459 > Implement Apache Arrow serializer for Spark DataFrame for use in > DataFrame.toPandas > --- > > Key: SPARK-13534 > URL: https://issues.apache.org/jira/browse/SPARK-13534 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Wes McKinney >Assignee: Bryan Cutler > Fix For: 2.3.0 > > Attachments: benchmark.py > > > The current code path for accessing Spark DataFrame data in Python using > PySpark passes through an inefficient serialization-deserialization process > that I've examined at a high level here: > https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] > objects are being deserialized in pure Python as a list of tuples, which are > then converted to pandas.DataFrame using its {{from_records}} alternate > constructor. This also uses a large amount of memory. > For flat (no nested types) schemas, the Apache Arrow memory layout > (https://github.com/apache/arrow/tree/master/format) can be deserialized to > {{pandas.DataFrame}} objects with comparatively small overhead compared with > memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, > replacing the corresponding null values with pandas's sentinel values (None > or NaN as appropriate). > I will be contributing patches to Arrow in the coming weeks for converting > between Arrow and pandas in the general case, so if Spark can send Arrow > memory to PySpark, we will hopefully be able to increase the Python data > access throughput by an order of magnitude or more. I propose to add a new > serializer for Spark DataFrame and a new method that can be invoked from > PySpark to request an Arrow memory-layout byte stream, prefixed by a data > header indicating array buffer offsets and sizes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21216) Streaming DataFrames fail to join with Hive tables
[ https://issues.apache.org/jira/browse/SPARK-21216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-21216. -- Resolution: Fixed Fix Version/s: 2.3.0 > Streaming DataFrames fail to join with Hive tables > -- > > Key: SPARK-21216 > URL: https://issues.apache.org/jira/browse/SPARK-21216 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.1 >Reporter: Burak Yavuz >Assignee: Burak Yavuz > Fix For: 2.3.0 > > > The following code will throw a cryptic exception: > {code} > import org.apache.spark.sql.execution.streaming.MemoryStream > import testImplicits._ > implicit val _sqlContext = spark.sqlContext > Seq((1, "one"), (2, "two"), (4, "four")).toDF("number", > "word").createOrReplaceTempView("t1") > // Make a table and ensure it will be broadcast. > sql("""CREATE TABLE smallTable(word string, number int) > |ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' > |STORED AS TEXTFILE > """.stripMargin) > sql( > """INSERT INTO smallTable > |SELECT word, number from t1 > """.stripMargin) > val inputData = MemoryStream[Int] > val joined = inputData.toDS().toDF() > .join(spark.table("smallTable"), $"value" === $"number") > val sq = joined.writeStream > .format("memory") > .queryName("t2") > .start() > try { > inputData.addData(1, 2) > sq.processAllAvailable() > } finally { > sq.stop() > } > {code} > If someone creates a HiveSession, the planner in `IncrementalExecution` > doesn't take into account the Hive scan strategies -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20889) SparkR grouped documentation for Column methods
[ https://issues.apache.org/jira/browse/SPARK-20889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066910#comment-16066910 ] Apache Spark commented on SPARK-20889: -- User 'actuaryzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/18458 > SparkR grouped documentation for Column methods > --- > > Key: SPARK-20889 > URL: https://issues.apache.org/jira/browse/SPARK-20889 > Project: Spark > Issue Type: Documentation > Components: SparkR >Affects Versions: 2.1.1 >Reporter: Wayne Zhang >Assignee: Wayne Zhang > Labels: documentation > Fix For: 2.3.0 > > > Group the documentation of individual methods defined for the Column class. > This aims to create the following improvements: > - Centralized documentation for easy navigation (user can view multiple > related methods on one single page). > - Reduced number of items in Seealso. > - Better examples using shared data. This avoids creating a data frame for > each function if they are documented separately. And more importantly, user > can copy and paste to run them directly! > - Cleaner structure and far fewer Rd files (remove a large number of Rd > files). > - Remove duplicated definition of param (since they share exactly the same > argument). > - No need to write meaningless examples for trivial functions (because of > grouping). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21241) Add intercept to StreamingLinearRegressionWithSGD
[ https://issues.apache.org/jira/browse/SPARK-21241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21241: Assignee: Apache Spark > Add intercept to StreamingLinearRegressionWithSGD > - > > Key: SPARK-21241 > URL: https://issues.apache.org/jira/browse/SPARK-21241 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 2.3.0 >Reporter: Soulaimane GUEDRIA >Assignee: Apache Spark > Labels: patch > Fix For: 2.3.0 > > > The StreamingLinearRegressionWithSGD class in PySpark is missing the setIntercept > method, which offers the possibility to turn the intercept on/off. API > parity is not respected between Python and Scala. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21241) Add intercept to StreamingLinearRegressionWithSGD
[ https://issues.apache.org/jira/browse/SPARK-21241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066893#comment-16066893 ] Apache Spark commented on SPARK-21241: -- User 'SoulGuedria' has created a pull request for this issue: https://github.com/apache/spark/pull/18457 > Add intercept to StreamingLinearRegressionWithSGD > - > > Key: SPARK-21241 > URL: https://issues.apache.org/jira/browse/SPARK-21241 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 2.3.0 >Reporter: Soulaimane GUEDRIA > Labels: patch > Fix For: 2.3.0 > > > The StreamingLinearRegressionWithSGD class in PySpark is missing the setIntercept > method, which offers the possibility to turn the intercept on/off. API > parity is not respected between Python and Scala. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21241) Add intercept to StreamingLinearRegressionWithSGD
[ https://issues.apache.org/jira/browse/SPARK-21241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21241: Assignee: (was: Apache Spark) > Add intercept to StreamingLinearRegressionWithSGD > - > > Key: SPARK-21241 > URL: https://issues.apache.org/jira/browse/SPARK-21241 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 2.3.0 >Reporter: Soulaimane GUEDRIA > Labels: patch > Fix For: 2.3.0 > > > The StreamingLinearRegressionWithSGD class in PySpark is missing the setIntercept > method, which offers the possibility to turn the intercept on/off. API > parity is not respected between Python and Scala. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21244) KMeans applied to processed text data clumps almost all documents into one cluster
[ https://issues.apache.org/jira/browse/SPARK-21244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066833#comment-16066833 ] Sean Owen commented on SPARK-21244: --- There's no detail here that suggests a Spark bug. Depending on your docs and your k, this might be correct. > KMeans applied to processed text data clumps almost all documents into one > cluster > - > > Key: SPARK-21244 > URL: https://issues.apache.org/jira/browse/SPARK-21244 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.1 >Reporter: Nassir > > I have observed this problem for quite a while now regarding the > implementation of pyspark KMeans on text documents - to cluster documents > according to their TF-IDF vectors. The pyspark implementation - even on > standard datasets - clusters almost all of the documents into one cluster. > I implemented K-means on the same dataset with the same parameters using the SKlearn > library, and this clusters the documents very well. > I recommend anyone who is able to test the pyspark implementation of KMeans > on text documents - which obviously has a bug in it somewhere. > (currently I am converting my spark dataframe to a pandas dataframe and running k > means and converting back. However, this is of course not a parallel solution > capable of handling huge amounts of data in the future) > Here is a link to the question I posted a while back on Stack Overflow: > https://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point
[ https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066816#comment-16066816 ] Mathieu DESPRIEE commented on SPARK-20082: -- I updated the PR. Basically, here is the approach: - only the Online optimizer is supported; any use with the EM optimizer is rejected. If incremental updating is also desirable for EM, I suggest we open another JIRA for it, to take the time to discuss the initialization with an existing graph and new documents. - I added an {{initialModel}} parameter that is used to initialize the doc concentration and topic matrix. [~yuhaoyan], could you check it, please? > Incremental update of LDA model, by adding initialModel as start point > -- > > Key: SPARK-20082 > URL: https://issues.apache.org/jira/browse/SPARK-20082 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.0 >Reporter: Mathieu DESPRIEE > > Some mllib models support an initialModel to start from and update it > incrementally with new data. > From what I understand of OnlineLDAOptimizer, it is possible to incrementally > update an existing model with batches of new documents. > I suggest to add an initialModel as a start point for LDA. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21244) KMeans applied to processed text data clumps almost all documents into one cluster
[ https://issues.apache.org/jira/browse/SPARK-21244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nassir updated SPARK-21244: --- Description: I have observed this problem for quite a while now regarding the implementation of pyspark KMeans on text documents - to cluster documents according to their TF-IDF vectors. The pyspark implementation - even on standard datasets - clusters almost all of the documents into one cluster. I implemented K-means on the same dataset with the same parameters using the SKlearn library, and this clusters the documents very well. I recommend anyone who is able to test the pyspark implementation of KMeans on text documents - which obviously has a bug in it somewhere. (currently I am converting my spark dataframe to a pandas dataframe and running k means and converting back. However, this is of course not a parallel solution capable of handling huge amounts of data in the future) Here is a link to the question I posted a while back on Stack Overflow: https://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one was: I have observed this problem for quite a while now regarding the implementation of pyspark KMeans on text documents - to cluster documents according to their TF-IDF vectors. The pyspark implementation - even on standard datasets - clusters almost all of the documents into one cluster. I implemented K-means on the same dataset with the same parameters using the SKlearn library, and this clusters the documents very well. I recommend anyone who is able to test the pyspark implementation of KMeans on text documents - which obviously has a bug in it somewhere. (currently I am converting my spark dataframe to a pandas dataframe and running k means and converting back. However, this is of course not a parallel solution capable of handling huge amounts of data in the future) > KMeans applied to processed text data clumps almost all documents into one > cluster > - > > Key: SPARK-21244 > URL: https://issues.apache.org/jira/browse/SPARK-21244 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.1 >Reporter: Nassir > > I have observed this problem for quite a while now regarding the > implementation of pyspark KMeans on text documents - to cluster documents > according to their TF-IDF vectors. The pyspark implementation - even on > standard datasets - clusters almost all of the documents into one cluster. > I implemented K-means on the same dataset with the same parameters using the SKlearn > library, and this clusters the documents very well. > I recommend anyone who is able to test the pyspark implementation of KMeans > on text documents - which obviously has a bug in it somewhere. > (currently I am converting my spark dataframe to a pandas dataframe and running k > means and converting back. However, this is of course not a parallel solution > capable of handling huge amounts of data in the future) > Here is a link to the question I posted a while back on Stack Overflow: > https://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21244) KMeans applied to processed text data clumps almost all documents into one cluster
Nassir created SPARK-21244: -- Summary: KMeans applied to processed text data clumps almost all documents into one cluster Key: SPARK-21244 URL: https://issues.apache.org/jira/browse/SPARK-21244 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.1.1 Reporter: Nassir I have observed this problem for quite a while now regarding the implementation of pyspark KMeans on text documents - to cluster documents according to their TF-IDF vectors. The pyspark implementation - even on standard datasets - clusters almost all of the documents into one cluster. I implemented K-means on the same dataset with the same parameters using the SKlearn library, and this clusters the documents very well. I recommend anyone who is able to test the pyspark implementation of KMeans on text documents - which obviously has a bug in it somewhere. (currently I am converting my spark dataframe to a pandas dataframe and running k means and converting back. However, this is of course not a parallel solution capable of handling huge amounts of data in the future) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20696) tf-idf document clustering with K-means in Apache Spark putting points into one cluster
[ https://issues.apache.org/jira/browse/SPARK-20696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066652#comment-16066652 ] Nassir commented on SPARK-20696: Unfortunately, I have not found a place to make this known to the spark community yet. My workaround has been to convert pyspark dataframe to pandas dataframe, use sklearn python K-Means to cluster documents (which works well), then convert pandas dataframe back to pyspark. It works in my situation as the number of documents I am clustering is relatively small. However, I will want to process Big Data and would need a solution in pyspark with spark streaming in the future. Nassir > tf-idf document clustering with K-means in Apache Spark putting points into > one cluster > --- > > Key: SPARK-20696 > URL: https://issues.apache.org/jira/browse/SPARK-20696 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Nassir > > I am trying to do the classic job of clustering text documents by > pre-processing, generating tf-idf matrix, and then applying K-means. However, > testing this workflow on the classic 20NewsGroup dataset results in most > documents being clustered into one cluster. (I have initially tried to > cluster all documents from 6 of the 20 groups - so expecting to cluster into > 6 clusters). > I am implementing this in Apache Spark as my purpose is to utilise this > technique on millions of documents. Here is the code written in Pyspark on > Databricks: > #declare path to folder containing 6 of 20 news group categories > path = "/mnt/%s/20news-bydate.tar/20new-bydate-train-lessFolders/*/*" % > MOUNT_NAME > #read all the text files from the 6 folders. Each entity is an entire > document. > text_files = sc.wholeTextFiles(path).cache() > #convert rdd to dataframe > df = text_files.toDF(["filePath", "document"]).cache() > from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer > #tokenize the document text > tokenizer = Tokenizer(inputCol="document", outputCol="tokens") > tokenized = tokenizer.transform(df).cache() > from pyspark.ml.feature import StopWordsRemover > remover = StopWordsRemover(inputCol="tokens", > outputCol="stopWordsRemovedTokens") > stopWordsRemoved_df = remover.transform(tokenized).cache() > hashingTF = HashingTF (inputCol="stopWordsRemovedTokens", > outputCol="rawFeatures", numFeatures=20) > tfVectors = hashingTF.transform(stopWordsRemoved_df).cache() > idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5) > idfModel = idf.fit(tfVectors) > tfIdfVectors = idfModel.transform(tfVectors).cache() > #note that I have also tried to use normalized data, but get the same result > from pyspark.ml.feature import Normalizer > from pyspark.ml.linalg import Vectors > normalizer = Normalizer(inputCol="features", outputCol="normFeatures") > l2NormData = normalizer.transform(tfIdfVectors) > from pyspark.ml.clustering import KMeans > # Trains a KMeans model. > kmeans = KMeans().setK(6).setMaxIter(20) > km_model = kmeans.fit(l2NormData) > clustersTable = km_model.transform(l2NormData) > [output showing most documents get clustered into cluster 0][1] > ID number_of_documents_in_cluster > 0 3024 > 3 5 > 1 3 > 5 2 > 2 2 > 4 1 > As you can see most of my data points get clustered into cluster 0, and I > cannot figure out what I am doing wrong as all the tutorials and code I have > come across online point to using this method.
> In addition I have also tried normalizing the tf-idf matrix before K-means > but that also produces the same result. I know cosine distance is a better > measure to use, but I expected using standard K-means in Apache Spark would > provide meaningful results. > Can anyone help with regards to whether I have a bug in my code, or if > something is missing in my data clustering pipeline? > (Question also asked in Stackoverflow before: > http://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one) > Thank you in advance! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21241) Add intercept to StreamingLinearRegressionWithSGD
[ https://issues.apache.org/jira/browse/SPARK-21241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Soulaimane GUEDRIA updated SPARK-21241: --- Description: The StreamingLinearRegressionWithSGD class in PySpark is missing the setIntercept method, which offers the possibility to turn the intercept on/off. API parity is not respected between Python and Scala. (was: The StreamingLinearRegressionWithSGD class in PySpark is missing the setIntercept method, which offers the possibility to turn the intercept on/off. API parity is not achieved with the Scala API.) > Add intercept to StreamingLinearRegressionWithSGD > - > > Key: SPARK-21241 > URL: https://issues.apache.org/jira/browse/SPARK-21241 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 2.3.0 >Reporter: Soulaimane GUEDRIA > Labels: patch > Fix For: 2.3.0 > > > The StreamingLinearRegressionWithSGD class in PySpark is missing the setIntercept > method, which offers the possibility to turn the intercept on/off. API > parity is not respected between Python and Scala. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21241) Add intercept to StreamingLinearRegressionWithSGD
[ https://issues.apache.org/jira/browse/SPARK-21241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Soulaimane GUEDRIA updated SPARK-21241: --- Summary: Add intercept to StreamingLinearRegressionWithSGD (was: Can't add intercept to StreamingLinearRegressionWithSGD) > Add intercept to StreamingLinearRegressionWithSGD > - > > Key: SPARK-21241 > URL: https://issues.apache.org/jira/browse/SPARK-21241 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 2.3.0 >Reporter: Soulaimane GUEDRIA > Labels: patch > Fix For: 2.3.0 > > > The StreamingLinearRegressionWithSGD class in PySpark is missing the setIntercept > method, which offers the possibility to turn the intercept on/off. API > parity is not achieved with the Scala API. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21242) Allow spark executors to function in mesos w/ container networking enabled
[ https://issues.apache.org/jira/browse/SPARK-21242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tara Gildersleeve updated SPARK-21242: -- Priority: Major (was: Minor) > Allow spark executors to function in mesos w/ container networking enabled > -- > > Key: SPARK-21242 > URL: https://issues.apache.org/jira/browse/SPARK-21242 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.1.1 >Reporter: Tara Gildersleeve > Attachments: patch_1.patch > > > Allow spark executors to function in mesos w/ container networking enabled -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21233) Support pluggable offset storage
[ https://issues.apache.org/jira/browse/SPARK-21233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066517#comment-16066517 ] Cody Koeninger commented on SPARK-21233: You already have the choice of where you want to store offsets. http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#storing-offsets > Support pluggable offset storage > > > Key: SPARK-21233 > URL: https://issues.apache.org/jira/browse/SPARK-21233 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 2.0.2, 2.1.1 >Reporter: darion yaphet > > Currently we use *ZooKeeper* to save the *Kafka commit offsets*; when there > are many streaming programs running in the cluster, the ZooKeeper cluster's > load is very high. Maybe ZooKeeper is not very suitable for saving offsets > periodically. > This issue proposes supporting pluggable offset storage to avoid saving offsets in > ZooKeeper -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
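The pattern from the linked guide, sketched for the 0-10 direct stream ({{stream}} is assumed to come from {{KafkaUtils.createDirectStream}}; the datastore write is a placeholder):
{code}
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... persist offsetRanges to a store of your choice, atomically with results ...
  // or hand them back to Kafka itself rather than ZooKeeper:
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
{code}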
[jira] [Created] (SPARK-21243) Limit the number of maps in a single shuffle fetch
Dhruve Ashar created SPARK-21243: Summary: Limit the number of maps in a single shuffle fetch Key: SPARK-21243 URL: https://issues.apache.org/jira/browse/SPARK-21243 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.1.1, 2.1.0 Reporter: Dhruve Ashar Priority: Minor Right now Spark can limit the # of parallel fetches and also limits the amount of data in one fetch, but one fetch to a host could be for 100's of blocks. In one instance we saw 450+ blocks. When you have 100's of those and 1000's of reducers fetching, that becomes a lot of metadata and can run the Node Manager out of memory. We should add a config to limit the # of maps per fetch to reduce the load on the NM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
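For reference, the existing knobs the description alludes to, next to a sketch of the proposed cap (the third property name is illustrative, not an existing config):
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.reducer.maxSizeInFlight", "48m")  // caps bytes in flight per reducer
  .set("spark.reducer.maxReqsInFlight", "64")   // caps concurrent fetch requests
  // hypothetical new knob: cap the number of map blocks in a single fetch
  .set("spark.reducer.maxBlocksInFlightPerAddress", "256")
{code}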
[jira] [Updated] (SPARK-21242) Allow spark executors to function in mesos w/ container networking enabled
[ https://issues.apache.org/jira/browse/SPARK-21242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tara Gildersleeve updated SPARK-21242: -- Attachment: patch_1.patch > Allow spark executors to function in mesos w/ container networking enabled > -- > > Key: SPARK-21242 > URL: https://issues.apache.org/jira/browse/SPARK-21242 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.1.1 >Reporter: Tara Gildersleeve >Priority: Minor > Attachments: patch_1.patch > > > Allow spark executors to function in mesos w/ container networking enabled -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21242) Allow spark executors to function in mesos w/ container networking enabled
Tara Gildersleeve created SPARK-21242: - Summary: Allow spark executors to function in mesos w/ container networking enabled Key: SPARK-21242 URL: https://issues.apache.org/jira/browse/SPARK-21242 Project: Spark Issue Type: New Feature Components: Mesos Affects Versions: 2.1.1 Reporter: Tara Gildersleeve Priority: Minor Allow spark executors to function in mesos w/ container networking enabled -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21241) Can't add intercept to StreamingLinearRegressionWithSGD
Soulaimane GUEDRIA created SPARK-21241: -- Summary: Can't add intercept to StreamingLinearRegressionWithSGD Key: SPARK-21241 URL: https://issues.apache.org/jira/browse/SPARK-21241 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 2.3.0 Reporter: Soulaimane GUEDRIA Fix For: 2.3.0 The StreamingLinearRegressionWithSGD class in PySpark is missing the setIntercept method, which offers the possibility to turn the intercept on/off. API parity is not achieved with the Scala API. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21227) Unicode in Json field causes AnalysisException when selecting from Dataframe
[ https://issues.apache.org/jira/browse/SPARK-21227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066471#comment-16066471 ] Sean Owen commented on SPARK-21227: --- Yes, I think this is ultimately related to two different locales disagreeing on what the lower- or upper-case of the string is. This Turkish character is the most common trigger. We addressed this in Spark a while ago. However, we left most any string that is generated by a user application untouched, on the assumption they may want locale-specific behavior (and to lessen the scope of the change). This behavior isn't expected, even if it's a somewhat contrived example you can work around. Something is doing a locale-insensitive toLowerCase, which maybe shouldn't. > Unicode in Json field causes AnalysisException when selecting from Dataframe > > > Key: SPARK-21227 > URL: https://issues.apache.org/jira/browse/SPARK-21227 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Seydou Dia > > Hi, > please find below the step to reproduce the issue I am facing. > First I create a json with 2 fields: > * city_name > * cıty_name > The first one is valid ascii, while the second contains a unicode (ı, i > without dot ). > When I try to select from the dataframe I have an {noformat} > AnalysisException {noformat}. > {code:python} > $ pyspark > Python 3.4.3 (default, Sep 1 2016, 23:33:38) > [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux > Type "help", "copyright", "credits" or "license" for more information. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4040. > Attempting port 4041. > 17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4041. > Attempting port 4042. > 17/06/27 12:29:08 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive > is set, falling back to uploading libraries under SPARK_HOME. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 2.1.0 > /_/ > Using Python version 3.4.3 (default, Sep 1 2016 23:33:38) > SparkSession available as 'spark'. > >>> sc=spark.sparkContext > >>> js = ['{"city_name": "paris"}' > ... , '{"city_name": "rome"}' > ... , '{"city_name": "berlin"}' > ... , '{"cıty_name": "new-york"}' > ... , '{"cıty_name": "toronto"}' > ... , '{"cıty_name": "chicago"}' > ... , '{"cıty_name": "dubai"}'] > >>> myRDD = sc.parallelize(js) > >>> myDF = spark.read.json(myRDD) > >>> myDF.printSchema() > >>> > root > |-- city_name: string (nullable = true) > |-- cıty_name: string (nullable = true) > >>> myDF.select(myDF['city_name']) > Traceback (most recent call last): > File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco > return f(*a, **kw) > File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line > 319, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o226.apply.
> : org.apache.spark.sql.AnalysisException: Reference 'city_name' is ambiguous, > could be: city_name#29, city_name#30.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:264) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168) > at org.apache.spark.sql.Dataset.resolve(Dataset.scala:217) > at org.apache.spark.sql.Dataset.col(Dataset.scala:1073) > at org.apache.spark.sql.Dataset.apply(Dataset.scala:1059) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Thread.java:745) > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "", line 1, in > File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 943, in > __getitem__ > jc = self._jdf.apply(item) > File "/u
[jira] [Assigned] (SPARK-21228) InSet incorrect handling of structs
[ https://issues.apache.org/jira/browse/SPARK-21228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21228: Assignee: Apache Spark > InSet incorrect handling of structs > --- > > Key: SPARK-21228 > URL: https://issues.apache.org/jira/browse/SPARK-21228 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Bogdan Raducanu >Assignee: Apache Spark > > In InSet it's possible that hset contains GenericInternalRows while child > returns UnsafeRows (and vice versa). InSet uses hset.contains (both in > doCodeGen and eval) which will always be false in this case. > The following code reproduces the problem: > {code} > spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the > default is 10 which requires a longer query text to repro > spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as > a").createOrReplaceTempView("A") > sql("select * from (select min(a) as minA from A) A where minA in > (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', > 2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return > UnsafeRows while the list of structs that will become hset will be > GenericInternalRows > ++ > |minA| > ++ > ++ > {code} > In.doCodeGen uses compareStructs and seems to work. In.eval might not work > but not sure how to reproduce. > {code} > spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it > will not use InSet > sql("select * from (select min(a) as minA from A) A where minA in > (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', > 2L),named_struct('a', 3L, 'b', 3L))").show > +-+ > | minA| > +-+ > |[1,1]| > +-+ > {code} > Solution could be either to do safe<->unsafe conversion in InSet or not > trigger InSet optimization at all in this case. > Need to investigate if In.eval is affected. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21228) InSet incorrect handling of structs
[ https://issues.apache.org/jira/browse/SPARK-21228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21228: Assignee: (was: Apache Spark) > InSet incorrect handling of structs > --- > > Key: SPARK-21228 > URL: https://issues.apache.org/jira/browse/SPARK-21228 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Bogdan Raducanu > > In InSet it's possible that hset contains GenericInternalRows while child > returns UnsafeRows (and vice versa). InSet uses hset.contains (both in > doCodeGen and eval) which will always be false in this case. > The following code reproduces the problem: > {code} > spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the > default is 10 which requires a longer query text to repro > spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as > a").createOrReplaceTempView("A") > sql("select * from (select min(a) as minA from A) A where minA in > (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', > 2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return > UnsafeRows while the list of structs that will become hset will be > GenericInternalRows > ++ > |minA| > ++ > ++ > {code} > In.doCodeGen uses compareStructs and seems to work. In.eval might not work > but not sure how to reproduce. > {code} > spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it > will not use InSet > sql("select * from (select min(a) as minA from A) A where minA in > (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', > 2L),named_struct('a', 3L, 'b', 3L))").show > +-+ > | minA| > +-+ > |[1,1]| > +-+ > {code} > Solution could be either to do safe<->unsafe conversion in InSet or not > trigger InSet optimization at all in this case. > Need to investigate if In.eval is affected. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21228) InSet incorrect handling of structs
[ https://issues.apache.org/jira/browse/SPARK-21228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066457#comment-16066457 ] Apache Spark commented on SPARK-21228: -- User 'bogdanrdc' has created a pull request for this issue: https://github.com/apache/spark/pull/18455 > InSet incorrect handling of structs > --- > > Key: SPARK-21228 > URL: https://issues.apache.org/jira/browse/SPARK-21228 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Bogdan Raducanu > > In InSet it's possible that hset contains GenericInternalRows while child > returns UnsafeRows (and vice versa). InSet uses hset.contains (both in > doCodeGen and eval) which will always be false in this case. > The following code reproduces the problem: > {code} > spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the > default is 10 which requires a longer query text to repro > spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as > a").createOrReplaceTempView("A") > sql("select * from (select min(a) as minA from A) A where minA in > (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', > 2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return > UnsafeRows while the list of structs that will become hset will be > GenericInternalRows > ++ > |minA| > ++ > ++ > {code} > In.doCodeGen uses compareStructs and seems to work. In.eval might not work > but not sure how to reproduce. > {code} > spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it > will not use InSet > sql("select * from (select min(a) as minA from A) A where minA in > (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', > 2L),named_struct('a', 3L, 'b', 3L))").show > +-+ > | minA| > +-+ > |[1,1]| > +-+ > {code} > Solution could be either to do safe<->unsafe conversion in InSet or not > trigger InSet optimization at all in this case. > Need to investigate if In.eval is affected. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21227) Unicode in Json field causes AnalysisException when selecting from Dataframe
[ https://issues.apache.org/jira/browse/SPARK-21227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066423#comment-16066423 ] Hyukjin Kwon commented on SPARK-21227: -- I took a quick look and the cause seems to be here - https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/package.scala#L36. It appears that this compares both strings ignoring the Turkish locale. I was thinking of doing a toLowerCase comparison, but I guess we should investigate whether such a change introduces other corner cases of behaviour changes. > Unicode in Json field causes AnalysisException when selecting from Dataframe > > > Key: SPARK-21227 > URL: https://issues.apache.org/jira/browse/SPARK-21227 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Seydou Dia > > Hi, > please find below the step to reproduce the issue I am facing. > First I create a json with 2 fields: > * city_name > * cıty_name > The first one is valid ascii, while the second contains a unicode (ı, i > without dot ). > When I try to select from the dataframe I have an {noformat} > AnalysisException {noformat}. > {code:python} > $ pyspark > Python 3.4.3 (default, Sep 1 2016, 23:33:38) > [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux > Type "help", "copyright", "credits" or "license" for more information. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4040. > Attempting port 4041. > 17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4041. > Attempting port 4042. > 17/06/27 12:29:08 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive > is set, falling back to uploading libraries under SPARK_HOME. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 2.1.0 > /_/ > Using Python version 3.4.3 (default, Sep 1 2016 23:33:38) > SparkSession available as 'spark'. > >>> sc=spark.sparkContext > >>> js = ['{"city_name": "paris"}' > ... , '{"city_name": "rome"}' > ... , '{"city_name": "berlin"}' > ... , '{"cıty_name": "new-york"}' > ... , '{"cıty_name": "toronto"}' > ... , '{"cıty_name": "chicago"}' > ... , '{"cıty_name": "dubai"}'] > >>> myRDD = sc.parallelize(js) > >>> myDF = spark.read.json(myRDD) > >>> myDF.printSchema() > >>> > root > |-- city_name: string (nullable = true) > |-- cıty_name: string (nullable = true) > >>> myDF.select(myDF['city_name']) > Traceback (most recent call last): > File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco > return f(*a, **kw) > File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line > 319, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o226.apply.
> : org.apache.spark.sql.AnalysisException: Reference 'city_name' is ambiguous, > could be: city_name#29, city_name#30.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:264) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168) > at org.apache.spark.sql.Dataset.resolve(Dataset.scala:217) > at org.apache.spark.sql.Dataset.col(Dataset.scala:1073) > at org.apache.spark.sql.Dataset.apply(Dataset.scala:1059) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Thread.java:745) > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "", line 1, in > File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 943, in > __getitem__ > jc = self._jdf.apply(item) > File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", > line 1133, in __call__ > File "/usr/lib/spark/python/pyspark/sql
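To make the collision above concrete: under JDK string semantics, equalsIgnoreCase compares characters pairwise via toUpperCase, and U+0131 (dotless ı) uppercases to a plain 'I', so the two field names compare equal, while a Locale.ROOT toLowerCase comparison keeps them distinct (the direction suggested in the comment). A small self-contained check:

{code:scala}
object DotlessI {
  def main(args: Array[String]): Unit = {
    val ascii   = "city_name"
    val dotless = "c\u0131ty_name" // "cıty_name", U+0131 LATIN SMALL LETTER DOTLESS I

    // true: equalsIgnoreCase conflates the two names, hence the ambiguity
    println(ascii.equalsIgnoreCase(dotless))

    // false: a locale-independent lower-casing keeps them distinct
    println(ascii.toLowerCase(java.util.Locale.ROOT) ==
            dotless.toLowerCase(java.util.Locale.ROOT))
  }
}
{code}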
[jira] [Commented] (SPARK-17091) Convert IN predicate to equivalent Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066363#comment-16066363 ] Michael Styles commented on SPARK-17091: -- In Parquet 1.7, there was a bug involving corrupt statistics on binary columns (https://issues.apache.org/jira/browse/PARQUET-251). This bug prevented earlier versions of Spark from generating Parquet filters on any string columns. Spark 2.1 has moved up to Parquet 1.8.2, so this issue no longer exists. > Convert IN predicate to equivalent Parquet filter > - > > Key: SPARK-17091 > URL: https://issues.apache.org/jira/browse/SPARK-17091 > Project: Spark > Issue Type: Bug >Reporter: Andrew Duffy > Attachments: IN Predicate.png, OR Predicate.png > > > Past attempts at pushing down the InSet operation for Parquet relied on > user-defined predicates. It would be simpler to rewrite an IN clause into the > corresponding OR union of a set of equality conditions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21240) Fix code style for constructing and stopping a SparkContext in UT
[ https://issues.apache.org/jira/browse/SPARK-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21240: Assignee: Apache Spark > Fix code style for constructing and stopping a SparkContext in UT > - > > Key: SPARK-21240 > URL: https://issues.apache.org/jira/browse/SPARK-21240 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: jin xing >Assignee: Apache Spark >Priority: Trivial > > Related to SPARK-20985. > Fix code style for constructing and stopping a SparkContext. Ensure the > context is stopped, so that other tests do not complain that only one > SparkContext can exist at a time. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21240) Fix code style for constructing and stopping a SparkContext in UT
[ https://issues.apache.org/jira/browse/SPARK-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066360#comment-16066360 ] Apache Spark commented on SPARK-21240: -- User 'jinxing64' has created a pull request for this issue: https://github.com/apache/spark/pull/18454 > Fix code style for constructing and stopping a SparkContext in UT > - > > Key: SPARK-21240 > URL: https://issues.apache.org/jira/browse/SPARK-21240 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: jin xing >Priority: Trivial > > Related to SPARK-20985. > Fix code style for constructing and stopping a SparkContext. Ensure the > context is stopped, so that other tests do not complain that only one > SparkContext can exist at a time. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21240) Fix code style for constructing and stopping a SparkContext in UT
[ https://issues.apache.org/jira/browse/SPARK-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21240: Assignee: (was: Apache Spark) > Fix code style for constructing and stopping a SparkContext in UT > - > > Key: SPARK-21240 > URL: https://issues.apache.org/jira/browse/SPARK-21240 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: jin xing >Priority: Trivial > > Related to SPARK-20985. > Fix code style for constructing and stopping a SparkContext. Ensure the > context is stopped, so that other tests do not complain that only one > SparkContext can exist at a time. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21240) Fix code style for constructing and stopping a SparkContext in UT
jin xing created SPARK-21240: Summary: Fix code style for constructing and stopping a SparkContext in UT Key: SPARK-21240 URL: https://issues.apache.org/jira/browse/SPARK-21240 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.1 Reporter: jin xing Priority: Trivial Related to SPARK-20985. Fix code style for constructing and stopping a SparkContext. Ensure the context is stopped, so that other tests do not complain that only one SparkContext can exist at a time. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
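As a sketch of the test pattern the issue is asking for (the helper name is made up; Spark's own test suites have their variants, such as LocalSparkContext): construct the context, run the body, and stop the context in a finally block so a failing test cannot leak the single allowed SparkContext into later suites.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

def withSparkContext(conf: SparkConf)(body: SparkContext => Unit): Unit = {
  val sc = new SparkContext(conf)
  try {
    body(sc)
  } finally {
    sc.stop() // guarantees cleanup even when an assertion in the body throws
  }
}

// usage:
// withSparkContext(new SparkConf().setMaster("local").setAppName("test")) { sc =>
//   assert(sc.parallelize(1 to 10).count() == 10)
// }
{code}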
[jira] [Comment Edited] (SPARK-21223) Thread-safety issue in FsHistoryProvider
[ https://issues.apache.org/jira/browse/SPARK-21223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066319#comment-16066319 ] zenglinxi edited comment on SPARK-21223 at 6/28/17 10:42 AM: - [~sowen] OK, I will check SPARK-21078 first. was (Author: gostop_zlx): OK, I will check SPARK-21078 first. > Thread-safety issue in FsHistoryProvider > - > > Key: SPARK-21223 > URL: https://issues.apache.org/jira/browse/SPARK-21223 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: zenglinxi > > Currently, Spark HistoryServer uses a HashMap named fileToAppInfo in class > FsHistoryProvider to store the mapping from event log path to attemptInfo. > When a ThreadPool is used to replay the log files in the list and merge the list of > old applications with new ones, multiple threads may update fileToAppInfo at the > same time, which may cause thread-safety issues. > {code:java} > for (file <- logInfos) { >tasks += replayExecutor.submit(new Runnable { > override def run(): Unit = mergeApplicationListing(file) > }) > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21223) Thread-safety issue in FsHistoryProvider
[ https://issues.apache.org/jira/browse/SPARK-21223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066319#comment-16066319 ] zenglinxi commented on SPARK-21223: --- OK, I will check SPARK-21078 first. > Thread-safety issue in FsHistoryProvider > - > > Key: SPARK-21223 > URL: https://issues.apache.org/jira/browse/SPARK-21223 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: zenglinxi > > Currently, Spark HistoryServer uses a HashMap named fileToAppInfo in class > FsHistoryProvider to store the mapping from event log path to attemptInfo. > When a ThreadPool is used to replay the log files in the list and merge the list of > old applications with new ones, multiple threads may update fileToAppInfo at the > same time, which may cause thread-safety issues. > {code:java} > for (file <- logInfos) { >tasks += replayExecutor.submit(new Runnable { > override def run(): Unit = mergeApplicationListing(file) > }) > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
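One obvious shape for a fix, sketched here with hypothetical stand-in types (the real FsHistoryProvider types differ): replace the plain HashMap with a concurrent map so the parallel replay tasks can publish results without external locking.

{code:scala}
import java.util.concurrent.ConcurrentHashMap

object ListingState {
  // Hypothetical stand-in for the history server's attempt metadata.
  case class AttemptInfo(appId: String, lastUpdated: Long)

  // Tolerates simultaneous put() calls from the replayExecutor threads
  // shown in the snippet above.
  val fileToAppInfo = new ConcurrentHashMap[String, AttemptInfo]()

  def mergeApplicationListing(logPath: String): Unit =
    fileToAppInfo.put(logPath, AttemptInfo("app-" + logPath, System.currentTimeMillis()))
}
{code}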
[jira] [Assigned] (SPARK-19852) StringIndexer.setHandleInvalid should have another option 'new': Python API and docs
[ https://issues.apache.org/jira/browse/SPARK-19852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-19852: --- Assignee: Vincent > StringIndexer.setHandleInvalid should have another option 'new': Python API > and docs > > > Key: SPARK-19852 > URL: https://issues.apache.org/jira/browse/SPARK-19852 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Assignee: Vincent >Priority: Minor > > Update Python API for StringIndexer so setHandleInvalid doc is correct. This > will probably require: > * putting HandleInvalid within StringIndexer to update its built-in doc (See > Bucketizer for an example.) > * updating API docs and maybe the guide -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19852) StringIndexer.setHandleInvalid should have another option 'new': Python API and docs
[ https://issues.apache.org/jira/browse/SPARK-19852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066289#comment-16066289 ] Apache Spark commented on SPARK-19852: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/18453 > StringIndexer.setHandleInvalid should have another option 'new': Python API > and docs > > > Key: SPARK-19852 > URL: https://issues.apache.org/jira/browse/SPARK-19852 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Update Python API for StringIndexer so setHandleInvalid doc is correct. This > will probably require: > * putting HandleInvalid within StringIndexer to update its built-in doc (See > Bucketizer for an example.) > * updating API docs and maybe the guide -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21239) Support WAL recover in windows
[ https://issues.apache.org/jira/browse/SPARK-21239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21239: Assignee: Apache Spark > Support WAL recover in windows > -- > > Key: SPARK-21239 > URL: https://issues.apache.org/jira/browse/SPARK-21239 > Project: Spark > Issue Type: Bug > Components: DStreams, Windows >Affects Versions: 1.6.3, 2.1.0, 2.1.1, 2.2.0 >Reporter: Yun Tang >Assignee: Apache Spark > Fix For: 2.1.2, 2.2.1 > > > When the driver fails over, it will read the WAL from HDFS by calling > WriteAheadLogBackedBlockRDD.getBlockFromWriteAheadLog(); however, it needs a > dummy local path to satisfy the method parameter requirements, but on Windows > the path will contain a colon, which is not valid for Hadoop. I removed the > potential drive letter and colon. > I found one email from spark-user that talked about this bug > (https://www.mail-archive.com/user@spark.apache.org/msg55030.html) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21239) Support WAL recover in windows
[ https://issues.apache.org/jira/browse/SPARK-21239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066282#comment-16066282 ] Apache Spark commented on SPARK-21239: -- User 'Myasuka' has created a pull request for this issue: https://github.com/apache/spark/pull/18452 > Support WAL recover in windows > -- > > Key: SPARK-21239 > URL: https://issues.apache.org/jira/browse/SPARK-21239 > Project: Spark > Issue Type: Bug > Components: DStreams, Windows >Affects Versions: 1.6.3, 2.1.0, 2.1.1, 2.2.0 >Reporter: Yun Tang > Fix For: 2.1.2, 2.2.1 > > > When the driver fails over, it will read the WAL from HDFS by calling > WriteAheadLogBackedBlockRDD.getBlockFromWriteAheadLog(); however, it needs a > dummy local path to satisfy the method parameter requirements, but on Windows > the path will contain a colon, which is not valid for Hadoop. I removed the > potential drive letter and colon. > I found one email from spark-user that talked about this bug > (https://www.mail-archive.com/user@spark.apache.org/msg55030.html) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21239) Support WAL recover in windows
[ https://issues.apache.org/jira/browse/SPARK-21239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21239: Assignee: (was: Apache Spark) > Support WAL recover in windows > -- > > Key: SPARK-21239 > URL: https://issues.apache.org/jira/browse/SPARK-21239 > Project: Spark > Issue Type: Bug > Components: DStreams, Windows >Affects Versions: 1.6.3, 2.1.0, 2.1.1, 2.2.0 >Reporter: Yun Tang > Fix For: 2.1.2, 2.2.1 > > > When the driver fails over, it will read the WAL from HDFS by calling > WriteAheadLogBackedBlockRDD.getBlockFromWriteAheadLog(); however, it needs a > dummy local path to satisfy the method parameter requirements, but on Windows > the path will contain a colon, which is not valid for Hadoop. I removed the > potential drive letter and colon. > I found one email from spark-user that talked about this bug > (https://www.mail-archive.com/user@spark.apache.org/msg55030.html) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21239) Support WAL recover in windows
[ https://issues.apache.org/jira/browse/SPARK-21239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yun Tang updated SPARK-21239: - Description: When the driver fails over, it will read the WAL from HDFS by calling WriteAheadLogBackedBlockRDD.getBlockFromWriteAheadLog(); however, it needs a dummy local path to satisfy the method parameter requirements, but on Windows the path will contain a colon, which is not valid for Hadoop. I removed the potential drive letter and colon. I found one email from spark-user that talked about this bug (https://www.mail-archive.com/user@spark.apache.org/msg55030.html) was: When the driver fails over, it will read the WAL from HDFS by calling WriteAheadLogBackedBlockRDD.getBlockFromWriteAheadLog(); however, it needs a dummy local path to satisfy the method parameter requirements, but on Windows the path will contain a colon, which is not valid for Hadoop. I removed the potential drive letter and colon. I found one email from spark-user that talked about [this bug](https://www.mail-archive.com/user@spark.apache.org/msg55030.html) > Support WAL recover in windows > -- > > Key: SPARK-21239 > URL: https://issues.apache.org/jira/browse/SPARK-21239 > Project: Spark > Issue Type: Bug > Components: DStreams, Windows >Affects Versions: 1.6.3, 2.1.0, 2.1.1, 2.2.0 >Reporter: Yun Tang > Fix For: 2.1.2, 2.2.1 > > > When the driver fails over, it will read the WAL from HDFS by calling > WriteAheadLogBackedBlockRDD.getBlockFromWriteAheadLog(); however, it needs a > dummy local path to satisfy the method parameter requirements, but on Windows > the path will contain a colon, which is not valid for Hadoop. I removed the > potential drive letter and colon. > I found one email from spark-user that talked about this bug > (https://www.mail-archive.com/user@spark.apache.org/msg55030.html) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21239) Support WAL recover in windows
Yun Tang created SPARK-21239: Summary: Support WAL recover in windows Key: SPARK-21239 URL: https://issues.apache.org/jira/browse/SPARK-21239 Project: Spark Issue Type: Bug Components: DStreams, Windows Affects Versions: 2.1.1, 2.1.0, 1.6.3, 2.2.0 Reporter: Yun Tang Fix For: 2.1.2, 2.2.1 When the driver fails over, it will read the WAL from HDFS by calling WriteAheadLogBackedBlockRDD.getBlockFromWriteAheadLog(); however, it needs a dummy local path to satisfy the method parameter requirements, but on Windows the path will contain a colon, which is not valid for Hadoop. I removed the potential drive letter and colon. I found one email from spark-user that talked about [this bug](https://www.mail-archive.com/user@spark.apache.org/msg55030.html) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
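A sketch of the normalization the description refers to, under the assumption that the dummy path is only a placeholder whose drive letter carries no information. The helper is hypothetical, not the code in pull request 18452:

{code:scala}
// Strip a leading Windows drive spec ("C:") and normalize separators, so the
// resulting dummy path contains no colon for Hadoop's Path to reject.
def stripDriveLetter(path: String): String =
  path.replaceFirst("^[A-Za-z]:", "").replace('\\', '/')

// stripDriveLetter("C:\\tmp\\spark\\wal") == "/tmp/spark/wal"
{code}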
[jira] [Assigned] (SPARK-17091) Convert IN predicate to equivalent Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17091: Assignee: (was: Apache Spark) > Convert IN predicate to equivalent Parquet filter > - > > Key: SPARK-17091 > URL: https://issues.apache.org/jira/browse/SPARK-17091 > Project: Spark > Issue Type: Bug >Reporter: Andrew Duffy > Attachments: IN Predicate.png, OR Predicate.png > > > Past attempts at pushing down the InSet operation for Parquet relied on > user-defined predicates. It would be simpler to rewrite an IN clause into the > corresponding OR union of a set of equality conditions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17091) Convert IN predicate to equivalent Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066222#comment-16066222 ] Apache Spark commented on SPARK-17091: -- User 'ptkool' has created a pull request for this issue: https://github.com/apache/spark/pull/18424 > Convert IN predicate to equivalent Parquet filter > - > > Key: SPARK-17091 > URL: https://issues.apache.org/jira/browse/SPARK-17091 > Project: Spark > Issue Type: Bug >Reporter: Andrew Duffy > Attachments: IN Predicate.png, OR Predicate.png > > > Past attempts at pushing down the InSet operation for Parquet relied on > user-defined predicates. It would be simpler to rewrite an IN clause into the > corresponding OR union of a set of equality conditions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17091) Convert IN predicate to equivalent Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17091: Assignee: Apache Spark > Convert IN predicate to equivalent Parquet filter > - > > Key: SPARK-17091 > URL: https://issues.apache.org/jira/browse/SPARK-17091 > Project: Spark > Issue Type: Bug >Reporter: Andrew Duffy >Assignee: Apache Spark > Attachments: IN Predicate.png, OR Predicate.png > > > Past attempts at pushing down the InSet operation for Parquet relied on > user-defined predicates. It would be simpler to rewrite an IN clause into the > corresponding OR union of a set of equality conditions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21233) Support pluggable offset storage
[ https://issues.apache.org/jira/browse/SPARK-21233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066179#comment-16066179 ] darion yaphet edited comment on SPARK-21233 at 6/28/17 9:14 AM: Hi [Sean|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=srowen], in Kafka 0.8 it's using zkClient to commit offsets into the ZooKeeper cluster. It seems Kafka 0.10+ could save offsets in a topic. I wish to add some config items to control the storage instance and other parameters. was (Author: darion): [Sean|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=srowen] In Kafka 0.8 it's using zkClient to commit offsets into the ZooKeeper cluster. It seems Kafka 0.10+ could save offsets in a topic. I wish to add some config items to control the storage instance and other parameters. > Support pluggable offset storage > > > Key: SPARK-21233 > URL: https://issues.apache.org/jira/browse/SPARK-21233 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 2.0.2, 2.1.1 >Reporter: darion yaphet > > Currently we are using *ZooKeeper* to save the *Kafka Commit Offset*; when there > are a lot of streaming programs running in the cluster, the ZooKeeper cluster's > load is very high. Maybe ZooKeeper is not very suitable for saving offsets > periodically. > This issue is a wish to support a pluggable offset storage, to avoid saving it in > ZooKeeper -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21233) Support pluggable offset storage
[ https://issues.apache.org/jira/browse/SPARK-21233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066179#comment-16066179 ] darion yaphet edited comment on SPARK-21233 at 6/28/17 9:13 AM: [Sean|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=srowen] In Kafka 0.8 it's using zkClient to commit offsets into the ZooKeeper cluster. It seems Kafka 0.10+ could save offsets in a topic. I wish to add some config items to control the storage instance and other parameters. was (Author: darion): [Sean|sro...@gmail.com] In Kafka 0.8 it's using zkClient to commit offsets into the ZooKeeper cluster. It seems Kafka 0.10+ could save offsets in a topic. I wish to add some config items to control the storage instance and other parameters. > Support pluggable offset storage > > > Key: SPARK-21233 > URL: https://issues.apache.org/jira/browse/SPARK-21233 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 2.0.2, 2.1.1 >Reporter: darion yaphet > > Currently we are using *ZooKeeper* to save the *Kafka Commit Offset*; when there > are a lot of streaming programs running in the cluster, the ZooKeeper cluster's > load is very high. Maybe ZooKeeper is not very suitable for saving offsets > periodically. > This issue is a wish to support a pluggable offset storage, to avoid saving it in > ZooKeeper -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21233) Support pluggable offset storage
[ https://issues.apache.org/jira/browse/SPARK-21233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066179#comment-16066179 ] darion yaphet edited comment on SPARK-21233 at 6/28/17 9:13 AM: [Sean|sro...@gmail.com] In Kafka 0.8 it's using zkClient to commit offsets into the ZooKeeper cluster. It seems Kafka 0.10+ could save offsets in a topic. I wish to add some config items to control the storage instance and other parameters. was (Author: darion): [Sean Owen|sro...@gmail.com] In Kafka 0.8 it's using zkClient to commit offsets into the ZooKeeper cluster. It seems Kafka 0.10+ could save offsets in a topic. I wish to add some config items to control the storage instance and other parameters. > Support pluggable offset storage > > > Key: SPARK-21233 > URL: https://issues.apache.org/jira/browse/SPARK-21233 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 2.0.2, 2.1.1 >Reporter: darion yaphet > > Currently we are using *ZooKeeper* to save the *Kafka Commit Offset*; when there > are a lot of streaming programs running in the cluster, the ZooKeeper cluster's > load is very high. Maybe ZooKeeper is not very suitable for saving offsets > periodically. > This issue is a wish to support a pluggable offset storage, to avoid saving it in > ZooKeeper -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21137) Spark reads many small files slowly off local filesystem
[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-21137: --- Summary: Spark reads many small files slowly off local filesystem (was: Spark reads many small files slowly) > Spark reads many small files slowly off local filesystem > > > Key: SPARK-21137 > URL: https://issues.apache.org/jira/browse/SPARK-21137 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: sam >Priority: Minor > > A very common use case in big data is to read a large number of small files. > For example the Enron email dataset has 1,227,645 small files. > When one tries to read this data using Spark one will hit many issues. > Firstly, even if the data is small (each file is only, say, 1 KB) any job can take a > very long time (I have a simple job that has been running for 3 hours and has > not yet got to the point of starting any tasks; I doubt it will ever > finish). > It seems all the code in Spark that manages file listing is single-threaded > and not well optimised. When I hand-crank the code and don't use Spark, my > job runs much faster. > Is it possible that I'm missing some configuration option? It seems kinda > surprising to me that Spark cannot read Enron data given that it's such a > quintessential example. > So it takes 1 hour to output a line "1,227,645 input paths to process"; it > then takes another hour to output the same line. Then it outputs a CSV of all > the input paths (creating a text storm). > Now it's been stuck on the following: > {code} > 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo > library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] > {code} > for 2.5 hours. > So I've provided full reproduction steps here (including code and cluster setup) > https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can > easily just clone, and follow the README to reproduce exactly! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
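Not a fix for the hang described above, but one knob worth trying when listing huge numbers of files through a Hadoop InputFormat: FileInputFormat supports multi-threaded status listing. A hedged sketch; whether the reporter's job even reaches this code path is unverified:

{code:scala}
import org.apache.spark.SparkContext

def enableParallelListing(sc: SparkContext, threads: Int = 32): Unit = {
  // Consulted by Hadoop's FileInputFormat.listStatus when computing input splits.
  sc.hadoopConfiguration.setInt(
    "mapreduce.input.fileinputformat.list-status.num-threads", threads)
}
{code}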
[jira] [Commented] (SPARK-21233) Support pluggable offset storage
[ https://issues.apache.org/jira/browse/SPARK-21233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066179#comment-16066179 ] darion yaphet commented on SPARK-21233: --- [Sean Owen|sro...@gmail.com] In Kafka 0.8 it's using zkClient to commit offsets into the ZooKeeper cluster. It seems Kafka 0.10+ could save offsets in a topic. I wish to add some config items to control the storage instance and other parameters. > Support pluggable offset storage > > > Key: SPARK-21233 > URL: https://issues.apache.org/jira/browse/SPARK-21233 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 2.0.2, 2.1.1 >Reporter: darion yaphet > > Currently we are using *ZooKeeper* to save the *Kafka Commit Offset*; when there > are a lot of streaming programs running in the cluster, the ZooKeeper cluster's > load is very high. Maybe ZooKeeper is not very suitable for saving offsets > periodically. > This issue is a wish to support a pluggable offset storage, to avoid saving it in > ZooKeeper -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
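For the record, one hypothetical shape of what "pluggable offset storage" could mean; none of these names exist in Spark, and the design here is invented for illustration only:

{code:scala}
// A minimal plugin interface: implementations might write to ZooKeeper,
// Kafka's __consumer_offsets topic, HBase, a relational database, etc.
trait OffsetStore {
  def commit(group: String, topic: String, partition: Int, offset: Long): Unit
  def fetch(group: String, topic: String, partition: Int): Option[Long]
}

// An in-memory toy implementation, suitable only for tests.
class InMemoryOffsetStore extends OffsetStore {
  private val state = scala.collection.concurrent.TrieMap.empty[(String, String, Int), Long]
  def commit(group: String, topic: String, partition: Int, offset: Long): Unit =
    state.update((group, topic, partition), offset)
  def fetch(group: String, topic: String, partition: Int): Option[Long] =
    state.get((group, topic, partition))
}
{code}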
[jira] [Commented] (SPARK-18004) DataFrame filter Predicate push-down fails for Oracle Timestamp type columns
[ https://issues.apache.org/jira/browse/SPARK-18004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066162#comment-16066162 ] Apache Spark commented on SPARK-18004: -- User 'SharpRay' has created a pull request for this issue: https://github.com/apache/spark/pull/18451 > DataFrame filter Predicate push-down fails for Oracle Timestamp type columns > > > Key: SPARK-18004 > URL: https://issues.apache.org/jira/browse/SPARK-18004 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Suhas Nalapure >Priority: Critical > > DataFrame filter Predicate push-down fails for Oracle Timestamp type columns > with Exception java.sql.SQLDataException: ORA-01861: literal does not match > format string: > Java source code (this code works fine for mysql & mssql databases) : > {noformat} > //DataFrame df = create a DataFrame over an Oracle table > df = df.filter(df.col("TS").lt(new > java.sql.Timestamp(System.currentTimeMillis(; > df.explain(); > df.show(); > {noformat} > Log statements with the Exception: > {noformat} > Schema: root > |-- ID: string (nullable = false) > |-- TS: timestamp (nullable = true) > |-- DEVICE_ID: string (nullable = true) > |-- REPLACEMENT: string (nullable = true) > {noformat} > {noformat} > == Physical Plan == > Filter (TS#1 < 1476861841934000) > +- Scan > JDBCRelation(jdbc:oracle:thin:@10.0.0.111:1521:orcl,ORATABLE,[Lorg.apache.spark.Partition;@78c74647,{user=user, > password=pwd, url=jdbc:oracle:thin:@10.0.0.111:1521:orcl, dbtable=ORATABLE, > driver=oracle.jdbc.driver.OracleDriver})[ID#0,TS#1,DEVICE_ID#2,REPLACEMENT#3] > PushedFilters: [LessThan(TS,2016-10-19 12:54:01.934)] > 2016-10-19 12:54:04,268 ERROR [Executor task launch worker-0] > org.apache.spark.executor.Executor > Exception in task 0.0 in stage 0.0 (TID 0) > java.sql.SQLDataException: ORA-01861: literal does not match format string > at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:461) > at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:402) > at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1065) > at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:681) > at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:256) > at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:577) > at > oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:239) > at > oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:75) > at > oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:1043) > at > oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:) > at > oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1353) > at > oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:4485) > at > oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:4566) > at > oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:5251) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:383) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:359) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
> at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
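To see why the pushed-down literal trips ORA-01861: Oracle parses a bare string compared against a TIMESTAMP column using the session's NLS_TIMESTAMP_FORMAT, while the JDBC timestamp escape (or an explicit TO_TIMESTAMP) is format-independent. A sketch with hypothetical connection details; the actual change in pull request 18451 may differ:

{code:scala}
import java.sql.DriverManager

object OracleTsLiteral {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection(
      "jdbc:oracle:thin:@10.0.0.111:1521:orcl", "user", "pwd")
    val stmt = conn.createStatement()

    // Fails with ORA-01861 unless NLS_TIMESTAMP_FORMAT happens to match:
    //   SELECT * FROM ORATABLE WHERE TS < '2016-10-19 12:54:01.934'

    // Portable: the JDBC timestamp escape is rewritten by the driver.
    val rs = stmt.executeQuery(
      "SELECT * FROM ORATABLE WHERE TS < {ts '2016-10-19 12:54:01.934'}")
    while (rs.next()) println(rs.getString("ID"))
    conn.close()
  }
}
{code}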
[jira] [Commented] (SPARK-21227) Unicode in Json field causes AnalysisException when selecting from Dataframe
[ https://issues.apache.org/jira/browse/SPARK-21227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066151#comment-16066151 ] Seydou Dia commented on SPARK-21227: Hi [~hyukjin.kwon], thanks for confirming this. I happen to be a Python dev and a junior Scala dev. I would love to help on this; with some guidance I think I can help. Best, Seydou > Unicode in Json field causes AnalysisException when selecting from Dataframe > > > Key: SPARK-21227 > URL: https://issues.apache.org/jira/browse/SPARK-21227 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Seydou Dia > > Hi, > please find below the steps to reproduce the issue I am facing. > First I create a JSON with 2 fields: > * city_name > * cıty_name > The first one is valid ASCII, while the second contains a Unicode character (ı, > an i without a dot). > When I try to select from the dataframe I get an {noformat} > AnalysisException {noformat}. > {code:python} > $ pyspark > Python 3.4.3 (default, Sep 1 2016, 23:33:38) > [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux > Type "help", "copyright", "credits" or "license" for more information. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4040. > Attempting port 4041. > 17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4041. > Attempting port 4042. > 17/06/27 12:29:08 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive > is set, falling back to uploading libraries under SPARK_HOME. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 2.1.0 > /_/ > Using Python version 3.4.3 (default, Sep 1 2016 23:33:38) > SparkSession available as 'spark'. > >>> sc=spark.sparkContext > >>> js = ['{"city_name": "paris"}' > ... , '{"city_name": "rome"}' > ... , '{"city_name": "berlin"}' > ... , '{"cıty_name": "new-york"}' > ... , '{"cıty_name": "toronto"}' > ... , '{"cıty_name": "chicago"}' > ... , '{"cıty_name": "dubai"}'] > >>> myRDD = sc.parallelize(js) > >>> myDF = spark.read.json(myRDD) > >>> myDF.printSchema() > >>> > root > |-- city_name: string (nullable = true) > |-- cıty_name: string (nullable = true) > >>> myDF.select(myDF['city_name']) > Traceback (most recent call last): > File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco > return f(*a, **kw) > File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line > 319, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o226.apply. 
> : org.apache.spark.sql.AnalysisException: Reference 'city_name' is ambiguous, > could be: city_name#29, city_name#30.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:264) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168) > at org.apache.spark.sql.Dataset.resolve(Dataset.scala:217) > at org.apache.spark.sql.Dataset.col(Dataset.scala:1073) > at org.apache.spark.sql.Dataset.apply(Dataset.scala:1059) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Thread.java:745) > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "", line 1, in > File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 943, in > __getitem__ > jc = self._jdf.apply(item) > File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", > line 1133, in __call__ > File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco > raise AnalysisException(s.split(': ', 1)[1], stackTrace) > pyspark.sql.utils.AnalysisException: "Reference 'city_name' is ambiguous, > could be: city_name#29, city_name#30.;" > {code} -- This message was sent by Atlassian JIRA
[jira] [Assigned] (SPARK-21238) allow nested SQL execution
[ https://issues.apache.org/jira/browse/SPARK-21238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21238: Assignee: Wenchen Fan (was: Apache Spark) > allow nested SQL execution > -- > > Key: SPARK-21238 > URL: https://issues.apache.org/jira/browse/SPARK-21238 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21238) allow nested SQL execution
[ https://issues.apache.org/jira/browse/SPARK-21238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066130#comment-16066130 ] Apache Spark commented on SPARK-21238: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/18450 > allow nested SQL execution > -- > > Key: SPARK-21238 > URL: https://issues.apache.org/jira/browse/SPARK-21238 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21238) allow nested SQL execution
[ https://issues.apache.org/jira/browse/SPARK-21238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21238: Assignee: Apache Spark (was: Wenchen Fan) > allow nested SQL execution > -- > > Key: SPARK-21238 > URL: https://issues.apache.org/jira/browse/SPARK-21238 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21238) allow nested SQL execution
Wenchen Fan created SPARK-21238: --- Summary: allow nested SQL execution Key: SPARK-21238 URL: https://issues.apache.org/jira/browse/SPARK-21238 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21237) Invalidate stats once table data is changed
[ https://issues.apache.org/jira/browse/SPARK-21237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21237: Assignee: Apache Spark > Invalidate stats once table data is changed > --- > > Key: SPARK-21237 > URL: https://issues.apache.org/jira/browse/SPARK-21237 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21237) Invalidate stats once table data is changed
[ https://issues.apache.org/jira/browse/SPARK-21237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066097#comment-16066097 ] Apache Spark commented on SPARK-21237: -- User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/18449 > Invalidate stats once table data is changed > --- > > Key: SPARK-21237 > URL: https://issues.apache.org/jira/browse/SPARK-21237 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21237) Invalidate stats once table data is changed
[ https://issues.apache.org/jira/browse/SPARK-21237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21237: Assignee: (was: Apache Spark) > Invalidate stats once table data is changed > --- > > Key: SPARK-21237 > URL: https://issues.apache.org/jira/browse/SPARK-21237 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21237) Invalidate stats once table data is changed
Zhenhua Wang created SPARK-21237: Summary: Invalidate stats once table data is changed Key: SPARK-21237 URL: https://issues.apache.org/jira/browse/SPARK-21237 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Zhenhua Wang -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21233) Support pluggable offset storage
[ https://issues.apache.org/jira/browse/SPARK-21233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066088#comment-16066088 ] Sean Owen commented on SPARK-21233: --- Where would you put it instead? Kafka already provides for storing offsets in Kafka. This sounds like a big change and I don't see detail or design here. > Support pluggable offset storage > > > Key: SPARK-21233 > URL: https://issues.apache.org/jira/browse/SPARK-21233 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 2.0.2, 2.1.1 >Reporter: darion yaphet > > Currently we are using *ZooKeeper* to save the *Kafka Commit Offset*; when there > are a lot of streaming programs running in the cluster, the ZooKeeper cluster's > load is very high. Maybe ZooKeeper is not very suitable for saving offsets > periodically. > This issue is a wish to support a pluggable offset storage, to avoid saving it in > ZooKeeper -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21234) When the function returns Option[Iterator[_]] is None,then get on None will cause java.util.NoSuchElementException: None.get
[ https://issues.apache.org/jira/browse/SPARK-21234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21234. --- Resolution: Invalid Not if it's known the value exists. I don't see you've established any actual problem here. Please read http://spark.apache.org/contributing.html > When the function returns Option[Iterator[_]] is None,then get on None will > cause java.util.NoSuchElementException: None.get > > > Key: SPARK-21234 > URL: https://issues.apache.org/jira/browse/SPARK-21234 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.1.1 >Reporter: wangjiaochun > > Class BlockManager { > ... > def getLocalValues(blockId: BlockId): Option[BlockResult] = { > ... > memoryStore.getValues(blockId).get > ... > } > .. > } > The getValues call above can produce three kinds of results: > None, an IllegalArgumentException, or a normal value. If it returns None, calling get causes > java.util.NoSuchElementException: None.get, so I think this is a potential bug. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
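The pattern in dispute, in isolation: calling get on a None throws, while getOrElse (or pattern matching) makes the fallback explicit. Whether BlockManager can actually reach the None case is exactly what the resolution above contests.

{code:scala}
val values: Option[Iterator[Int]] = None

// values.get  // would throw java.util.NoSuchElementException: None.get

val safe: Iterator[Int] = values.getOrElse(Iterator.empty) // explicit fallback
println(safe.isEmpty) // true
{code}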
[jira] [Commented] (SPARK-20889) SparkR grouped documentation for Column methods
[ https://issues.apache.org/jira/browse/SPARK-20889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066033#comment-16066033 ] Apache Spark commented on SPARK-20889: -- User 'actuaryzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/18448 > SparkR grouped documentation for Column methods > --- > > Key: SPARK-20889 > URL: https://issues.apache.org/jira/browse/SPARK-20889 > Project: Spark > Issue Type: Documentation > Components: SparkR >Affects Versions: 2.1.1 >Reporter: Wayne Zhang >Assignee: Wayne Zhang > Labels: documentation > Fix For: 2.3.0 > > > Group the documentation of individual methods defined for the Column class. > This aims to create the following improvements: > - Centralized documentation for easy navigation (users can view multiple > related methods on one single page). > - Reduced number of items in Seealso. > - Better examples using shared data. This avoids creating a data frame for > each function if they are documented separately. And more importantly, users > can copy and paste to run them directly! > - Cleaner structure and far fewer Rd files (remove a large number of Rd > files). > - Remove duplicated definition of param (since they share exactly the same > argument). > - No need to write meaningless examples for trivial functions (because of > grouping). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org