[jira] [Commented] (SPARK-19778) alias cannot be used in GROUP BY
[ https://issues.apache.org/jira/browse/SPARK-19778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889700#comment-15889700 ]

Takeshi Yamamuro commented on SPARK-19778:
------------------------------------------

{code}
scala> Seq(("a", 0), ("b", 1)).toDF("key", "value").createOrReplaceTempView("t")

scala> sql("SELECT key AS key1 FROM t GROUP BY key1")
org.apache.spark.sql.AnalysisException: cannot resolve '`key1`' given input columns: [key, value]; line 1 pos 35;
'Aggregate ['key1], [key#15 AS key1#21]
+- SubqueryAlias t
   +- Project [_1#12 AS key#15, _2#13 AS value#16]
      +- LocalRelation [_1#12, _2#13]

  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:75)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:72)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at org.apache.spark.sql
{code}

This makes some sense to me because, for example, PostgreSQL accepts this query:

{code}
postgres=# create table t(key INT, value INT);
CREATE TABLE
postgres=# insert into t values(1, 0);
INSERT 0 1
postgres=# SELECT key AS key1 FROM t GROUP BY key1;
 key1
------
    1
(1 row)
{code}

> alias cannot be used in GROUP BY
> --------------------------------
>
>                 Key: SPARK-19778
>                 URL: https://issues.apache.org/jira/browse/SPARK-19778
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: xukun
>
> "select key as key1 from src group by key1" is not supported.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
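For a comparison outside Spark (illustrative only, not Spark code): SQLite, like PostgreSQL, resolves SELECT-list aliases in GROUP BY, which is the behavior this report asks for. A minimal sketch using Python's stdlib sqlite3 module:

```python
import sqlite3

# SQLite resolves the output alias `key1` in GROUP BY, like PostgreSQL does.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(key TEXT, value INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [("a", 0), ("a", 1), ("b", 2)])

# GROUP BY references the alias, not the source column name.
rows = conn.execute(
    "SELECT key AS key1 FROM t GROUP BY key1 ORDER BY key1"
).fetchall()
print(rows)  # [('a',), ('b',)]
```

In Spark 2.1 the workaround is to group by the underlying column ("GROUP BY key") rather than the alias.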
[jira] [Commented] (SPARK-19768) Error for both aggregate and non-aggregate queries in Structured Streaming - "This query does not support recovering from checkpoint location"
[ https://issues.apache.org/jira/browse/SPARK-19768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889698#comment-15889698 ]

Amit Baghel commented on SPARK-19768:
-------------------------------------

I am using an aggregate query with format="parquet" and outputMode="append", but it throws:

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets;;
{code}

> Error for both aggregate and non-aggregate queries in Structured Streaming - "This query does not support recovering from checkpoint location"
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19768
>                 URL: https://issues.apache.org/jira/browse/SPARK-19768
>             Project: Spark
>          Issue Type: Question
>          Components: Structured Streaming
>    Affects Versions: 2.1.0
>            Reporter: Amit Baghel
>
> I am running the JavaStructuredKafkaWordCount.java example with a
> checkpointLocation. Output mode is "complete". Below is the relevant code.
> {code}
> // Generate running word count
> Dataset<Row> wordCounts = lines.flatMap(new FlatMapFunction<String, String>() {
>   @Override
>   public Iterator<String> call(String x) {
>     return Arrays.asList(x.split(" ")).iterator();
>   }
> }, Encoders.STRING()).groupBy("value").count();
>
> // Start running the query that prints the running counts to the console
> StreamingQuery query = wordCounts.writeStream()
>   .outputMode("complete")
>   .format("console")
>   .option("checkpointLocation", "/tmp/checkpoint-data")
>   .start();
> {code}
> This example runs successfully and writes data in the checkpoint directory.
> When I re-run the program it throws the exception below:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: This query does not support recovering from checkpoint location. Delete /tmp/checkpoint-data/offsets to start over.;
>   at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:219)
>   at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
>   at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
>   at com.spark.JavaStructuredKafkaWordCount.main(JavaStructuredKafkaWordCount.java:85)
> {code}
> Then I modified JavaStructuredKafkaWordCount.java to use a non-aggregate query
> with output mode "append". Please see the code below.
> {code}
> // no aggregations
> Dataset<Row> wordCounts = lines.flatMap(new FlatMapFunction<String, String>() {
>   @Override
>   public Iterator<String> call(String x) {
>     return Arrays.asList(x.split(" ")).iterator();
>   }
> }, Encoders.STRING()).select("value");
>
> // append mode with console
> StreamingQuery query = wordCounts.writeStream()
>   .outputMode("append")
>   .format("console")
>   .option("checkpointLocation", "/tmp/checkpoint-data")
>   .start();
> {code}
> This modified code runs successfully and writes data in the checkpoint directory.
> When I re-run the program it throws the same exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: This query does not support recovering from checkpoint location. Delete /tmp/checkpoint-data/offsets to start over.;
>   at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:219)
>   at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
>   at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
>   at com.spark.JavaStructuredKafkaWordCount.main(JavaStructuredKafkaWordCount.java:85)
> {code}
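The exception message itself names the only recovery path for sinks that cannot resume: remove the offsets log under the checkpoint location and start over. A hypothetical sketch of that cleanup in plain Python (the directory layout is assumed from the error message, not taken from Spark internals):

```python
import shutil
import tempfile
from pathlib import Path

# Simulate a checkpoint location like /tmp/checkpoint-data with an offsets log.
checkpoint = Path(tempfile.mkdtemp()) / "checkpoint-data"
(checkpoint / "offsets").mkdir(parents=True)
(checkpoint / "offsets" / "0").write_text("{}")

def reset_offsets(cp: Path) -> None:
    """Remove only the offsets log, as the AnalysisException suggests."""
    offsets = cp / "offsets"
    if offsets.exists():
        shutil.rmtree(offsets)

reset_offsets(checkpoint)
print((checkpoint / "offsets").exists())  # False
```

Note this discards progress tracking, so the query reprocesses from the source's starting offsets.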
[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889677#comment-15889677 ]

Shridhar Ramachandran commented on SPARK-5159:
----------------------------------------------

I have faced this issue as well, on both 1.6 and 2.0. Some solutions have suggested setting hive.metastore.execute.setugi to true on both the metastore and the thrift server, but this did not help.

> Thrift server does not respect hive.server2.enable.doAs=true
> ------------------------------------------------------------
>
>                 Key: SPARK-5159
>                 URL: https://issues.apache.org/jira/browse/SPARK-5159
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: Andrew Ray
>         Attachments: spark_thrift_server_log.txt
>
> I'm currently testing the spark sql thrift server on a kerberos secured
> cluster in YARN mode. Currently any user can access any table regardless of
> HDFS permissions as all data is read as the hive user. In HiveServer2 the
> property hive.server2.enable.doAs=true causes all access to be done as the
> submitting user. We should do the same.
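For reference, this is the HiveServer2 setting the report asks Spark's thrift server to honor, as it would appear in hive-site.xml (a config fragment reflecting a typical deployment, shown here as an assumption rather than a verified Spark fix):

```xml
<property>
  <name>hive.server2.enable.doAs</name>
  <value>true</value>
  <description>Execute operations as the submitting user rather than the server user.</description>
</property>
```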
[jira] [Comment Edited] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889677#comment-15889677 ]

Shridhar Ramachandran edited comment on SPARK-5159 at 3/1/17 7:29 AM:
----------------------------------------------------------------------

I am facing this issue as well, on both 1.6 and 2.0. Some solutions have suggested setting hive.metastore.execute.setugi to true on both the metastore and the thrift server, but this did not help.

was (Author: shridharama):
I have faced this issue as well, on both 1.6 and 2.0. Some solutions have indicated setting hive.metastore.execute.setugi to true on the metastore as well as the thrift server, but this did not help.

> Thrift server does not respect hive.server2.enable.doAs=true
> ------------------------------------------------------------
>
>                 Key: SPARK-5159
>                 URL: https://issues.apache.org/jira/browse/SPARK-5159
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: Andrew Ray
>         Attachments: spark_thrift_server_log.txt
>
> I'm currently testing the spark sql thrift server on a kerberos secured
> cluster in YARN mode. Currently any user can access any table regardless of
> HDFS permissions as all data is read as the hive user. In HiveServer2 the
> property hive.server2.enable.doAs=true causes all access to be done as the
> submitting user. We should do the same.
[jira] [Commented] (SPARK-19758) Casting string to timestamp in inline table definition fails with AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-19758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889654#comment-15889654 ]

Apache Spark commented on SPARK-19758:
--------------------------------------

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/17114

> Casting string to timestamp in inline table definition fails with AnalysisException
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-19758
>                 URL: https://issues.apache.org/jira/browse/SPARK-19758
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Josh Rosen
>            Priority: Blocker
>
> The following query runs successfully on Spark 2.1.x but fails in the current master:
> {code}
> sql("""CREATE TEMPORARY VIEW table_4(timestamp_col_3) AS VALUES TIMESTAMP('1991-12-06 00:00:00.0')""")
> {code}
> Here's the error:
> {code}
> scala> sql("""CREATE TEMPORARY VIEW table_4(timestamp_col_3) AS VALUES TIMESTAMP('1991-12-06 00:00:00.0')""")
> org.apache.spark.sql.AnalysisException: failed to evaluate expression CAST('1991-12-06 00:00:00.0' AS TIMESTAMP): None.get; line 1 pos 50
>   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4$$anonfun$apply$4.apply(ResolveInlineTables.scala:105)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4$$anonfun$apply$4.apply(ResolveInlineTables.scala:95)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4.apply(ResolveInlineTables.scala:95)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4.apply(ResolveInlineTables.scala:94)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$.convert(ResolveInlineTables.scala:94)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:36)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:32)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$.apply(ResolveInlineTables.scala:32)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$.apply(ResolveInlineTables.scala:31)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:65)
>   at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:63)
>   at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:51)
>   at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:128)
>   at
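Note that the timestamp literal itself is well-formed; the "None.get" failure is in the analyzer's inline-table evaluation (ResolveInlineTables), not in the cast. The equivalent string-to-timestamp conversion in plain Python (illustrative only, outside Spark) succeeds:

```python
from datetime import datetime

# The same literal from the failing query; %f accepts the trailing ".0".
ts = datetime.strptime("1991-12-06 00:00:00.0", "%Y-%m-%d %H:%M:%S.%f")
print(ts)  # 1991-12-06 00:00:00
```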
[jira] [Assigned] (SPARK-19758) Casting string to timestamp in inline table definition fails with AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-19758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19758:
------------------------------------

    Assignee: (was: Apache Spark)

> Casting string to timestamp in inline table definition fails with AnalysisException
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-19758
>                 URL: https://issues.apache.org/jira/browse/SPARK-19758
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Josh Rosen
>            Priority: Blocker
>
> The following query runs successfully on Spark 2.1.x but fails in the current master:
> {code}
> sql("""CREATE TEMPORARY VIEW table_4(timestamp_col_3) AS VALUES TIMESTAMP('1991-12-06 00:00:00.0')""")
> {code}
> Here's the error:
> {code}
> scala> sql("""CREATE TEMPORARY VIEW table_4(timestamp_col_3) AS VALUES TIMESTAMP('1991-12-06 00:00:00.0')""")
> org.apache.spark.sql.AnalysisException: failed to evaluate expression CAST('1991-12-06 00:00:00.0' AS TIMESTAMP): None.get; line 1 pos 50
>   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4$$anonfun$apply$4.apply(ResolveInlineTables.scala:105)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4$$anonfun$apply$4.apply(ResolveInlineTables.scala:95)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4.apply(ResolveInlineTables.scala:95)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4.apply(ResolveInlineTables.scala:94)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$.convert(ResolveInlineTables.scala:94)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:36)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:32)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$.apply(ResolveInlineTables.scala:32)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$.apply(ResolveInlineTables.scala:31)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:65)
>   at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:63)
>   at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:51)
>   at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:128)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at
[jira] [Assigned] (SPARK-19758) Casting string to timestamp in inline table definition fails with AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-19758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19758:
------------------------------------

    Assignee: Apache Spark

> Casting string to timestamp in inline table definition fails with AnalysisException
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-19758
>                 URL: https://issues.apache.org/jira/browse/SPARK-19758
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Josh Rosen
>            Assignee: Apache Spark
>            Priority: Blocker
>
> The following query runs successfully on Spark 2.1.x but fails in the current master:
> {code}
> sql("""CREATE TEMPORARY VIEW table_4(timestamp_col_3) AS VALUES TIMESTAMP('1991-12-06 00:00:00.0')""")
> {code}
> Here's the error:
> {code}
> scala> sql("""CREATE TEMPORARY VIEW table_4(timestamp_col_3) AS VALUES TIMESTAMP('1991-12-06 00:00:00.0')""")
> org.apache.spark.sql.AnalysisException: failed to evaluate expression CAST('1991-12-06 00:00:00.0' AS TIMESTAMP): None.get; line 1 pos 50
>   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4$$anonfun$apply$4.apply(ResolveInlineTables.scala:105)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4$$anonfun$apply$4.apply(ResolveInlineTables.scala:95)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4.apply(ResolveInlineTables.scala:95)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4.apply(ResolveInlineTables.scala:94)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$.convert(ResolveInlineTables.scala:94)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:36)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:32)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$.apply(ResolveInlineTables.scala:32)
>   at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$.apply(ResolveInlineTables.scala:31)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:65)
>   at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:63)
>   at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:51)
>   at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:128)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at
[jira] [Resolved] (SPARK-19633) FileSource read from FileSink
[ https://issues.apache.org/jira/browse/SPARK-19633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu resolved SPARK-19633.
----------------------------------
       Resolution: Fixed
         Assignee: Liwei Lin
    Fix Version/s: 2.2.0

> FileSource read from FileSink
> -----------------------------
>
>                 Key: SPARK-19633
>                 URL: https://issues.apache.org/jira/browse/SPARK-19633
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>    Affects Versions: 2.1.0
>            Reporter: Michael Armbrust
>            Assignee: Liwei Lin
>            Priority: Critical
>             Fix For: 2.2.0
>
> Right now, you can't start a streaming query from a location that is being
> written to by the file sink.
[jira] [Commented] (SPARK-17495) Hive hash implementation
[ https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889645#comment-15889645 ]

Tejas Patil commented on SPARK-17495:
-------------------------------------

>> Is it possible to figure out the hashing function based on file names?

The way datasource files are named is different from Hive's, so this would work. I was thinking about a simpler way: use hive-hash only when writing to Hive bucketed tables. Since Spark doesn't support Hive bucketing at the moment, any old data has to have been generated by Hive, so this will not cause breakages for users.

>> 3. In general it'd be useful to allow users to configure which actual hash function "hash" maps to. This can be a dynamic config.

For any operations related to Hive bucketed tables, we should not let users change the hashing function; we should do the right thing underneath. Otherwise, users can shoot themselves in the foot (e.g. joining two Hive tables that are both bucketed but use different hashing functions). One option was to store the hashing function used to populate a table in the metastore, but this won't be compatible with Hive and would mess things up in environments where both Spark and Hive are used together.

As far as the simple hash() UDF / function is concerned, I am a bit conservative about adding a dynamic config, as I feel it might cause problems. Say you start off a session with the default murmur3 hash, compute some data, and cache it. Later the user switches to hive hash; reusing the cached data as-is would then no longer be correct. Keeping the setting static for a session avoids such problems.

> Hive hash implementation
> ------------------------
>
>                 Key: SPARK-17495
>                 URL: https://issues.apache.org/jira/browse/SPARK-17495
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Tejas Patil
>            Assignee: Tejas Patil
>            Priority: Minor
>             Fix For: 2.2.0
>
> Spark internally uses Murmur3Hash for partitioning. This is different from
> the hash used by Hive. For queries which use bucketing, this leads to different
> results if one tries the same query on both engines. We want users to have
> backward compatibility, so that one can switch parts of an application across
> the engines without observing regressions.
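The cross-engine hazard described above can be sketched with two stand-in hash functions (these are not Spark's Murmur3 or Hive's actual hash; they only illustrate the failure mode): the same key generally lands in different buckets under different hash functions, so bucket-pruned reads and bucketed joins across engines silently mismatch.

```python
# Illustrative stand-ins only: two engines bucketing by different hashes
# will place the same key in different buckets.
def hash_a(key: str) -> int:
    # Toy hash: sum of bytes.
    return sum(key.encode())

def hash_b(key: str) -> int:
    # Toy hash: 31-based polynomial, masked to a positive 31-bit int.
    h = 0
    for b in key.encode():
        h = (h * 31 + b) & 0x7FFFFFFF
    return h

buckets = 8
key = "user_42"
# The two engines disagree on which bucket holds this key.
print(hash_a(key) % buckets, hash_b(key) % buckets)
```

This is why the comment argues for always doing "the right thing underneath" for Hive bucketed tables rather than making the hash user-configurable.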
[jira] [Commented] (SPARK-19768) Error for both aggregate and non-aggregate queries in Structured Streaming - "This query does not support recovering from checkpoint location"
[ https://issues.apache.org/jira/browse/SPARK-19768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889639#comment-15889639 ]

Shixiong Zhu commented on SPARK-19768:
--------------------------------------

It should work for both aggregate and non-aggregate queries, but it only supports "append" mode.

> Error for both aggregate and non-aggregate queries in Structured Streaming - "This query does not support recovering from checkpoint location"
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19768
>                 URL: https://issues.apache.org/jira/browse/SPARK-19768
>             Project: Spark
>          Issue Type: Question
>          Components: Structured Streaming
>    Affects Versions: 2.1.0
>            Reporter: Amit Baghel
>
> I am running the JavaStructuredKafkaWordCount.java example with a
> checkpointLocation. Output mode is "complete". Below is the relevant code.
> {code}
> // Generate running word count
> Dataset<Row> wordCounts = lines.flatMap(new FlatMapFunction<String, String>() {
>   @Override
>   public Iterator<String> call(String x) {
>     return Arrays.asList(x.split(" ")).iterator();
>   }
> }, Encoders.STRING()).groupBy("value").count();
>
> // Start running the query that prints the running counts to the console
> StreamingQuery query = wordCounts.writeStream()
>   .outputMode("complete")
>   .format("console")
>   .option("checkpointLocation", "/tmp/checkpoint-data")
>   .start();
> {code}
> This example runs successfully and writes data in the checkpoint directory.
> When I re-run the program it throws the exception below:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: This query does not support recovering from checkpoint location. Delete /tmp/checkpoint-data/offsets to start over.;
>   at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:219)
>   at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
>   at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
>   at com.spark.JavaStructuredKafkaWordCount.main(JavaStructuredKafkaWordCount.java:85)
> {code}
> Then I modified JavaStructuredKafkaWordCount.java to use a non-aggregate query
> with output mode "append". Please see the code below.
> {code}
> // no aggregations
> Dataset<Row> wordCounts = lines.flatMap(new FlatMapFunction<String, String>() {
>   @Override
>   public Iterator<String> call(String x) {
>     return Arrays.asList(x.split(" ")).iterator();
>   }
> }, Encoders.STRING()).select("value");
>
> // append mode with console
> StreamingQuery query = wordCounts.writeStream()
>   .outputMode("append")
>   .format("console")
>   .option("checkpointLocation", "/tmp/checkpoint-data")
>   .start();
> {code}
> This modified code runs successfully and writes data in the checkpoint directory.
> When I re-run the program it throws the same exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: This query does not support recovering from checkpoint location. Delete /tmp/checkpoint-data/offsets to start over.;
>   at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:219)
>   at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
>   at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
>   at com.spark.JavaStructuredKafkaWordCount.main(JavaStructuredKafkaWordCount.java:85)
> {code}
[jira] [Commented] (SPARK-19752) OrcGetSplits fails with 0 size files
[ https://issues.apache.org/jira/browse/SPARK-19752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889616#comment-15889616 ] Liang-Chi Hsieh commented on SPARK-19752: - >From the log, looks like it is a problem in Hive? > OrcGetSplits fails with 0 size files > > > Key: SPARK-19752 > URL: https://issues.apache.org/jira/browse/SPARK-19752 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.0 >Reporter: Nick Orka > > There is a possibility that during some sql queries a partition may have a 0 > size file (empty file). Next time when I try to read from the file by sql > query, I'm getting this error: > 17/02/27 10:33:11 INFO PerfLogger: start=1488191591570 end=1488191591599 duration=29 > from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl> > 17/02/27 10:33:11 ERROR ApplicationMaster: User class threw exception: > java.lang.reflect.InvocationTargetException > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > scala.reflect.runtime.JavaMirrors$JavaMirror$JavaVanillaMethodMirror1.jinvokeraw(JavaMirrors.scala:373) > at > scala.reflect.runtime.JavaMirrors$JavaMirror$JavaMethodMirror.jinvoke(JavaMirrors.scala:339) > at > scala.reflect.runtime.JavaMirrors$JavaMirror$JavaVanillaMethodMirror.apply(JavaMirrors.scala:355) > at com.sessionm.Datapipeline$.main(Datapipeline.scala:200) > at com.sessionm.Datapipeline.main(Datapipeline.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627) > Caused by: java.lang.RuntimeException: serious problem > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:246) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:246) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:246) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84) > at > scala.collection.parallel.AugmentedIterableIterator$class.map2combiner(RemainsIterator.scala:115) > at > scala.collection.parallel.immutable.ParVector$ParVectorIterator.map2combiner(ParVector.scala:62) > at > scala.collection.parallel.ParIterableLike$Map.leaf(ParIterableLike.scala:1054) > at > scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49) > at > scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48) > at > 
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48) > at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51) > at > scala.collection.parallel.ParIterableLike$Map.tryLeaf(ParIterableLike.scala:1051) > at > scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:169) > at > scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:443) > at > scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:149) > at >
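One way to sidestep the crash while the Hive-side bug is open (a workaround sketch, not the eventual fix) is to filter zero-byte files out of the input path list before the ORC reader ever tries to generate splits. The helper below is a local-filesystem illustration in Python; on HDFS the equivalent check would use the Hadoop FileSystem API (`FileStatus.getLen()`).

```python
import os

def non_empty_files(root, suffix=".orc"):
    """Return paths under `root` ending in `suffix`, skipping zero-byte files.

    Local-filesystem sketch; a real pipeline on HDFS would make the same
    size check through the Hadoop FileSystem API instead of os.path.getsize.
    """
    kept = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            # Empty files are what trigger the OrcInputFormat split failure.
            if name.endswith(suffix) and os.path.getsize(path) > 0:
                kept.append(path)
    return sorted(kept)
```

Passing only the surviving paths to the reader avoids handing `OrcInputFormat.generateSplitsInfo` an empty file at all.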
[jira] [Commented] (SPARK-19764) Executors hang with supposedly running task that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889611#comment-15889611 ] Shixiong Zhu commented on SPARK-19764: -- These are master and workers. From the master log, you are using pyspark with the client mode. The driver logs should just output to the console. Could you paste the output of pyspark shell? > Executors hang with supposedly running task that are really finished. > - > > Key: SPARK-19764 > URL: https://issues.apache.org/jira/browse/SPARK-19764 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 LTS > OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > Spark 2.0.2 - Spark Cluster Manager >Reporter: Ari Gesher > Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, > SPARK-19764.tgz > > > We've come across a job that won't finish. Running on a six-node cluster, > each of the executors end up with 5-7 tasks that are never marked as > completed. > Here's an excerpt from the web UI: > ||Index ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch > Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result > Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read > Size / Records||Errors|| > |105 | 1131 | 0 | SUCCESS |PROCESS_LOCAL |4 / 172.31.24.171 | > 2017/02/27 22:51:36 | 1.9 min | 9 ms | 4 ms | 0.7 s | 2 ms| 6 ms| > 384.1 MB| 90.3 MB / 572 | | > |106| 1168| 0| RUNNING |ANY| 2 / 172.31.16.112| 2017/02/27 > 22:53:25|6.5 h |0 ms| 0 ms| 1 s |0 ms| 0 ms| |384.1 MB > |98.7 MB / 624 | | > However, the Executor reports the task as finished: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager) > {noformat} > As does the driver log: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > Full log from this executor and the {{stderr}} from > {{app-20170227223614-0001/2/stderr}} attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19460) Update dataset used in R documentation, examples to reduce warning noise and confusions
[ https://issues.apache.org/jira/browse/SPARK-19460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-19460. -- Resolution: Fixed Assignee: Miao Wang Fix Version/s: 2.2.0 Target Version/s: 2.2.0 > Update dataset used in R documentation, examples to reduce warning noise and > confusions > --- > > Key: SPARK-19460 > URL: https://issues.apache.org/jira/browse/SPARK-19460 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Miao Wang > Fix For: 2.2.0 > > > Running build we have a bunch of warnings from using the `iris` dataset, for > example. > Warning in FUN(X[[1L]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > Warning in FUN(X[[2L]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > Warning in FUN(X[[3L]], ...) : > Use Petal_Length instead of Petal.Length as column name > Warning in FUN(X[[4L]], ...) : > Use Petal_Width instead of Petal.Width as column name > Warning in FUN(X[[1L]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > Warning in FUN(X[[2L]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > Warning in FUN(X[[3L]], ...) : > Use Petal_Length instead of Petal.Length as column name > Warning in FUN(X[[4L]], ...) : > Use Petal_Width instead of Petal.Width as column name > Warning in FUN(X[[1L]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > Warning in FUN(X[[2L]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > Warning in FUN(X[[3L]], ...) : > Use Petal_Length instead of Petal.Length as column name > These are the results of having `.` in the column name. For reference, see > SPARK-12191, SPARK-11976. Since it involves changing SQL, if we couldn't > support that there then we should strongly consider using other dataset > without `.`, eg. 
`cars` > And we should update this in API doc (roxygen2 doc string), vignettes, > programming guide, R code example. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
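The warnings come from SparkR mechanically substituting `_` for `.` in column names when a data.frame is converted. A short Python sketch of that renaming (the helper name is illustrative, not SparkR's internal function) shows why every dotted `iris` column fires exactly one warning:

```python
def sanitize_columns(names, warn=None):
    """Replace '.' with '_' in column names, reporting each rename.

    Mirrors the renaming SparkR applies when a data.frame with dotted
    column names (like iris) is turned into a SparkDataFrame.
    """
    renamed = []
    for name in names:
        fixed = name.replace(".", "_")
        if fixed != name and warn is not None:
            warn("Use %s instead of %s as column name" % (fixed, name))
        renamed.append(fixed)
    return renamed
```

Switching the docs to a dataset whose columns contain no dots (e.g. `cars`) makes the rename a no-op, which is exactly why it silences the warnings.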
[jira] [Commented] (SPARK-6951) History server slow startup if the event log directory is large
[ https://issues.apache.org/jira/browse/SPARK-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889588#comment-15889588 ] Zheng Shao commented on SPARK-6951: --- Did we consider using a distributed store to solve the scalability issue? Single-node LevelDB will hit scalability issue soon. > History server slow startup if the event log directory is large > --- > > Key: SPARK-6951 > URL: https://issues.apache.org/jira/browse/SPARK-6951 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Matt Cheah > > I started my history server, then navigated to the web UI where I expected to > be able to view some completed applications, but the webpage was not > available. It turned out that the History Server was not finished parsing all > of the event logs in the event log directory that I had specified. I had > accumulated a lot of event logs from months of running Spark, so it would > have taken a very long time for the History Server to crunch through them > all. I purged the event log directory and started from scratch, and the UI > loaded immediately. > We should have a pagination strategy or parse the directory lazily to avoid > needing to wait after starting the history server. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
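The lazy-parsing idea proposed above can be sketched as a generator that lists event logs newest-first and defers opening each one until the UI actually requests it (assuming, for illustration, one log file per application):

```python
import os

def iter_event_logs(log_dir):
    """Yield event-log paths newest-first without parsing any of them.

    A history server built on this can serve the page for the most recent
    applications before older logs are ever opened, instead of blocking
    startup on a full crunch of the directory.
    """
    paths = [os.path.join(log_dir, f) for f in os.listdir(log_dir)]
    for path in sorted(paths, key=os.path.getmtime, reverse=True):
        yield path
```

Pagination then falls out naturally: take the first N items of the generator for page one, the next N for page two, and so on.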
[jira] [Resolved] (SPARK-19572) Allow to disable hive in sparkR shell
[ https://issues.apache.org/jira/browse/SPARK-19572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-19572. -- Resolution: Fixed Assignee: Jeff Zhang Fix Version/s: 2.2.0 2.1.1 Target Version/s: 2.1.1, 2.2.0 > Allow to disable hive in sparkR shell > - > > Key: SPARK-19572 > URL: https://issues.apache.org/jira/browse/SPARK-19572 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Jeff Zhang >Assignee: Jeff Zhang >Priority: Minor > Fix For: 2.1.1, 2.2.0 > > > SPARK-15236 do this for scala shell, this ticket is for sparkR shell. This > is not only for sparkR itself, but can also benefit downstream project like > livy which use shell.R for its interactive session. For now, livy has no > control of whether enable hive or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19764) Executors hang with supposedly running task that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889566#comment-15889566 ] Ari Gesher commented on SPARK-19764: And here's the stuck Executor: {noformat} Full thread dump OpenJDK 64-Bit Server VM (25.121-b13 mixed mode): "shuffle-server-1" #30 daemon prio=5 os_prio=0 tid=0x7fdf9400d800 nid=0xd22 runnable [0x7fdfc4726000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) - locked <0xc014b990> (a io.netty.channel.nio.SelectedSelectionKeySet) - locked <0xc014da10> (a java.util.Collections$UnmodifiableSet) - locked <0xc014b8f8> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97) at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:622) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:310) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) "File appending thread for /home/ubuntu/app-20170228223629-/1/stderr" #56 daemon prio=5 os_prio=0 tid=0x7fdf7803b000 nid=0xce7 runnable [0x7fdfc4827000] java.lang.Thread.State: RUNNABLE at java.io.FileInputStream.readBytes(Native Method) at java.io.FileInputStream.read(FileInputStream.java:255) at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) - locked <0xf16a3010> (a java.lang.UNIXProcess$ProcessPipeInputStream) at java.io.FilterInputStream.read(FilterInputStream.java:107) at org.apache.spark.util.logging.FileAppender$$anonfun$appendStreamToFile$1.apply$mcV$sp(FileAppender.scala:68) at org.apache.spark.util.logging.FileAppender$$anonfun$appendStreamToFile$1.apply(FileAppender.scala:62) at 
org.apache.spark.util.logging.FileAppender$$anonfun$appendStreamToFile$1.apply(FileAppender.scala:62) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1310) at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:78) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1953) at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) "File appending thread for /home/ubuntu/app-20170228223629-/1/stdout" #55 daemon prio=5 os_prio=0 tid=0x7fdf78036800 nid=0xcde runnable [0x7fdfc4e2b000] java.lang.Thread.State: RUNNABLE at java.io.FileInputStream.readBytes(Native Method) at java.io.FileInputStream.read(FileInputStream.java:255) at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) - locked <0xf16a0f50> (a java.lang.UNIXProcess$ProcessPipeInputStream) at java.io.FilterInputStream.read(FilterInputStream.java:107) at org.apache.spark.util.logging.FileAppender$$anonfun$appendStreamToFile$1.apply$mcV$sp(FileAppender.scala:68) at org.apache.spark.util.logging.FileAppender$$anonfun$appendStreamToFile$1.apply(FileAppender.scala:62) at org.apache.spark.util.logging.FileAppender$$anonfun$appendStreamToFile$1.apply(FileAppender.scala:62) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1310) at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:78) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) at 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1953) at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) "process reaper" #54 daemon prio=10 os_prio=0 tid=0x7fdf78032800 nid=0xcdc runnable [0x7fdfc6de2000] java.lang.Thread.State: RUNNABLE at java.lang.UNIXProcess.waitForProcessExit(Native Method) at java.lang.UNIXProcess.lambda$initStreams$3(UNIXProcess.java:289)
[jira] [Comment Edited] (SPARK-19764) Executors hang with supposedly running task that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889564#comment-15889564 ] Ari Gesher edited comment on SPARK-19764 at 3/1/17 6:05 AM: Nothing like that. Full logs in the attached tarball. Here's the stack trace from master: {noformat} Full thread dump OpenJDK 64-Bit Server VM (25.121-b13 mixed mode): "shuffle-server-7" #36 daemon prio=5 os_prio=0 tid=0x7f0764019800 nid=0xcd7 runnable [0x7f0720684000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) - locked <0xc0014950> (a io.netty.channel.nio.SelectedSelectionKeySet) - locked <0xc00169d0> (a java.util.Collections$UnmodifiableSet) - locked <0xc00148a8> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97) at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:622) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:310) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) "shuffle-server-6" #35 daemon prio=5 os_prio=0 tid=0x7f0764017800 nid=0xb6a runnable [0x7f07218f7000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) - locked <0xc0017198> (a io.netty.channel.nio.SelectedSelectionKeySet) - locked <0xc0019218> (a java.util.Collections$UnmodifiableSet) - locked <0xc0017100> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97) at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:622) at 
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:310) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) "shuffle-server-5" #34 daemon prio=5 os_prio=0 tid=0x7f0764015800 nid=0xb69 runnable [0x7f07219f8000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) - locked <0xc00199e0> (a io.netty.channel.nio.SelectedSelectionKeySet) - locked <0xc01eee80> (a java.util.Collections$UnmodifiableSet) - locked <0xc0019948> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97) at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:622) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:310) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) "shuffle-server-4" #33 daemon prio=5 os_prio=0 tid=0x7f0764014000 nid=0xb1f runnable [0x7f0721af9000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) - locked <0xc01ef648> (a io.netty.channel.nio.SelectedSelectionKeySet) - locked <0xc01f16c8> (a java.util.Collections$UnmodifiableSet) - locked <0xc01ef5b0> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97) at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:622) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:310) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at 
java.lang.Thread.run(Thread.java:745) "shuffle-server-3" #32 daemon prio=5 os_prio=0 tid=0x7f0764012000 nid=0xb0d runnable [0x7f0721786000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) at
[jira] [Commented] (SPARK-13669) Job will always fail in the external shuffle service unavailable situation
[ https://issues.apache.org/jira/browse/SPARK-13669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889562#comment-15889562 ] Apache Spark commented on SPARK-13669: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/17113 > Job will always fail in the external shuffle service unavailable situation > -- > > Key: SPARK-13669 > URL: https://issues.apache.org/jira/browse/SPARK-13669 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Reporter: Saisai Shao > > Currently we are running into an issue with YARN work preserving enabled + > external shuffle service. > In the work-preserving scenario, the failure of an NM does not lead to > the exit of executors, so executors can still accept and run tasks. The > problem here is that when the NM fails, the external shuffle service is actually > inaccessible, so reduce tasks will always complain about “Fetch failure”, > and the failure of the reduce stage will make the parent stage (map stage) rerun. > The tricky thing here is that the Spark scheduler is not aware of the unavailability > of the external shuffle service, and will reschedule the map tasks on the > executors whose NM failed; again the reduce stage will fail with > “Fetch failure”, and after 4 retries the job fails. > So the actual problem here is that Spark’s scheduler is not aware of the > unavailability of the external shuffle service, and will still assign tasks > to those nodes. The fix is to avoid assigning tasks to those nodes. > Currently in Spark, one related configuration is > “spark.scheduler.executorTaskBlacklistTime”, but I don’t think it will > work in this scenario. That configuration is used to avoid the same reattempted > task running on the same executor. Also, mechanisms like MapReduce’s blacklist > may not handle this scenario, since all the reduce tasks will > fail, so counting the failed tasks would equally mark all the executors as > “bad” ones. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
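The fix proposed in the description, keeping the scheduler from placing tasks on nodes whose shuffle service is gone, amounts to a node-level (rather than per-task, per-executor) exclusion list. A minimal sketch of that bookkeeping (class and method names are illustrative, not Spark's actual API):

```python
class NodeExclusionList:
    """Track fetch failures per host and exclude failing hosts from scheduling.

    Unlike spark.scheduler.executorTaskBlacklistTime, which keys on a task
    reattempt landing on the same executor, this keys on the host, so a dead
    NodeManager removes the whole node from consideration.
    """

    def __init__(self, max_fetch_failures=1):
        self.max_fetch_failures = max_fetch_failures
        self._failures = {}

    def record_fetch_failure(self, host):
        # Called by the scheduler when a reduce task reports a fetch failure
        # against this host's shuffle service.
        self._failures[host] = self._failures.get(host, 0) + 1

    def is_excluded(self, host):
        return self._failures.get(host, 0) >= self.max_fetch_failures

    def schedulable(self, hosts):
        """Filter candidate hosts before task assignment."""
        return [h for h in hosts if not self.is_excluded(h)]
```

Because exclusion is per host, the rescheduled map tasks described above would land on a healthy node instead of looping back onto the one with the unreachable shuffle service.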
[jira] [Assigned] (SPARK-13669) Job will always fail in the external shuffle service unavailable situation
[ https://issues.apache.org/jira/browse/SPARK-13669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13669: Assignee: Apache Spark > Job will always fail in the external shuffle service unavailable situation > -- > > Key: SPARK-13669 > URL: https://issues.apache.org/jira/browse/SPARK-13669 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Reporter: Saisai Shao >Assignee: Apache Spark > > Currently we are running into an issue with YARN work preserving enabled + > external shuffle service. > In the work-preserving scenario, the failure of an NM does not lead to > the exit of executors, so executors can still accept and run tasks. The > problem here is that when the NM fails, the external shuffle service is actually > inaccessible, so reduce tasks will always complain about “Fetch failure”, > and the failure of the reduce stage will make the parent stage (map stage) rerun. > The tricky thing here is that the Spark scheduler is not aware of the unavailability > of the external shuffle service, and will reschedule the map tasks on the > executors whose NM failed; again the reduce stage will fail with > “Fetch failure”, and after 4 retries the job fails. > So the actual problem here is that Spark’s scheduler is not aware of the > unavailability of the external shuffle service, and will still assign tasks > to those nodes. The fix is to avoid assigning tasks to those nodes. > Currently in Spark, one related configuration is > “spark.scheduler.executorTaskBlacklistTime”, but I don’t think it will > work in this scenario. That configuration is used to avoid the same reattempted > task running on the same executor. Also, mechanisms like MapReduce’s blacklist > may not handle this scenario, since all the reduce tasks will > fail, so counting the failed tasks would equally mark all the executors as > “bad” ones. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13669) Job will always fail in the external shuffle service unavailable situation
[ https://issues.apache.org/jira/browse/SPARK-13669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13669: Assignee: (was: Apache Spark) > Job will always fail in the external shuffle service unavailable situation > -- > > Key: SPARK-13669 > URL: https://issues.apache.org/jira/browse/SPARK-13669 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Reporter: Saisai Shao > > Currently we are running into an issue with YARN work preserving enabled + > external shuffle service. > In the work-preserving scenario, the failure of an NM does not lead to > the exit of executors, so executors can still accept and run tasks. The > problem here is that when the NM fails, the external shuffle service is actually > inaccessible, so reduce tasks will always complain about “Fetch failure”, > and the failure of the reduce stage will make the parent stage (map stage) rerun. > The tricky thing here is that the Spark scheduler is not aware of the unavailability > of the external shuffle service, and will reschedule the map tasks on the > executors whose NM failed; again the reduce stage will fail with > “Fetch failure”, and after 4 retries the job fails. > So the actual problem here is that Spark’s scheduler is not aware of the > unavailability of the external shuffle service, and will still assign tasks > to those nodes. The fix is to avoid assigning tasks to those nodes. > Currently in Spark, one related configuration is > “spark.scheduler.executorTaskBlacklistTime”, but I don’t think it will > work in this scenario. That configuration is used to avoid the same reattempted > task running on the same executor. Also, mechanisms like MapReduce’s blacklist > may not handle this scenario, since all the reduce tasks will > fail, so counting the failed tasks would equally mark all the executors as > “bad” ones. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19764) Executors hang with supposedly running task that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889564#comment-15889564 ] Ari Gesher commented on SPARK-19764: Nothing like that. Full logs in the attached tarball. Here's the stack trace from a stuck executor: {noformat} Full thread dump OpenJDK 64-Bit Server VM (25.121-b13 mixed mode): "shuffle-server-7" #36 daemon prio=5 os_prio=0 tid=0x7f0764019800 nid=0xcd7 runnable [0x7f0720684000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) - locked <0xc0014950> (a io.netty.channel.nio.SelectedSelectionKeySet) - locked <0xc00169d0> (a java.util.Collections$UnmodifiableSet) - locked <0xc00148a8> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97) at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:622) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:310) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) "shuffle-server-6" #35 daemon prio=5 os_prio=0 tid=0x7f0764017800 nid=0xb6a runnable [0x7f07218f7000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) - locked <0xc0017198> (a io.netty.channel.nio.SelectedSelectionKeySet) - locked <0xc0019218> (a java.util.Collections$UnmodifiableSet) - locked <0xc0017100> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97) at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:622) at 
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:310) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) "shuffle-server-5" #34 daemon prio=5 os_prio=0 tid=0x7f0764015800 nid=0xb69 runnable [0x7f07219f8000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) - locked <0xc00199e0> (a io.netty.channel.nio.SelectedSelectionKeySet) - locked <0xc01eee80> (a java.util.Collections$UnmodifiableSet) - locked <0xc0019948> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97) at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:622) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:310) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) "shuffle-server-4" #33 daemon prio=5 os_prio=0 tid=0x7f0764014000 nid=0xb1f runnable [0x7f0721af9000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) - locked <0xc01ef648> (a io.netty.channel.nio.SelectedSelectionKeySet) - locked <0xc01f16c8> (a java.util.Collections$UnmodifiableSet) - locked <0xc01ef5b0> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97) at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:622) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:310) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at 
java.lang.Thread.run(Thread.java:745) "shuffle-server-3" #32 daemon prio=5 os_prio=0 tid=0x7f0764012000 nid=0xb0d runnable [0x7f0721786000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) at
[jira] [Updated] (SPARK-19764) Executors hang with supposedly running task that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ari Gesher updated SPARK-19764: --- Attachment: SPARK-19764.tgz > Executors hang with supposedly running task that are really finished. > - > > Key: SPARK-19764 > URL: https://issues.apache.org/jira/browse/SPARK-19764 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 LTS > OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > Spark 2.0.2 - Spark Cluster Manager >Reporter: Ari Gesher > Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, > SPARK-19764.tgz > > > We've come across a job that won't finish. Running on a six-node cluster, > each of the executors end up with 5-7 tasks that are never marked as > completed. > Here's an excerpt from the web UI: > ||Index ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch > Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result > Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read > Size / Records||Errors|| > |105 | 1131 | 0 | SUCCESS |PROCESS_LOCAL |4 / 172.31.24.171 | > 2017/02/27 22:51:36 | 1.9 min | 9 ms | 4 ms | 0.7 s | 2 ms| 6 ms| > 384.1 MB| 90.3 MB / 572 | | > |106| 1168| 0| RUNNING |ANY| 2 / 172.31.16.112| 2017/02/27 > 22:53:25|6.5 h |0 ms| 0 ms| 1 s |0 ms| 0 ms| |384.1 MB > |98.7 MB / 624 | | > However, the Executor reports the task as finished: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > As does the driver log: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager) > {noformat} > Full log from this executor and the {{stderr}} from > {{app-20170227223614-0001/2/stderr}} attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19779) structured streaming exist residual tmp file
[ https://issues.apache.org/jira/browse/SPARK-19779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Gui updated SPARK-19779: - Description: The PR (https://github.com/apache/spark/pull/17012) can to fix restart a Structured Streaming application using hdfs as fileSystem, but also exist a problem that a tmp file of delta file is still reserved in hdfs. And Structured Streaming don't delete the tmp file generated when restart streaming job in future, so we need to delete the tmp file after restart streaming job. (was: The PR (https://github.com/apache/spark/pull/17012) can to fix restart a Structured Streaming application using hdfs as fileSystem, but that exist an problem that an tmp file of delta file is still reserved in hdfs. And Structured Streaming don't delete the tmp file which generated when restart streaming job.) > structured streaming exist residual tmp file > - > > Key: SPARK-19779 > URL: https://issues.apache.org/jira/browse/SPARK-19779 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Feng Gui >Priority: Minor > > The PR (https://github.com/apache/spark/pull/17012) can to fix restart a > Structured Streaming application using hdfs as fileSystem, but also exist a > problem that a tmp file of delta file is still reserved in hdfs. And > Structured Streaming don't delete the tmp file generated when restart > streaming job in future, so we need to delete the tmp file after restart > streaming job. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
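The cleanup the description asks for, deleting leftover temporary delta files when the streaming job restarts, can be sketched as a sweep of the state-store checkpoint directory. The `.tmp` suffix below is an assumption for illustration, standing in for however the temporary delta files are actually named:

```python
import os

def remove_stale_tmp_files(checkpoint_dir, suffix=".tmp"):
    """Delete leftover temporary files under a checkpoint directory.

    Intended to run once at restart, before the streaming query resumes.
    The suffix is a placeholder; on HDFS the same sweep would go through
    the Hadoop FileSystem API rather than os.walk/os.remove.
    """
    removed = []
    for dirpath, _, names in os.walk(checkpoint_dir):
        for name in names:
            if name.endswith(suffix):
                path = os.path.join(dirpath, name)
                os.remove(path)
                removed.append(path)
    return sorted(removed)
```

Running this at startup leaves only the committed delta files behind, which is the state the restarted query expects.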
[jira] [Created] (SPARK-19779) structured streaming exist residual tmp file
Feng Gui created SPARK-19779: Summary: structured streaming exist residual tmp file Key: SPARK-19779 URL: https://issues.apache.org/jira/browse/SPARK-19779 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Feng Gui Priority: Minor The PR (https://github.com/apache/spark/pull/17012) fixes restarting a Structured Streaming application that uses HDFS as the file system, but a problem remains: a tmp file for a delta file is still left behind in HDFS. Structured Streaming does not delete this tmp file, which is generated when the streaming job restarts. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
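The cleanup this ticket asks for — removing leftover temporary delta files when a restarted job starts up — can be sketched in plain Java. The `.tmp` suffix and the flat checkpoint-directory layout below are illustrative assumptions for this sketch, not Spark's actual state-store naming:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class TmpFileCleanup {
    // Delete leftover "*.tmp" files under a checkpoint directory.
    // The ".tmp" suffix and directory layout are assumptions for this sketch;
    // Spark's state store uses its own internal naming.
    static int deleteTmpFiles(Path checkpointDir) throws IOException {
        int deleted = 0;
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(checkpointDir, "*.tmp")) {
            for (Path p : stream) {
                Files.delete(p);
                deleted++;
            }
        }
        return deleted;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("checkpoint");
        Files.createFile(dir.resolve("1.delta.tmp")); // simulated leftover from a restart
        Files.createFile(dir.resolve("1.delta"));     // committed file, must survive
        System.out.println(deleteTmpFiles(dir));      // prints 1
        System.out.println(Files.exists(dir.resolve("1.delta"))); // prints true
    }
}
```

Running such a sweep once at startup, before the query begins, would remove residue from a previous run without touching committed delta files.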
[jira] [Updated] (SPARK-19776) Is the JavaKafkaWordCount example correct for Spark version 2.1?
[ https://issues.apache.org/jira/browse/SPARK-19776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Russell Abedin updated SPARK-19776: --- Description: My question is: is https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java correct? I'm pretty new to both Spark and Java. I wanted to find an example of Spark Streaming using Java, streaming from Kafka. The JavaKafkaWordCount at https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java looked to be perfect. However, when I tried running it, I found a couple of issues that I needed to overcome. 1. This line was unnecessary: {code} StreamingExamples.setStreamingLogLevels(); {code} Having this line in there (and the associated import) caused me to go looking for a dependency spark-examples_2.10 which is of no real use to me. 2. After running it, this line: {code} JavaPairReceiverInputDStream<String, String> messages = KafkaUtils.createStream(jssc, args[0], args[1], topicMap); {code} appeared to throw an error around logging: {code} Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/Logging at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:763) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:467) at java.net.URLClassLoader.access$100(URLClassLoader.java:73) at java.net.URLClassLoader$1.run(URLClassLoader.java:368) at java.net.URLClassLoader$1.run(URLClassLoader.java:362) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:361) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:91 at 
org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:66 at org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:11 at org.apache.spark.streaming.kafka.KafkaUtils.createStream(KafkaUtils.scala) at main.java.com.cm.JavaKafkaWordCount.main(JavaKafkaWordCount.java:72) {code} To get around this, I found that the code sample in https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html helped me to come up with the right lines to see streaming from Kafka in action. Specifically, this calls createDirectStream instead of createStream. So is the example in https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java correct, or is there something I could have done differently to get that example working? was: My question is https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java correct? I'm pretty new to both Spark and Java. I wanted to find an example of Spark Streaming using Java, streaming from Kafka. The JavaKafkaWordCount at https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java looked to be perfect. However, when I tried running it, I found a couple of issues that I needed to overcome. 1. This line was unnecessary: {code} StreamingExamples.setStreamingLogLevels(); {code} Having this line in there (and the associated import) caused me to go looking for a dependency spark-examples_2.10 which of no real use to me. 2. 
After running it, I this line: {code} JavaPairReceiverInputDStream messages = KafkaUtils.createStream(jssc, args[0], args[1], topicMap); {code} Appeared to throw an error around logging: {code} Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/Logging at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:763) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:467) at java.net.URLClassLoader.access$100(URLClassLoader.java:73) at java.net.URLClassLoader$1.run(URLClassLoader.java:368) at java.net.URLClassLoader$1.run(URLClassLoader.java:362) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:361) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:91 at
[jira] [Updated] (SPARK-19776) Is the JavaKafkaWordCount example correct for Spark version 2.1?
[ https://issues.apache.org/jira/browse/SPARK-19776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Russell Abedin updated SPARK-19776: --- Summary: Is the JavaKafkaWordCount example correct for Spark version 2.1? (was: Is the JavaKafkaWordCount correct on Spark version 2.1) > Is the JavaKafkaWordCount example correct for Spark version 2.1? > > > Key: SPARK-19776 > URL: https://issues.apache.org/jira/browse/SPARK-19776 > Project: Spark > Issue Type: Question > Components: Examples, ML >Affects Versions: 2.1.0 >Reporter: Russell Abedin > > My question is: is > https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java > correct? > I'm pretty new to both Spark and Java. I wanted to find an example of Spark > Streaming using Java, streaming from Kafka. The JavaKafkaWordCount at > https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java > looked to be perfect. > However, when I tried running it, I found a couple of issues that I needed to > overcome. > 1. This line was unnecessary: > {code} > StreamingExamples.setStreamingLogLevels(); > {code} > Having this line in there (and the associated import) caused me to go looking > for a dependency spark-examples_2.10 which is of no real use to me. > 2. 
After running it, this line: > {code} > JavaPairReceiverInputDStream<String, String> messages = > KafkaUtils.createStream(jssc, args[0], args[1], topicMap); > {code} > appeared to throw an error around logging: > {code} > Exception in thread "main" java.lang.NoClassDefFoundError: > org/apache/spark/Logging > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClass(ClassLoader.java:763) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at java.net.URLClassLoader.defineClass(URLClassLoader.java:467) > at java.net.URLClassLoader.access$100(URLClassLoader.java:73) > at java.net.URLClassLoader$1.run(URLClassLoader.java:368) > at java.net.URLClassLoader$1.run(URLClassLoader.java:362) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:361) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at > org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:91 > at > org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:66 > at > org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:11 > at > org.apache.spark.streaming.kafka.KafkaUtils.createStream(KafkaUtils.scala) > at > main.java.com.cm.JavaKafkaWordCount.main(JavaKafkaWordCount.java:72) > {code} > To get around this, I found that the code sample in > https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html > helped me to come up with the right lines to see streaming from Kafka in > action. Specifically, this calls createDirectStream instead of createStream. > So is the example in > https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java > correct, or is there something I could have done differently to get that example > working? 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19768) Error for both aggregate and non-aggregate queries in Structured Streaming - "This query does not support recovering from checkpoint location"
[ https://issues.apache.org/jira/browse/SPARK-19768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889472#comment-15889472 ] Amit Baghel edited comment on SPARK-19768 at 3/1/17 4:14 AM: - Thanks [~zsxwing] for clarification. Documentation for structured streaming is missing this piece of information. The error thrown in case of console sink with checkpoint should be more meaningful. I have one more question. Does file sink using "parquet" and checkpoint work only for non-aggregate query? I tried this for both aggregate and non-aggregate queries and I am getting exception for aggregate query. was (Author: baghelamit): Thanks [~zsxwing] for clarification. Documentation for structured streaming missing this piece of information. The error thrown in case of console sink with checkpoint should be more meaningful. I have one more question. Does file sink using "parquet" and checkpoint work only for non-aggregate query? I tried this for both aggregate and non-aggregate queries and I am getting exception for aggregate query. > Error for both aggregate and non-aggregate queries in Structured Streaming > - "This query does not support recovering from checkpoint location" > - > > Key: SPARK-19768 > URL: https://issues.apache.org/jira/browse/SPARK-19768 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Amit Baghel > > I am running JavaStructuredKafkaWordCount.java example with > checkpointLocation. Output mode is "complete". Below is relevant code. 
> {code} > // Generate running word count > Dataset<Row> wordCounts = lines.flatMap(new FlatMapFunction<String, String>() { > @Override > public Iterator<String> call(String x) { > return Arrays.asList(x.split(" ")).iterator(); > } > }, Encoders.STRING()).groupBy("value").count(); > // Start running the query that prints the running counts to the console > StreamingQuery query = wordCounts.writeStream() > .outputMode("complete") > .format("console") > .option("checkpointLocation", "/tmp/checkpoint-data") > .start(); > {code} > This example runs successfully and writes data in checkpoint directory. When > I re-run the program it throws the below exception > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: This query > does not support recovering from checkpoint location. Delete > /tmp/checkpoint-data/offsets to start over.; > at > org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:219) > at > org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269) > at > org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262) > at > com.spark.JavaStructuredKafkaWordCount.main(JavaStructuredKafkaWordCount.java:85) > {code} > Then I modified JavaStructuredKafkaWordCount.java to have a non-aggregate query > with output mode as "append". Please see the code below. > {code} > // no aggregations > Dataset<Row> wordCounts = lines.flatMap(new FlatMapFunction<String, String>() { > @Override > public Iterator<String> call(String x) { > return Arrays.asList(x.split(" ")).iterator(); > } > }, Encoders.STRING()).select("value"); > // append mode with console > StreamingQuery query = wordCounts.writeStream() > .outputMode("append") > .format("console") > .option("checkpointLocation", "/tmp/checkpoint-data") > .start(); > {code} > This modified code runs successfully and writes data in checkpoint directory. 
> When I re-run the program it throws same exception > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: This query > does not support recovering from checkpoint location. Delete > /tmp/checkpoint-data/offsets to start over.; > at > org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:219) > at > org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269) > at > org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262) > at > com.spark.JavaStructuredKafkaWordCount.main(JavaStructuredKafkaWordCount.java:85) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19768) Error for both aggregate and non-aggregate queries in Structured Streaming - "This query does not support recovering from checkpoint location"
[ https://issues.apache.org/jira/browse/SPARK-19768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889472#comment-15889472 ] Amit Baghel commented on SPARK-19768: - Thanks [~zsxwing] for clarification. Documentation for structured streaming is missing this piece of information. The error thrown in case of console sink with checkpoint should be more meaningful. I have one more question. Does file sink using "parquet" and checkpoint work only for non-aggregate query? I tried this for both aggregate and non-aggregate queries and I am getting exception for aggregate query. > Error for both aggregate and non-aggregate queries in Structured Streaming > - "This query does not support recovering from checkpoint location" > - > > Key: SPARK-19768 > URL: https://issues.apache.org/jira/browse/SPARK-19768 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Amit Baghel > > I am running JavaStructuredKafkaWordCount.java example with > checkpointLocation. Output mode is "complete". Below is relevant code. > {code} > // Generate running word count > Dataset<Row> wordCounts = lines.flatMap(new FlatMapFunction<String, String>() { > @Override > public Iterator<String> call(String x) { > return Arrays.asList(x.split(" ")).iterator(); > } > }, Encoders.STRING()).groupBy("value").count(); > // Start running the query that prints the running counts to the console > StreamingQuery query = wordCounts.writeStream() > .outputMode("complete") > .format("console") > .option("checkpointLocation", "/tmp/checkpoint-data") > .start(); > {code} > This example runs successfully and writes data in checkpoint directory. When > I re-run the program it throws the below exception > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: This query > does not support recovering from checkpoint location. 
Delete > /tmp/checkpoint-data/offsets to start over.; > at > org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:219) > at > org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269) > at > org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262) > at > com.spark.JavaStructuredKafkaWordCount.main(JavaStructuredKafkaWordCount.java:85) > {code} > Then I modified JavaStructuredKafkaWordCount.java to have a non-aggregate query > with output mode as "append". Please see the code below. > {code} > // no aggregations > Dataset<Row> wordCounts = lines.flatMap(new FlatMapFunction<String, String>() { > @Override > public Iterator<String> call(String x) { > return Arrays.asList(x.split(" ")).iterator(); > } > }, Encoders.STRING()).select("value"); > // append mode with console > StreamingQuery query = wordCounts.writeStream() > .outputMode("append") > .format("console") > .option("checkpointLocation", "/tmp/checkpoint-data") > .start(); > {code} > This modified code runs successfully and writes data in checkpoint directory. > When I re-run the program it throws the same exception > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: This query > does not support recovering from checkpoint location. Delete > /tmp/checkpoint-data/offsets to start over.; > at > org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:219) > at > org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269) > at > org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262) > at > com.spark.JavaStructuredKafkaWordCount.main(JavaStructuredKafkaWordCount.java:85) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16929) Speculation-related synchronization bottleneck in checkSpeculatableTasks
[ https://issues.apache.org/jira/browse/SPARK-16929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889435#comment-15889435 ] Apache Spark commented on SPARK-16929: -- User 'jinxing64' has created a pull request for this issue: https://github.com/apache/spark/pull/17112 > Speculation-related synchronization bottleneck in checkSpeculatableTasks > > > Key: SPARK-16929 > URL: https://issues.apache.org/jira/browse/SPARK-16929 > Project: Spark > Issue Type: Bug > Components: Scheduler >Reporter: Nicholas Brown > > Our cluster has been running slowly since I got speculation working. I looked > into it and noticed that stderr was saying some tasks were taking almost an > hour to run even though in the application logs on the nodes that task only > took a minute or so to run. Digging into the thread dump for the master node > I noticed a number of threads are blocked, apparently by the speculation thread. > At line 476 of TaskSchedulerImpl it grabs a lock on the TaskScheduler while > it looks through the tasks to see what needs to be rerun. Unfortunately that > code loops through each of the tasks, so when you have even just a couple > hundred thousand tasks to run that can be prohibitively slow to run inside of > a synchronized block. Once I disabled speculation, the job went back to > having acceptable performance. > There are no comments around that lock indicating why it was added, and the > git history seems to have a couple of refactorings, so it's hard to find where it > was added. I'm tempted to believe it is the result of someone assuming that > an extra synchronized block never hurt anyone (in reality I've seen probably > as many bugs caused by over-synchronization as by too little) as it looks too > broad to be actually guarding any potential concurrency issue. 
But, since > concurrency issues can be tricky to reproduce (and yes, I understand that's > an extreme understatement) I'm not sure just blindly removing it without > being familiar with the history is necessarily safe. > Can someone look into this? Or at least make a note in the documentation > that speculation should not be used with large clusters? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19778) alais cannot use in group by
xukun created SPARK-19778: - Summary: alais cannot use in group by Key: SPARK-19778 URL: https://issues.apache.org/jira/browse/SPARK-19778 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: xukun Spark does not support “select key as key1 from src group by key1” -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17495) Hive hash implementation
[ https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889432#comment-15889432 ] Reynold Xin commented on SPARK-17495: - Is it possible to figure out the hashing function based on file names? e.g. are the Spark files named differently from Hive files? > Hive hash implementation > > > Key: SPARK-17495 > URL: https://issues.apache.org/jira/browse/SPARK-17495 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > Fix For: 2.2.0 > > > Spark internally uses Murmur3Hash for partitioning. This is different from > the one used by Hive. For queries which use bucketing this leads to different > results if one tries the same query on both engines. We want users to have > backward compatibility so that one can switch parts of applications > across the engines without observing regressions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19754) Casting to int from a JSON-parsed float rounds instead of truncating
[ https://issues.apache.org/jira/browse/SPARK-19754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889419#comment-15889419 ] Takeshi Yamamuro commented on SPARK-19754: -- cc: [~hyukjin.kwon] What do you think of this? > Casting to int from a JSON-parsed float rounds instead of truncating > > > Key: SPARK-19754 > URL: https://issues.apache.org/jira/browse/SPARK-19754 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.1.0 >Reporter: Juan Pumarino >Priority: Minor > > When retrieving a float value from a JSON document, and then casting it to an > integer, Hive simply truncates it, while Spark rounds up when the > first decimal digit is >= 5. > In Hive, the following query returns {{1}}, whereas in a Spark shell the > result is {{2}}. > {code} > SELECT CAST(get_json_object('{"a": 1.6}', '$.a') AS INT) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
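The divergence described in this report boils down to truncation versus rounding semantics. As an analogy in plain Java (this is not the code path either engine actually executes; Java's narrowing cast truncates toward zero, which matches the Hive behavior described, while rounding matches the reported Spark result):

```java
public class CastDemo {
    public static void main(String[] args) {
        double v = 1.6;
        // Truncation toward zero, the Hive behavior described in the report.
        System.out.println((int) v);        // prints 1
        // Rounding, the Spark 1.6/2.1 behavior described in the report.
        System.out.println(Math.round(v));  // prints 2
        // Truncation toward zero also holds for negative values.
        System.out.println((int) -1.6);     // prints -1
    }
}
```

The practical consequence is exactly the one in the quoted SQL: the same `CAST(... AS INT)` yields 1 on one engine and 2 on the other for the input 1.6.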
[jira] [Commented] (SPARK-14698) CREATE FUNCTION cloud not add function to hive metastore
[ https://issues.apache.org/jira/browse/SPARK-14698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889416#comment-15889416 ] Shawn Lavelle commented on SPARK-14698: --- [~poseidon] Would you be willing (and still able) to upload the patch? > CREATE FUNCTION cloud not add function to hive metastore > > > Key: SPARK-14698 > URL: https://issues.apache.org/jira/browse/SPARK-14698 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: spark1.6.1 >Reporter: poseidon > Labels: easyfix > > Build Spark 1.6.1 and run it with Hive version 1.2.1, with MySQL configured as the > metastore server. > Start a Thrift server, then in Beeline try to CREATE FUNCTION as a Hive SQL > UDF. > It turns out the FUNCTION cannot be added to the MySQL metastore, but the function > usage works fine. > If you try to add it again, the Thrift server throws an "already exists" exception. > [SPARK-10151][SQL] (Support invocation of hive macro) > added an if condition in runSqlHive, which executes CREATE FUNCTION in > Hive exec; this caused the problem. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18769) Spark to be smarter about what the upper bound is and to restrict number of executor when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-18769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889412#comment-15889412 ] Marcelo Vanzin edited comment on SPARK-18769 at 3/1/17 3:04 AM: bq. What do you mean by this? I mean that, if I remember the code correctly, Spark's code will generate one container request per container it wants. So if the scheduler decides it wants 50k containers, the Spark allocator will be sending 50k request objects per heartbeat. That's memory used in the Spark allocator code, and results in memory used in the RM that's receiving the message. And if the RM code is naive, it will look at all those requests too when trying to allocate resources (increasing latency for the reply). was (Author: vanzin): bq. What do you mean by this? I mean that, if I remember the code correctly, Spark's code will generate one container request per container it wants. So if the schedulers decides it wants 50k containers, the Spark allocator will be sending 50k request objects per heartbeat. That's memory used in the Spark allocator code, and results in memory used in the RM that's receiving the message. And if the RM code is naive, it will look at all those requests too. > Spark to be smarter about what the upper bound is and to restrict number of > executor when dynamic allocation is enabled > > > Key: SPARK-18769 > URL: https://issues.apache.org/jira/browse/SPARK-18769 > Project: Spark > Issue Type: New Feature >Reporter: Neerja Khattar > > Currently, when dynamic allocation is enabled, the maximum number of executors is > unbounded, and Spark can create so many executors that it exceeds the YARN NodeManager > memory and vcore limits. > There should be a check to not exceed the YARN resource limits. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18769) Spark to be smarter about what the upper bound is and to restrict number of executor when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-18769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889412#comment-15889412 ] Marcelo Vanzin commented on SPARK-18769: bq. What do you mean by this? I mean that, if I remember the code correctly, Spark's code will generate one container request per container it wants. So if the scheduler decides it wants 50k containers, the Spark allocator will be sending 50k request objects per heartbeat. That's memory used in the Spark allocator code, and results in memory used in the RM that's receiving the message. And if the RM code is naive, it will look at all those requests too. > Spark to be smarter about what the upper bound is and to restrict number of > executor when dynamic allocation is enabled > > > Key: SPARK-18769 > URL: https://issues.apache.org/jira/browse/SPARK-18769 > Project: Spark > Issue Type: New Feature >Reporter: Neerja Khattar > > Currently, when dynamic allocation is enabled, the maximum number of executors is > unbounded, and Spark can create so many executors that it exceeds the YARN NodeManager > memory and vcore limits. > There should be a check to not exceed the YARN resource limits. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
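The check the reporter asks for can be sketched as a simple clamp: before requesting containers, bound the target by what the cluster can actually host. Every name and number below is invented for illustration; this is not Spark's ExecutorAllocationManager or YARN allocator code, and real capacity figures would come from the ResourceManager:

```java
public class ExecutorCapDemo {
    // Clamp the requested executor count to the cluster's capacity,
    // computed from total memory and vcores. All names and numbers here
    // are hypothetical stand-ins for values the ResourceManager would report.
    static int cappedExecutorTarget(int wanted,
                                    long clusterMemoryMb, long clusterVcores,
                                    long executorMemoryMb, long executorVcores) {
        long byMemory = clusterMemoryMb / executorMemoryMb;
        long byCores = clusterVcores / executorVcores;
        long capacity = Math.min(byMemory, byCores);
        return (int) Math.min((long) wanted, capacity);
    }

    public static void main(String[] args) {
        // The scheduler wants 50,000 executors, but an 800 GB / 1600-vcore
        // cluster only fits 200 executors of 4 GB / 4 vcores each.
        System.out.println(cappedExecutorTarget(50_000, 800_000, 1600, 4_000, 4)); // prints 200
    }
}
```

Such a cap would also shrink the per-heartbeat request traffic described in the comment above, since the allocator would never ask for more containers than the cluster can place.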
[jira] [Assigned] (SPARK-19777) Scan runningTasksSet when check speculatable tasks in TaskSetManager.
[ https://issues.apache.org/jira/browse/SPARK-19777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19777: Assignee: (was: Apache Spark) > Scan runningTasksSet when check speculatable tasks in TaskSetManager. > - > > Key: SPARK-19777 > URL: https://issues.apache.org/jira/browse/SPARK-19777 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: jin xing >Priority: Minor > > When checking speculatable tasks in TaskSetManager, scan only runningTasksSet > instead of all taskInfos. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19777) Scan runningTasksSet when check speculatable tasks in TaskSetManager.
[ https://issues.apache.org/jira/browse/SPARK-19777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19777: Assignee: Apache Spark > Scan runningTasksSet when check speculatable tasks in TaskSetManager. > - > > Key: SPARK-19777 > URL: https://issues.apache.org/jira/browse/SPARK-19777 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: jin xing >Assignee: Apache Spark >Priority: Minor > > When checking speculatable tasks in TaskSetManager, scan only runningTasksSet > instead of all taskInfos. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19777) Scan runningTasksSet when check speculatable tasks in TaskSetManager.
[ https://issues.apache.org/jira/browse/SPARK-19777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889401#comment-15889401 ] Apache Spark commented on SPARK-19777: -- User 'jinxing64' has created a pull request for this issue: https://github.com/apache/spark/pull/17111 > Scan runningTasksSet when check speculatable tasks in TaskSetManager. > - > > Key: SPARK-19777 > URL: https://issues.apache.org/jira/browse/SPARK-19777 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: jin xing >Priority: Minor > > When checking speculatable tasks in TaskSetManager, scan only runningTasksSet > instead of all taskInfos. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19777) Scan runningTasksSet when check speculatable tasks in TaskSetManager.
jin xing created SPARK-19777: Summary: Scan runningTasksSet when check speculatable tasks in TaskSetManager. Key: SPARK-19777 URL: https://issues.apache.org/jira/browse/SPARK-19777 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 2.1.0 Reporter: jin xing Priority: Minor When checking speculatable tasks in TaskSetManager, scan only runningTasksSet instead of all taskInfos. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
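The improvement this ticket proposes is easy to sketch with plain collections. The names `taskInfos` and `runningTasksSet` mirror the ticket text, but the types below are invented stand-ins for illustration, not Spark's TaskSetManager code:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SpeculationScanDemo {
    public static void main(String[] args) {
        // Hypothetical stand-ins for TaskSetManager state: taskInfos holds every
        // launched task (true = still running); runningTasksSet only the running ids.
        Map<Long, Boolean> taskInfos = new HashMap<>();
        Set<Long> runningTasksSet = new HashSet<>();
        for (long i = 0; i < 100_000; i++) {
            boolean running = i % 1_000 == 0; // only a few tasks are still running
            taskInfos.put(i, running);
            if (running) runningTasksSet.add(i);
        }
        // Before SPARK-19777: every speculation check walks all taskInfos entries.
        int candidatesFullScan = 0;
        for (Map.Entry<Long, Boolean> e : taskInfos.entrySet()) {
            if (e.getValue()) candidatesFullScan++;
        }
        // After: walk only the running set — same candidates, 1000x fewer entries here.
        int candidatesRunningScan = runningTasksSet.size();
        System.out.println(candidatesFullScan == candidatesRunningScan); // prints true
        System.out.println(candidatesRunningScan);                       // prints 100
    }
}
```

Finished tasks can never become speculation candidates, so restricting the scan to the running set preserves the result while shrinking the work done inside the lock discussed in SPARK-16929 above.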
[jira] [Assigned] (SPARK-19635) Feature parity for Chi-square hypothesis testing in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19635: Assignee: Apache Spark (was: Joseph K. Bradley) > Feature parity for Chi-square hypothesis testing in MLlib > - > > Key: SPARK-19635 > URL: https://issues.apache.org/jira/browse/SPARK-19635 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Apache Spark > > This ticket tracks porting the functionality of > spark.mllib.Statistics.chiSqTest over to spark.ml. > Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19635) Feature parity for Chi-square hypothesis testing in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19635: Assignee: Joseph K. Bradley (was: Apache Spark) > Feature parity for Chi-square hypothesis testing in MLlib > - > > Key: SPARK-19635 > URL: https://issues.apache.org/jira/browse/SPARK-19635 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Joseph K. Bradley > > This ticket tracks porting the functionality of > spark.mllib.Statistics.chiSqTest over to spark.ml. > Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19635) Feature parity for Chi-square hypothesis testing in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889386#comment-15889386 ] Apache Spark commented on SPARK-19635: -- User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/17110 > Feature parity for Chi-square hypothesis testing in MLlib > - > > Key: SPARK-19635 > URL: https://issues.apache.org/jira/browse/SPARK-19635 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Joseph K. Bradley > > This ticket tracks porting the functionality of > spark.mllib.Statistics.chiSqTest over to spark.ml. > Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
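The functionality being ported (Pearson's chi-squared statistic, as computed by spark.mllib's chiSqTest) reduces to a short formula. A minimal sketch of the underlying computation only, not the spark.ml API being designed in this ticket:

```python
def chi_sq_statistic(observed, expected):
    """Pearson's chi-squared statistic: sum over categories of
    (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

The ticket's work is wrapping this kind of computation behind a DataFrame-based API rather than changing the statistic itself.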
[jira] [Commented] (SPARK-18769) Spark to be smarter about what the upper bound is and to restrict number of executor when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-18769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889373#comment-15889373 ] Yuming Wang commented on SPARK-18769: - How about this approach: https://github.com/apache/spark/pull/16819 > Spark to be smarter about what the upper bound is and to restrict number of > executor when dynamic allocation is enabled > > > Key: SPARK-18769 > URL: https://issues.apache.org/jira/browse/SPARK-18769 > Project: Spark > Issue Type: New Feature >Reporter: Neerja Khattar > > Currently, when dynamic allocation is enabled, max.executor is infinite and > Spark creates so many executors that it can exceed the YARN NodeManager > memory and vcore limits. > It should have a check to not exceed the YARN resource limits. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
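The requested check boils down to capping the configured maximum by what the cluster can actually hold. A rough arithmetic sketch (the function name and the sample YARN figures below are hypothetical, not Spark configuration keys):

```python
def effective_max_executors(configured_max, cluster_mem_mb, cluster_vcores,
                            executor_mem_mb, executor_cores):
    """Cap the dynamic-allocation upper bound by cluster capacity:
    an executor needs both memory and vcores, so take the tighter limit."""
    fit_by_memory = cluster_mem_mb // executor_mem_mb
    fit_by_vcores = cluster_vcores // executor_cores
    return min(configured_max, fit_by_memory, fit_by_vcores)
```

With an effectively infinite configured maximum (the current default), the cluster's own memory and vcore totals become the binding limit instead of requesting executors without bound.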
[jira] [Updated] (SPARK-19764) Executors hang with supposedly running task that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-19764: Attachment: netty-6153.jpg something like this: !netty-6153.jpg! > Executors hang with supposedly running task that are really finished. > - > > Key: SPARK-19764 > URL: https://issues.apache.org/jira/browse/SPARK-19764 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 LTS > OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > Spark 2.0.2 - Spark Cluster Manager >Reporter: Ari Gesher > Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg > > > We've come across a job that won't finish. Running on a six-node cluster, > each of the executors end up with 5-7 tasks that are never marked as > completed. > Here's an excerpt from the web UI: > ||Index ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch > Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result > Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read > Size / Records||Errors|| > |105 | 1131 | 0 | SUCCESS |PROCESS_LOCAL |4 / 172.31.24.171 | > 2017/02/27 22:51:36 | 1.9 min | 9 ms | 4 ms | 0.7 s | 2 ms| 6 ms| > 384.1 MB| 90.3 MB / 572 | | > |106| 1168| 0| RUNNING |ANY| 2 / 172.31.16.112| 2017/02/27 > 22:53:25|6.5 h |0 ms| 0 ms| 1 s |0 ms| 0 ms| |384.1 MB > |98.7 MB / 624 | | > However, the Executor reports the task as finished: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > As does the driver log: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager) > {noformat} > Full log from this executor and the {{stderr}} from > {{app-20170227223614-0001/2/stderr}} attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19740) Spark executor always runs as root when running on mesos
[ https://issues.apache.org/jira/browse/SPARK-19740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19740: Assignee: (was: Apache Spark) > Spark executor always runs as root when running on mesos > > > Key: SPARK-19740 > URL: https://issues.apache.org/jira/browse/SPARK-19740 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Ji Yan >Priority: Minor > > When running Spark on Mesos with docker containerizer, the spark executors > are always launched with 'docker run' command without specifying --user > option, which always results in spark executors running as root. Mesos has a > way to support arbitrary parameters. Spark could use that to expose setting > user > background on mesos with arbitrary parameters support: > https://issues.apache.org/jira/browse/MESOS-1816 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19740) Spark executor always runs as root when running on mesos
[ https://issues.apache.org/jira/browse/SPARK-19740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19740: Assignee: Apache Spark > Spark executor always runs as root when running on mesos > > > Key: SPARK-19740 > URL: https://issues.apache.org/jira/browse/SPARK-19740 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Ji Yan >Assignee: Apache Spark >Priority: Minor > > When running Spark on Mesos with docker containerizer, the spark executors > are always launched with 'docker run' command without specifying --user > option, which always results in spark executors running as root. Mesos has a > way to support arbitrary parameters. Spark could use that to expose setting > user > background on mesos with arbitrary parameters support: > https://issues.apache.org/jira/browse/MESOS-1816 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19740) Spark executor always runs as root when running on mesos
[ https://issues.apache.org/jira/browse/SPARK-19740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889331#comment-15889331 ] Apache Spark commented on SPARK-19740: -- User 'yanji84' has created a pull request for this issue: https://github.com/apache/spark/pull/17109 > Spark executor always runs as root when running on mesos > > > Key: SPARK-19740 > URL: https://issues.apache.org/jira/browse/SPARK-19740 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Ji Yan >Priority: Minor > > When running Spark on Mesos with docker containerizer, the spark executors > are always launched with 'docker run' command without specifying --user > option, which always results in spark executors running as root. Mesos has a > way to support arbitrary parameters. Spark could use that to expose setting > user > background on mesos with arbitrary parameters support: > https://issues.apache.org/jira/browse/MESOS-1816 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19211) Explicitly prevent Insert into View or Create View As Insert
[ https://issues.apache.org/jira/browse/SPARK-19211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889270#comment-15889270 ] Jiang Xingbo commented on SPARK-19211: -- I've been working on it this week and perhaps I'll submit a PR later today. :) > Explicitly prevent Insert into View or Create View As Insert > > > Key: SPARK-19211 > URL: https://issues.apache.org/jira/browse/SPARK-19211 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Jiang Xingbo > > Currently we don't explicitly forbid the following behaviors: > 1. The statement CREATE VIEW AS INSERT INTO throws the following exception > from SQLBuilder: > `java.lang.UnsupportedOperationException: unsupported plan InsertIntoTable > MetastoreRelation default, tbl, false, false`; > 2. The statement INSERT INTO view VALUES throws the following exception from > checkAnalysis: > `Error in query: Inserting into an RDD-based table is not allowed.;;` > We should check for these behaviors earlier and explicitly prevent them. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19636) Feature parity for correlation statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19636: Assignee: Apache Spark (was: Tim Hunter) > Feature parity for correlation statistics in MLlib > -- > > Key: SPARK-19636 > URL: https://issues.apache.org/jira/browse/SPARK-19636 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Apache Spark > > This ticket tracks porting the functionality of spark.mllib.Statistics.corr() > over to spark.ml. > Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19636) Feature parity for correlation statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19636: Assignee: Tim Hunter (was: Apache Spark) > Feature parity for correlation statistics in MLlib > -- > > Key: SPARK-19636 > URL: https://issues.apache.org/jira/browse/SPARK-19636 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Tim Hunter > > This ticket tracks porting the functionality of spark.mllib.Statistics.corr() > over to spark.ml. > Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19636) Feature parity for correlation statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889220#comment-15889220 ] Apache Spark commented on SPARK-19636: -- User 'thunterdb' has created a pull request for this issue: https://github.com/apache/spark/pull/17108 > Feature parity for correlation statistics in MLlib > -- > > Key: SPARK-19636 > URL: https://issues.apache.org/jira/browse/SPARK-19636 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Tim Hunter > > This ticket tracks porting the functionality of spark.mllib.Statistics.corr() > over to spark.ml. > Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
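The default statistic behind Statistics.corr() (Pearson correlation) is covariance over the product of standard deviations. A minimal sketch of the computation being wrapped, not the spark.ml API under design:

```python
def pearson_corr(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5
```

As with the chi-squared sub-task, the porting work is about exposing this over DataFrames, not changing the statistic.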
[jira] [Updated] (SPARK-19776) Is the JavaKafkaWordCount correct on Spark version 2.1
[ https://issues.apache.org/jira/browse/SPARK-19776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Russell Abedin updated SPARK-19776: --- Summary: Is the JavaKafkaWordCount correct on Spark version 2.1 (was: JavaKafkaWordCount calls createStream on version 2.1) > Is the JavaKafkaWordCount correct on Spark version 2.1 > -- > > Key: SPARK-19776 > URL: https://issues.apache.org/jira/browse/SPARK-19776 > Project: Spark > Issue Type: Question > Components: Examples, ML >Affects Versions: 2.1.0 >Reporter: Russell Abedin > > My question is > https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java > correct? > I'm pretty new to both Spark and Java. I wanted to find an example of Spark > Streaming using Java, streaming from Kafka. The JavaKafkaWordCount at > https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java > looked to be perfect. > However, when I tried running it, I found a couple of issues that I needed to > overcome. > 1. This line was unnecessary: > {code} > StreamingExamples.setStreamingLogLevels(); > {code} > Having this line in there (and the associated import) caused me to go looking > for a dependency spark-examples_2.10 which of no real use to me. > 2. 
After running it, I found that this line: > {code} > JavaPairReceiverInputDStream messages = > KafkaUtils.createStream(jssc, args[0], args[1], topicMap); > {code} > appeared to throw an error around logging: > {code} > Exception in thread "main" java.lang.NoClassDefFoundError: > org/apache/spark/Logging > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClass(ClassLoader.java:763) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at java.net.URLClassLoader.defineClass(URLClassLoader.java:467) > at java.net.URLClassLoader.access$100(URLClassLoader.java:73) > at java.net.URLClassLoader$1.run(URLClassLoader.java:368) > at java.net.URLClassLoader$1.run(URLClassLoader.java:362) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:361) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at > org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:91 > at > org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:66 > at > org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:11 > at > org.apache.spark.streaming.kafka.KafkaUtils.createStream(KafkaUtils.scala) > at > main.java.com.cm.JavaKafkaWordCount.main(JavaKafkaWordCount.java:72) > {code} > To get around this, I found that the code sample in > https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html > helped me to come up with the right lines to see streaming from Kafka in > action. Specifically this called createDirectStream instead of createStream. > So is the example in > https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java > correct, or is there something I could have done differently to get that example > working? 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19776) JavaKafkaWordCount calls createStream on version 2.1
[ https://issues.apache.org/jira/browse/SPARK-19776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Russell Abedin updated SPARK-19776: --- Affects Version/s: (was: 1.6.1) (was: 1.5.2) (was: 1.4.1) (was: 2.0.0) 2.1.0 > JavaKafkaWordCount calls createStream on version 2.1 > > > Key: SPARK-19776 > URL: https://issues.apache.org/jira/browse/SPARK-19776 > Project: Spark > Issue Type: Question > Components: Examples, ML >Affects Versions: 2.1.0 >Reporter: Russell Abedin > > My question is > https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java > correct? > I'm pretty new to both Spark and Java. I wanted to find an example of Spark > Streaming using Java, streaming from Kafka. The JavaKafkaWordCount at > https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java > looked to be perfect. > However, when I tried running it, I found a couple of issues that I needed to > overcome. > 1. This line was unnecessary: > {code} > StreamingExamples.setStreamingLogLevels(); > {code} > Having this line in there (and the associated import) caused me to go looking > for a dependency spark-examples_2.10 which of no real use to me. > 2. 
After running it, I found that this line: > {code} > JavaPairReceiverInputDStream messages = > KafkaUtils.createStream(jssc, args[0], args[1], topicMap); > {code} > appeared to throw an error around logging: > {code} > Exception in thread "main" java.lang.NoClassDefFoundError: > org/apache/spark/Logging > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClass(ClassLoader.java:763) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at java.net.URLClassLoader.defineClass(URLClassLoader.java:467) > at java.net.URLClassLoader.access$100(URLClassLoader.java:73) > at java.net.URLClassLoader$1.run(URLClassLoader.java:368) > at java.net.URLClassLoader$1.run(URLClassLoader.java:362) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:361) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at > org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:91 > at > org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:66 > at > org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:11 > at > org.apache.spark.streaming.kafka.KafkaUtils.createStream(KafkaUtils.scala) > at > main.java.com.cm.JavaKafkaWordCount.main(JavaKafkaWordCount.java:72) > {code} > To get around this, I found that the code sample in > https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html > helped me to come up with the right lines to see streaming from Kafka in > action. Specifically this called createDirectStream instead of createStream. > So is the example in > https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java > correct, or is there something I could have done differently to get that example > working? 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19776) JavaKafkaWordCount calls createStream on version 2.1
[ https://issues.apache.org/jira/browse/SPARK-19776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Russell Abedin updated SPARK-19776: --- Description: My question is: is https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java correct? I'm pretty new to both Spark and Java. I wanted to find an example of Spark Streaming using Java, streaming from Kafka. The JavaKafkaWordCount at https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java looked to be perfect. However, when I tried running it, I found a couple of issues that I needed to overcome. 1. This line was unnecessary: {code} StreamingExamples.setStreamingLogLevels(); {code} Having this line in there (and the associated import) caused me to go looking for a dependency spark-examples_2.10 which was of no real use to me. 2. After running it, I found that this line: {code} JavaPairReceiverInputDStream messages = KafkaUtils.createStream(jssc, args[0], args[1], topicMap); {code} appeared to throw an error around logging: {code} Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/Logging at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:763) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:467) at java.net.URLClassLoader.access$100(URLClassLoader.java:73) at java.net.URLClassLoader$1.run(URLClassLoader.java:368) at java.net.URLClassLoader$1.run(URLClassLoader.java:362) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:361) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:91 at 
org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:66 at org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:11 at org.apache.spark.streaming.kafka.KafkaUtils.createStream(KafkaUtils.scala) at main.java.com.cm.JavaKafkaWordCount.main(JavaKafkaWordCount.java:72) {code} To get around this, I found that the code sample in https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html helped me to come up with the right lines to see streaming from Kafka in action. Specifically this called createDirectStream instead of createStream. So is the example in https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java correct, or is there something I could have done differently to get that example working? was: This bug is reported by http://apache-spark-developers-list.1001551.n3.nabble.com/Creation-of-SparkML-Estimators-in-Java-broken-td17710.html . > JavaKafkaWordCount calls createStream on version 2.1 > > > Key: SPARK-19776 > URL: https://issues.apache.org/jira/browse/SPARK-19776 > Project: Spark > Issue Type: Question > Components: Examples, ML >Affects Versions: 1.4.1, 1.5.2, 1.6.1, 2.0.0 >Reporter: Russell Abedin > > My question is: is > https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java > correct? > I'm pretty new to both Spark and Java. I wanted to find an example of Spark > Streaming using Java, streaming from Kafka. The JavaKafkaWordCount at > https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java > looked to be perfect. > However, when I tried running it, I found a couple of issues that I needed to > overcome. > 1. 
This line was unnecessary: > {code} > StreamingExamples.setStreamingLogLevels(); > {code} > Having this line in there (and the associated import) caused me to go looking > for a dependency spark-examples_2.10 which was of no real use to me. > 2. After running it, I found that this line: > {code} > JavaPairReceiverInputDStream messages = > KafkaUtils.createStream(jssc, args[0], args[1], topicMap); > {code} > appeared to throw an error around logging: > {code} > Exception in thread "main" java.lang.NoClassDefFoundError: > org/apache/spark/Logging > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClass(ClassLoader.java:763) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at java.net.URLClassLoader.defineClass(URLClassLoader.java:467) >
[jira] [Comment Edited] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)
[ https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889180#comment-15889180 ] Mingjie Tang edited comment on SPARK-19771 at 3/1/17 12:30 AM: --- If we follow the AND-OR framework, one optimization is here: At first, suppose we use BucketedRandomProjectionLSH, and setNumHashTables(3) and setNumHashFunctions(5). For one input vector with sparse format e.g., [(6,[0,1,2],[1.0,1.0,1.0])], we can compute its mapped hash vectors is WrappedArray([-1.0,-1.0,-1.0,0.0,0.0], [-1.0,0.0,0.0,-1.0,-1.0], [-1.0,-1.0,-1.0,-1.0,0.0])] For the similarity-join, we map this computed vector into ++-++ |datasetA|entry| hashValue| ++-++ |[0,(6,[0,1,2],[1|0|[-1.0,-1.0,-1.0,0...| |[0,(6,[0,1,2],[1|1|[-1.0,0.0,0.0,-1| |[0,(6,[0,1,2],[1|2|[-1.0,-1.0,-1.0,-...| Then, based on AND-OR principle, we using the entry and hashValue for equip-join with other table . When we look at the the sql, it need to iterate through the hashValue vector for equal join. Thus, this computation and memory usage cost is very high. For example, if we apply the nest-loop for comparing two vectors with length of NumHashFunctions, the computation cost is (NumHashFunctions* NumHashFunctions) and memory overhead is (N* NumHashFunctions). Therefore, we can apply one optimization technique here. that is, we do not need to store the hash value for each hash table like [-1.0,-1.0,-1.0,0.0,0.0], instead, we use a simple hash function to improve this. Suppose h_l= {h_0, h_1., ... h_k}, where k is the number of setNumHashFunctions. then, the mapped hash function is H(h_l)=sum_{i =0...k} (h_i*r_i) mod prime. where r_i is a integer and the prime can be set as 2^32-5 for less hash collision. Then, we only need to store the hash value H(h_l) for AND operation. The similar approach also can be applied for the approxNeasetNeighbors searching. 
If the multi-probe approach does not need to read the hash value for one hash function (e.g., h_l mentioned above), we can applied this H(h_l) to the preprocessed data to save memory. I will double check the multi-probe approach and see whether it is work. Then, I can submit a patch to improve the current way. Note, this hash function to reduce the vector-to-vector comparing is widely used in different places for example, the LSH implementation by Andoni . [~yunn] [~mlnick] was (Author: merlintang): If we follow the AND-OR framework, one optimization is here: At first, suppose we use BucketedRandomProjectionLSH, and setNumHashTables(3) and setNumHashFunctions(5). For one input vector with sparse format e.g., [(6,[0,1,2],[1.0,1.0,1.0])], we can compute its mapped hash vectors is WrappedArray([-1.0,-1.0,-1.0,0.0,0.0], [-1.0,0.0,0.0,-1.0,-1.0], [-1.0,-1.0,-1.0,-1.0,0.0])] For the similarity-join, we map this computed vector into ++-++ |datasetA|entry| hashValue| ++-++ |[0,(6,[0,1,2],[1|0|[-1.0,-1.0,-1.0,0...| |[0,(6,[0,1,2],[1|1|[-1.0,0.0,0.0,-1| |[0,(6,[0,1,2],[1|2|[-1.0,-1.0,-1.0,-...| Then, based on AND-OR principle, we using the entry and hashValue for equip-join with other table . When we look at the the sql, it need to iterate through the hashValue vector for equal join. Thus, this computation and memory usage cost is very high. For example, if we apply the nest-loop for comparing two vectors with length of NumHashFunctions, the computation cost is (NumHashFunctions* NumHashFunctions) and memory overhead is (N* NumHashFunctions). Therefore, we can apply one optimization technique here. that is, we do not need to store the hash value for each hash table like [-1.0,-1.0,-1.0,0.0,0.0], instead, we use a simple hash function to improve this. Suppose h_l= {h_0, h_1., ... h_k}, where k is the number of setNumHashFunctions. then, the mapped hash function is H(h_l)=sum_{i =0...k} (h_i*r_i) mod prime. 
where r_i is a integer and the prime can be set as 2^32-5 for less hash collision. Then, we only need to store the hash value H(h_l) for AND operation. The similar approach also can be applied for the approxNeasetNeighbors searching. If the multi-probe approach does not need to read the hash value for one hash function (e.g., h_l mentioned above), we can applied this H(h_l) to the preprocessed data to save memory. I will double check the multi-probe approach and see whether it is work. Then, I can submit a patch to improve the current way. Note, this hash function to reduce the vector-to-vector comparing is widely used in different places for example, the LSH implementation by Andoni . [~diefunction] [~mlnick] > Support OR-AND
[jira] [Assigned] (SPARK-19635) Feature parity for Chi-square hypothesis testing in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-19635: - Assignee: Joseph K. Bradley > Feature parity for Chi-square hypothesis testing in MLlib > - > > Key: SPARK-19635 > URL: https://issues.apache.org/jira/browse/SPARK-19635 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Joseph K. Bradley > > This ticket tracks porting the functionality of > spark.mllib.Statistics.chiSqTest over to spark.ml. > Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19635) Feature parity for Chi-square hypothesis testing in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889190#comment-15889190 ] Joseph K. Bradley commented on SPARK-19635: --- That PR for trees looks pretty different. This task is just for wrapping chiSqTest with a DataFrame API. I'm going to take this one. > Feature parity for Chi-square hypothesis testing in MLlib > - > > Key: SPARK-19635 > URL: https://issues.apache.org/jira/browse/SPARK-19635 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.Statistics.chiSqTest over to spark.ml. > Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889181#comment-15889181 ] Joseph K. Bradley commented on SPARK-19634: --- I'll assign this to [~timhunter] given the time pressure for 2.2 > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)
[ https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889180#comment-15889180 ] Mingjie Tang commented on SPARK-19771: -- If we follow the AND-OR framework, one optimization is possible. Suppose we use BucketedRandomProjectionLSH with setNumHashTables(3) and setNumHashFunctions(5). For an input vector in sparse format, e.g. [(6,[0,1,2],[1.0,1.0,1.0])], its mapped hash vectors are WrappedArray([-1.0,-1.0,-1.0,0.0,0.0], [-1.0,0.0,0.0,-1.0,-1.0], [-1.0,-1.0,-1.0,-1.0,0.0]). For the similarity join, this computed vector is exploded into one row per hash table:
+--------------------+-----+--------------------+
|datasetA            |entry|hashValue           |
+--------------------+-----+--------------------+
|[0,(6,[0,1,2],[1... |0    |[-1.0,-1.0,-1.0,0...|
|[0,(6,[0,1,2],[1... |1    |[-1.0,0.0,0.0,-1... |
|[0,(6,[0,1,2],[1... |2    |[-1.0,-1.0,-1.0,-...|
+--------------------+-----+--------------------+
Then, based on the AND-OR principle, we use (entry, hashValue) for an equi-join with the other table. Looking at the SQL, the equality check has to iterate through the whole hashValue vector, so the computation and memory cost is high: a nested-loop comparison of two vectors of length NumHashFunctions costs (NumHashFunctions * NumHashFunctions), and the memory overhead is (N * NumHashFunctions). We can therefore apply an optimization: instead of storing the full hash vector for each hash table, such as [-1.0,-1.0,-1.0,0.0,0.0], we map it through a simple hash function. Suppose h_l = {h_0, h_1, ..., h_k}, where k is the number set by setNumHashFunctions; then H(h_l) = sum_{i=0..k} (h_i * r_i) mod prime, where each r_i is an integer and the prime can be set to 2^32-5 for fewer hash collisions. We then only need to store the single value H(h_l) for the AND operation. The same approach can also be applied to the approxNearestNeighbors search.
If the multi-probe approach does not need to read the individual hash values of one hash table (e.g., h_l mentioned above), we can apply this H(h_l) to the preprocessed data to save memory. I will double-check the multi-probe approach and see whether this works; then I can submit a patch to improve the current implementation. Note that using a hash function to avoid vector-to-vector comparison is widely used in different places, for example in the LSH implementation by Andoni. [~diefunction] [~mlnick] > Support OR-AND amplification in Locality Sensitive Hashing (LSH) > > > Key: SPARK-19771 > URL: https://issues.apache.org/jira/browse/SPARK-19771 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Yun Ni > > The current LSH implementation only supports AND-OR amplification. We need to > discuss the following questions before we go to implementation: > (1) Whether we should support OR-AND amplification > (2) What API changes we need for OR-AND amplification > (3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally.
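The compression described above can be sketched in a few lines. This is a minimal illustration in Python, not Spark's implementation; the coefficients r_i, the helper name `compress`, and the sample values are chosen for this sketch, with only the prime 2^32 - 5 taken from the comment above.

```python
# Sketch: compress one hash table's k hash values into a single integer,
# so the AND step compares one scalar instead of k values.
# Assumption: coefficient values and names are illustrative, not Spark's.

PRIME = 2**32 - 5  # suggested above to keep hash collisions rare

def compress(hash_values, coeffs):
    """H(h_l) = sum_i (h_i * r_i) mod prime -- one scalar per hash table."""
    assert len(hash_values) == len(coeffs)
    return sum(int(h) * r for h, r in zip(hash_values, coeffs)) % PRIME

# Three hash tables (setNumHashTables(3)), five hash functions each
# (setNumHashFunctions(5)), matching the example vectors above.
coeffs = [3, 7, 11, 13, 17]  # fixed random integers r_i
tables = [[-1.0, -1.0, -1.0, 0.0, 0.0],
          [-1.0, 0.0, 0.0, -1.0, -1.0],
          [-1.0, -1.0, -1.0, -1.0, 0.0]]

signatures = [compress(t, coeffs) for t in tables]

# Equality of full length-k vectors becomes a single integer comparison:
print(compress([-1.0, -1.0, -1.0, 0.0, 0.0], coeffs) == signatures[0])  # True
print(signatures[0] == signatures[1])                                   # False
```

The AND step then compares one integer per hash table instead of NumHashFunctions values, which is where the claimed computation and memory savings come from.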
[jira] [Assigned] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-19634: - Assignee: Timothy Hunter > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19776) JavaKafkaWordCount calls createStream on version 2.1
[ https://issues.apache.org/jira/browse/SPARK-19776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Russell Abedin updated SPARK-19776: --- Issue Type: Question (was: Bug) > JavaKafkaWordCount calls createStream on version 2.1 > > > Key: SPARK-19776 > URL: https://issues.apache.org/jira/browse/SPARK-19776 > Project: Spark > Issue Type: Question > Components: Examples, ML >Affects Versions: 1.4.1, 1.5.2, 1.6.1, 2.0.0 >Reporter: Russell Abedin > > This bug is reported by > http://apache-spark-developers-list.1001551.n3.nabble.com/Creation-of-SparkML-Estimators-in-Java-broken-td17710.html > . -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19776) JavaKafkaWordCount calls createStream on version 2.1
Russell Abedin created SPARK-19776: -- Summary: JavaKafkaWordCount calls createStream on version 2.1 Key: SPARK-19776 URL: https://issues.apache.org/jira/browse/SPARK-19776 Project: Spark Issue Type: Bug Components: Examples, ML Affects Versions: 1.4.1, 1.5.2, 1.6.1, 2.0.0 Reporter: Russell Abedin This bug is reported by http://apache-spark-developers-list.1001551.n3.nabble.com/Creation-of-SparkML-Estimators-in-Java-broken-td17710.html . -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19382) Test sparse vectors in LinearSVCSuite
[ https://issues.apache.org/jira/browse/SPARK-19382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-19382: -- Shepherd: Joseph K. Bradley > Test sparse vectors in LinearSVCSuite > - > > Key: SPARK-19382 > URL: https://issues.apache.org/jira/browse/SPARK-19382 > Project: Spark > Issue Type: Test > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > Currently, LinearSVCSuite does not test sparse vectors. We should. I > recommend that generateSVMInput be modified to create a mix of dense and > sparse vectors, rather than adding an additional test. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14503) spark.ml Scala API for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-14503. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 15415 [https://github.com/apache/spark/pull/15415] > spark.ml Scala API for FPGrowth > --- > > Key: SPARK-14503 > URL: https://issues.apache.org/jira/browse/SPARK-14503 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: yuhao yang > Fix For: 2.2.0 > > > This task is the first port of spark.mllib.fpm functionality to spark.ml > (Scala). > This will require a brief design doc to confirm a reasonable DataFrame-based > API, with details for this class. The doc could also look ahead to the other > fpm classes, especially if their API decisions will affect FPGrowth. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18579) spark-csv strips whitespace (pyspark)
[ https://issues.apache.org/jira/browse/SPARK-18579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889120#comment-15889120 ] Adrian Bridgett commented on SPARK-18579: - Yep, that's right. Sorry - not sure why I didn't reply to your initial comment! > spark-csv strips whitespace (pyspark) > -- > > Key: SPARK-18579 > URL: https://issues.apache.org/jira/browse/SPARK-18579 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.2 >Reporter: Adrian Bridgett >Priority: Minor > > ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace are supported on CSV > reader (and defaults to false). > However these are not supported options on the CSV writer and so the library > defaults take place which strips the whitespace. > I think it would make the most sense if the writer semantics matched the > reader (and did not alter the data) however this would be a change in > behaviour. In any case it'd be great to have the _option_ to strip or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18579) spark-csv strips whitespace (pyspark)
[ https://issues.apache.org/jira/browse/SPARK-18579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889117#comment-15889117 ] Hyukjin Kwon commented on SPARK-18579: -- Oh, I overlooked. You meant it always strips the white spaces when writing out via CSV datasource? > spark-csv strips whitespace (pyspark) > -- > > Key: SPARK-18579 > URL: https://issues.apache.org/jira/browse/SPARK-18579 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.2 >Reporter: Adrian Bridgett >Priority: Minor > > ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace are supported on CSV > reader (and defaults to false). > However these are not supported options on the CSV writer and so the library > defaults take place which strips the whitespace. > I think it would make the most sense if the writer semantics matched the > reader (and did not alter the data) however this would be a change in > behaviour. In any case it'd be great to have the _option_ to strip or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19713) saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-19713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889114#comment-15889114 ] Hyukjin Kwon commented on SPARK-19713: -- Hi [~balaramraju], could you update the title? > saveAsTable > --- > > Key: SPARK-19713 > URL: https://issues.apache.org/jira/browse/SPARK-19713 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Balaram R Gadiraju > > Hi, > I just observed that when we use dataframe.saveAsTable("table") -- in old > versions -- or dataframe.write.saveAsTable("table") -- in newer versions -- > i.e. the method "df3.saveAsTable("brokentable")" in > Scala code, Spark creates a folder in HDFS but doesn't tell the Hive metastore > that it plans to create the table. So if anything goes wrong in between, the > folder still exists while Hive is not aware of it. This blocks users from > creating the table "brokentable" since the folder already exists; the folder > can be removed with "hadoop fs -rmr > /data/hive/databases/testdb.db/brokentable". Below is a workaround which > enables you to continue development work. > Current code: > val df3 = sqlContext.sql("select * from testtable") > df3.saveAsTable("brokentable") > The workaround: > registering the DataFrame as a temp table and then using a SQL command to load > the data resolves the issue, e.g.: > val df3 = sqlContext.sql("select * from testtable").registerTempTable("df3") > sqlContext.sql("CREATE TABLE brokentable AS SELECT * FROM df3")
[jira] [Resolved] (SPARK-19729) Strange behaviour with reading csv with schema into dataframe
[ https://issues.apache.org/jira/browse/SPARK-19729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-19729. -- Resolution: Invalid I am resolving this as {{Invalid}}. Please reopen this if I was wrong to update the JIRA description. I could not see what the problem is here. > Strange behaviour with reading csv with schema into dataframe > - > > Key: SPARK-19729 > URL: https://issues.apache.org/jira/browse/SPARK-19729 > Project: Spark > Issue Type: Bug > Components: Java API, SQL >Affects Versions: 2.0.1 >Reporter: Mazen Melouk > > I have the following schema: > [{first,string_type,false} > ,{second,string_type,false} > ,{third,string_type,false} > ,{fourth,string_type,false}] > Example line: > var1,var2,, > When accessing the row I get the following: > row.size = 4 > row.fieldIndex(third_string) = 2 > row.getAs(third_string) = var2 > row.get(2) = var2 > print(row) = var1,var2 > Any idea why the null values are missing?
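For comparison, a plain CSV parse of the example line does keep all four fields: the trailing empty strings are what a reader surfaces as nulls, so a shorter-looking printed row does not by itself mean fields were dropped. A minimal sketch with Python's csv module, independent of Spark:

```python
import csv
import io

# "var1,var2,," has two trailing empty fields; a conforming CSV parser
# still reports four columns, matching the four-field schema above.
row = next(csv.reader(io.StringIO("var1,var2,,\n")))
print(len(row))  # 4
print(row)       # ['var1', 'var2', '', '']
```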
[jira] [Resolved] (SPARK-16846) read.csv() option: "inferSchema" don't work
[ https://issues.apache.org/jira/browse/SPARK-16846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-16846. -- Resolution: Not A Problem If the schema is given, it does not infer the schema. > read.csv() option: "inferSchema" don't work > - > > Key: SPARK-16846 > URL: https://issues.apache.org/jira/browse/SPARK-16846 > Project: Spark > Issue Type: Bug >Reporter: hejie > > I use the code below to read a file and get a dataframe. When the column > count is only 20, the inferSchema parameter works well. But when the count is > up to 400, it doesn't work, and I have to specify the schema manually. The > code is: > val df = > spark.read.schema(schema).options(Map("header"->"true","quote"->",","inferSchema"->"true")).csv("/Users/ss/Documents/traindata/traindataAllNumber.csv")
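The resolution follows from option precedence: an explicitly supplied schema wins, so inferSchema is never consulted. A toy sketch of that precedence rule, using a hypothetical helper rather than Spark's actual reader internals:

```python
def resolve_schema(user_schema, infer_schema_option, infer):
    """A user-supplied schema wins; inference runs only when none is given."""
    if user_schema is not None:
        return user_schema   # the inferSchema option is ignored here
    if infer_schema_option:
        return infer()       # only now do we pay for an inference pass
    return "all-string"      # hypothetical fallback: treat columns as strings

print(resolve_schema("explicit", True, lambda: "inferred"))  # explicit
print(resolve_schema(None, True, lambda: "inferred"))        # inferred
```

In the reported code, .schema(schema) takes the first branch, so inference is skipped regardless of how many columns the file has.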
[jira] [Commented] (SPARK-17495) Hive hash implementation
[ https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889090#comment-15889090 ] Tejas Patil commented on SPARK-17495: - [~rxin]: >> 1. On the read side we shouldn't care which hash function to use. All we >> need to know is that the data is hash partitioned by some hash function, and >> that should be sufficient to remove the shuffle needed in aggregation or >> join. For joins, if one side is pre-hashed (due to bucketing) and the other is not, then the non-hashed side needs to be shuffled with the _same_ hashing function as the pre-hashed one; otherwise the output would be wrong. Alternatively, we can shuffle both sides, but that won't utilize the benefit of bucketing. > Hive hash implementation > > > Key: SPARK-17495 > URL: https://issues.apache.org/jira/browse/SPARK-17495 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > Fix For: 2.2.0 > > > Spark internally uses Murmur3Hash for partitioning. This is different from > the one used by Hive. For queries which use bucketing, this leads to different > results if one tries the same query on both engines. We want users to have > backward compatibility so that one can switch parts of applications across > the engines without observing regressions.
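Why the non-bucketed side must be shuffled with the same hash function can be seen with a toy partitioner. hash_a and hash_b below are stand-ins for two different engines' hash functions (not the real Hive hash or Murmur3); any key they disagree on would land in different buckets on the two join sides:

```python
# Two different hash functions can send the same key to different
# buckets; a bucket-wise join would then silently miss that key,
# which is the wrong output the comment warns about.

NUM_BUCKETS = 4

def hash_a(key):  # stand-in for the bucketed (pre-hashed) side
    return sum(ord(c) for c in key) % NUM_BUCKETS

def hash_b(key):  # stand-in for a different engine's hash function
    return (sum(ord(c) for c in key) * 31 + 7) % NUM_BUCKETS

keys = ["apple", "banana", "cherry"]
mismatched = [k for k in keys if hash_a(k) != hash_b(k)]
print(mismatched)  # keys that a naive bucket-wise join would drop
```

Shuffling the non-bucketed side with hash_a (the pre-hashed side's function), or shuffling both sides with one function, restores correct co-partitioning; only the former keeps the benefit of bucketing.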
[jira] [Resolved] (SPARK-19373) Mesos implementation of spark.scheduler.minRegisteredResourcesRatio looks at acquired cores rather than registerd cores
[ https://issues.apache.org/jira/browse/SPARK-19373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19373. --- Resolution: Fixed Assignee: Michael Gummelt Fix Version/s: 2.2.0 > Mesos implementation of spark.scheduler.minRegisteredResourcesRatio looks at > acquired cores rather than registerd cores > --- > > Key: SPARK-19373 > URL: https://issues.apache.org/jira/browse/SPARK-19373 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.3, 2.0.2, 2.1.0 >Reporter: Michael Gummelt >Assignee: Michael Gummelt > Fix For: 2.2.0 > > > We're currently using `totalCoresAcquired` to account for registered > resources, which is incorrect. That variable measures the number of cores > the scheduler has accepted. We should be using `totalCoreCount` like the > other schedulers do. > Fixing this is important for locality, since users often want to wait for all > executors to come up before scheduling tasks to ensure they get a node-local > placement. > original PR to add support: https://github.com/apache/spark/pull/8672/files -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19373) Mesos implementation of spark.scheduler.minRegisteredResourcesRatio looks at acquired cores rather than registerd cores
[ https://issues.apache.org/jira/browse/SPARK-19373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889084#comment-15889084 ] Sean Owen commented on SPARK-19373: --- Resolved by https://github.com/apache/spark/pull/17045 > Mesos implementation of spark.scheduler.minRegisteredResourcesRatio looks at > acquired cores rather than registerd cores > --- > > Key: SPARK-19373 > URL: https://issues.apache.org/jira/browse/SPARK-19373 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.3, 2.0.2, 2.1.0 >Reporter: Michael Gummelt >Assignee: Michael Gummelt > Fix For: 2.2.0 > > > We're currently using `totalCoresAcquired` to account for registered > resources, which is incorrect. That variable measures the number of cores > the scheduler has accepted. We should be using `totalCoreCount` like the > other schedulers do. > Fixing this is important for locality, since users often want to wait for all > executors to come up before scheduling tasks to ensure they get a node-local > placement. > original PR to add support: https://github.com/apache/spark/pull/8672/files -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14503) spark.ml Scala API for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889085#comment-15889085 ] Joseph K. Bradley commented on SPARK-14503: --- Sorry for the slow reply. I actually haven't read enough about PrefixSpan to say, and the original paper doesn't seem to cover rule generation. We're going with PrefixSpanModel currently. The benefit from combining the models seems pretty small given the unknowns here, and if they can share implementations, we can do so internally in the future. > spark.ml Scala API for FPGrowth > --- > > Key: SPARK-14503 > URL: https://issues.apache.org/jira/browse/SPARK-14503 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: yuhao yang > > This task is the first port of spark.mllib.fpm functionality to spark.ml > (Scala). > This will require a brief design doc to confirm a reasonable DataFrame-based > API, with details for this class. The doc could also look ahead to the other > fpm classes, especially if their API decisions will affect FPGrowth. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16512) No way to load CSV data without dropping whole rows when some of data is not matched with given schema
[ https://issues.apache.org/jira/browse/SPARK-16512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-16512. -- Resolution: Duplicate > No way to load CSV data without dropping whole rows when some of data is not > matched with given schema > -- > > Key: SPARK-16512 > URL: https://issues.apache.org/jira/browse/SPARK-16512 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Currently, there is no way to read CSV data without dropping whole rows when > some of data is not matched with given schema. > It seems there are some usecases as below: > {code} > a,b > 1,c > {code} > Here, {{a}} can be a dirty data in real usecases. > But codes below: > {code} > val path = "/tmp/test.csv" > val schema = StructType( > StructField("a", IntegerType, nullable = true) :: > StructField("b", StringType, nullable = true) :: Nil > val df = spark.read > .format("csv") > .option("mode", "PERMISSIVE") > .schema(schema) > .load(path) > df.show() > {code} > emits the exception below: > {code} > java.lang.NumberFormatException: For input string: "a" > at > java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) > at java.lang.Integer.parseInt(Integer.java:580) > at java.lang.Integer.parseInt(Integer.java:615) > at > scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272) > at scala.collection.immutable.StringOps.toInt(StringOps.scala:29) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:244) > {code} > With {{DROPMALFORM}} and {{FAILFAST}}, it will be dropped or failed with an > exception. 
> FYI, this is not the case for JSON because JSON data sources can handle this > with {{PERMISSIVE}} mode as below: > {code} > val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : \"a\"}")) > val schema = StructType(StructField("a", IntegerType, nullable = true) :: Nil) > spark.read.option("mode", "PERMISSIVE").schema(schema).json(rdd).show() > {code} > {code} > ++ > | a| > ++ > | 1| > |null| > ++ > {code} > Please refer https://github.com/databricks/spark-csv/pull/298 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
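The JSON behaviour shown above amounts to "try the cast, emit null on failure", and the request is the same per-field semantics for CSV's PERMISSIVE mode. A sketch of that rule in plain Python (the helper name is made up for illustration):

```python
def permissive_cast(raw, caster):
    """Return the cast value, or None (null) when the raw field is malformed."""
    try:
        return caster(raw)
    except (ValueError, TypeError):
        return None

# Mirrors the example above: the schema declares column "a" as an integer.
print([permissive_cast(r, int) for r in ["1", "a"]])  # [1, None]
```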
[jira] [Assigned] (SPARK-19769) Quickstart self-contained application instructions do not work with current sbt
[ https://issues.apache.org/jira/browse/SPARK-19769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-19769: - Assignee: Michael McCune > Quickstart self-contained application instructions do not work with current > sbt > --- > > Key: SPARK-19769 > URL: https://issues.apache.org/jira/browse/SPARK-19769 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Michael McCune >Assignee: Michael McCune >Priority: Trivial > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > The current quickstart instructions for the "Self-Contained Applications" > instructs the user to create a file named {{simple.sbt}} to instruct the > build tooling. With current versions of sbt(ie 1.0) however, the tooling does > not recognize the {{simple.sbt}} file as it is looking for a file named > {{build.sbt}}. > When following the quickstart instructions, I see the following output: > {noformat} > $ find . > . > ./simple.sbt > ./src > ./src/main > ./src/main/scala > ./src/main/scala/SimpleApp.scala > [mike@ultra] master ~/workspace/sandbox/SimpleApp > $ sbt package > /home/mike/workspace/sandbox/SimpleApp doesn't appear to be an sbt project. > If you want to start sbt anyway, run: > /home/mike/bin/sbt -sbt-create > {noformat} > Changing the filename to {{build.sbt}} produces a valid build: > {noformat} > $ mv simple.sbt build.sbt > [mike@ultra] master ~/workspace/sandbox/SimpleApp > $ sbt package > [info] Set current project to Simple Project (in build > file:/home/mike/workspace/sandbox/SimpleApp/) > [info] Updating {file:/home/mike/workspace/sandbox/SimpleApp/}simpleapp... > [info] Resolving jline#jline;2.12.1 ... > [info] Done updating. > [info] Compiling 1 Scala source to > /home/mike/workspace/sandbox/SimpleApp/target/scala-2.11/classes... > [info] Packaging > /home/mike/workspace/sandbox/SimpleApp/target/scala-2.11/simple-project_2.11-1.0.jar > ... > [info] Done packaging. 
> [success] Total time: 10 s, completed Feb 28, 2017 10:01:58 AM > {noformat} > I think the documentation just needs to be changed to reflect the new > filename. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19521) Error with embedded line break (multi-line record) in csv file.
[ https://issues.apache.org/jira/browse/SPARK-19521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-19521. -- Resolution: Duplicate I am resolving this as a duplicate of SPARK-19610 as that one has a PR merged. > Error with embedded line break (multi-line record) in csv file. > --- > > Key: SPARK-19521 > URL: https://issues.apache.org/jira/browse/SPARK-19521 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 > Environment: $ uname -a > Linux jupyter-test 3.19.0-25-generic #26~14.04.1-Ubuntu SMP Fri Jul 24 > 21:16:20 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Ruslan Korniichuk > > Input csv file: > id,name,amount,isActive,Remark > 1,Barney & Company,0,,"Great to work with > and always pays with cash." > Output json file with spark-2.0.2-bin-hadoop2.7: > {"id":"1","name":"Barney & Company","amount":0,"Remark":"Great to work with > \nand always pays with cash."} > Error with spark-2.1.0-bin-hadoop2.7: > java.lang.RuntimeException: Malformed line in FAILFAST mode: and always pays > with cash." > 17/02/08 22:53:02 ERROR Utils: Aborting task > java.lang.RuntimeException: Malformed line in FAILFAST mode: and always pays > with cash." 
> at > org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:106) > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) > at > 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 17/02/08 22:53:02 ERROR FileFormatWriter: Job job_20170208225302_0003 aborted. > 17/02/08 22:53:02 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 4) > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at
[jira] [Resolved] (SPARK-17225) Support multiple null values in csv files
[ https://issues.apache.org/jira/browse/SPARK-17225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-17225. -- Resolution: Duplicate I am resolving this as a duplicate because that JIRA has a PR. > Support multiple null values in csv files > - > > Key: SPARK-17225 > URL: https://issues.apache.org/jira/browse/SPARK-17225 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Robert Kruszewski >Priority: Minor > > Since we're dealing with strings it's useful to have multiple different > representations of null values as data might not be fully normalized -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14194) spark csv reader not working properly if CSV content contains CRLF character (newline) in the intermediate cell
[ https://issues.apache.org/jira/browse/SPARK-14194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-14194. -- Resolution: Duplicate I proposed to solve this via {{wholeFile}} option and it seems merged. I am resolving this as a duplicate of that as that one has a PR. > spark csv reader not working properly if CSV content contains CRLF character > (newline) in the intermediate cell > --- > > Key: SPARK-14194 > URL: https://issues.apache.org/jira/browse/SPARK-14194 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 2.1.0 >Reporter: Kumaresh C R > > We have CSV content like below, > Sl.NO, Employee_Name, Company, Address, Country, ZIP_Code\n\r > "1", "ABCD", "XYZ", "1234", "XZ Street \n\r(CRLF charater), > Municapality,","USA", "1234567" > Since there is a '\n\r' character in the row middle (to be exact in the > Address Column), when we execute the below spark code, it tries to create the > dataframe with two rows (excluding header row), which is wrong. Since we have > specified delimiter as quote (") character , why it takes the middle > character as newline character ? This creates an issue while processing the > created dataframe. > DataFrame df = > sqlContextManager.getSqlContext().read().format("com.databricks.spark.csv") > .option("header", "true") > .option("inferSchema", "true") > .option("delimiter", delim) > .option("quote", quote) > .option("escape", escape) > .load(sourceFile); > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
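As a point of reference for the expected behaviour, a quote-aware CSV parser keeps an embedded line break inside the cell instead of starting a new record. A minimal demonstration with Python's csv module, independent of Spark's wholeFile handling; the sample data is adapted from the description:

```python
import csv
import io

# The Address cell contains a newline inside quotes, as in the report.
data = ('Sl.NO,Employee_Name,Company,Address,Country,ZIP_Code\n'
        '"1","ABCD","XYZ","XZ Street\nMunicipality","USA","1234567"\n')

rows = list(csv.reader(io.StringIO(data)))
print(len(rows))   # 2: the header plus one record, not three records
print(rows[1][3])  # the Address cell keeps its embedded line break
```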
[jira] [Resolved] (SPARK-19769) Quickstart self-contained application instructions do not work with current sbt
[ https://issues.apache.org/jira/browse/SPARK-19769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19769. --- Resolution: Fixed Fix Version/s: 2.2.0 2.0.3 2.1.1 Issue resolved by pull request 17101 [https://github.com/apache/spark/pull/17101] > Quickstart self-contained application instructions do not work with current > sbt > --- > > Key: SPARK-19769 > URL: https://issues.apache.org/jira/browse/SPARK-19769 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Michael McCune >Priority: Trivial > Fix For: 2.1.1, 2.0.3, 2.2.0 > > > The current quickstart instructions for the "Self-Contained Applications" > section instruct the user to create a file named {{simple.sbt}} to configure the > build tooling. With current versions of sbt (i.e. 1.0), however, the tooling does > not recognize the {{simple.sbt}} file as it is looking for a file named > {{build.sbt}}. > When following the quickstart instructions, I see the following output: > {noformat} > $ find . > . > ./simple.sbt > ./src > ./src/main > ./src/main/scala > ./src/main/scala/SimpleApp.scala > [mike@ultra] master ~/workspace/sandbox/SimpleApp > $ sbt package > /home/mike/workspace/sandbox/SimpleApp doesn't appear to be an sbt project. > If you want to start sbt anyway, run: > /home/mike/bin/sbt -sbt-create > {noformat} > Changing the filename to {{build.sbt}} produces a valid build: > {noformat} > $ mv simple.sbt build.sbt > [mike@ultra] master ~/workspace/sandbox/SimpleApp > $ sbt package > [info] Set current project to Simple Project (in build > file:/home/mike/workspace/sandbox/SimpleApp/) > [info] Updating {file:/home/mike/workspace/sandbox/SimpleApp/}simpleapp... > [info] Resolving jline#jline;2.12.1 ... > [info] Done updating. > [info] Compiling 1 Scala source to > /home/mike/workspace/sandbox/SimpleApp/target/scala-2.11/classes... 
> [info] Packaging > /home/mike/workspace/sandbox/SimpleApp/target/scala-2.11/simple-project_2.11-1.0.jar > ... > [info] Done packaging. > [success] Total time: 10 s, completed Feb 28, 2017 10:01:58 AM > {noformat} > I think the documentation just needs to be changed to reflect the new > filename. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
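For reference, a minimal {{build.sbt}} matching the quickstart layout might look like the following; the Scala and Spark version numbers are illustrative assumptions, not taken from the quickstart itself, and should match the installed Spark.

```scala
// Minimal build.sbt for the quickstart's self-contained application.
// Version numbers are examples -- align them with your Spark installation.
name := "Simple Project"

version := "1.0"

scalaVersion := "2.11.8"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0"
```

With this file named {{build.sbt}} at the project root, {{sbt package}} produces the jar shown in the log above; only the filename, not the contents, is what the tooling rejects.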
[jira] [Comment Edited] (SPARK-19764) Executors hang with supposedly running tasks that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889067#comment-15889067 ] Ari Gesher edited comment on SPARK-19764 at 2/28/17 11:04 PM: -- That was the log in the application directory on the driver machine. The other log was from the SPARK_LOG_DIR on one of the workers (*.112, as referenced in the log snippets and WebUI in the body of the comments) We're working on a repro to get you a stack trace. was (Author: agesher): That was the log in the application directory on the driver machine. We're working on a repro to get you a stack trace. > Executors hang with supposedly running task that are really finished. > - > > Key: SPARK-19764 > URL: https://issues.apache.org/jira/browse/SPARK-19764 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 LTS > OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > Spark 2.0.2 - Spark Cluster Manager >Reporter: Ari Gesher > Attachments: driver-log-stderr.log, executor-2.log > > > We've come across a job that won't finish. Running on a six-node cluster, > each of the executors end up with 5-7 tasks that are never marked as > completed. 
> Here's an excerpt from the web UI: > ||Index ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch > Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result > Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read > Size / Records||Errors|| > |105 | 1131 | 0 | SUCCESS |PROCESS_LOCAL |4 / 172.31.24.171 | > 2017/02/27 22:51:36 | 1.9 min | 9 ms | 4 ms | 0.7 s | 2 ms| 6 ms| > 384.1 MB| 90.3 MB / 572 | | > |106| 1168| 0| RUNNING |ANY| 2 / 172.31.16.112| 2017/02/27 > 22:53:25|6.5 h |0 ms| 0 ms| 1 s |0 ms| 0 ms| |384.1 MB > |98.7 MB / 624 | | > However, the Executor reports the task as finished: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > As does the driver log: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > Full log from this executor and the {{stderr}} from > {{app-20170227223614-0001/2/stderr}} attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17224) Support skipping multiple header rows in csv
[ https://issues.apache.org/jira/browse/SPARK-17224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-17224. -- Resolution: Duplicate Multiple header lines can now be handled with the {{wholeFile}} option. Let me resolve this as a duplicate for now. > Support skipping multiple header rows in csv > > > Key: SPARK-17224 > URL: https://issues.apache.org/jira/browse/SPARK-17224 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Robert Kruszewski >Priority: Minor > > Headers can be multiline, and sometimes you want to skip multiple rows because > of the format you've been given -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
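As a sketch of what "skip multiple header rows" means in practice, the snippet below drops the first N rows before use. This is stdlib Python with a made-up helper name, not Spark's API; Spark's {{header}} option only consumes a single line.

```python
import csv
import io

# Sketch of "skip the first N rows" pre-processing (helper name is made up).
def read_skipping(text, skip_rows):
    """Parse CSV text, dropping the first skip_rows rows."""
    rows = list(csv.reader(io.StringIO(text)))
    return rows[skip_rows:]

data = "report generated 2016-08-25\nunits: USD\nid,amount\n1,10\n"
rows = read_skipping(data, 2)  # drop the two banner rows
# rows == [['id', 'amount'], ['1', '10']]
```

A native option would let the data source do this skip before schema inference, which is the convenience the issue requests.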
[jira] [Commented] (SPARK-19764) Executors hang with supposedly running tasks that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889067#comment-15889067 ] Ari Gesher commented on SPARK-19764: That was the log in the application directory on the driver machine. We're working on a repro to get you a stack trace. > Executors hang with supposedly running task that are really finished. > - > > Key: SPARK-19764 > URL: https://issues.apache.org/jira/browse/SPARK-19764 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 LTS > OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > Spark 2.0.2 - Spark Cluster Manager >Reporter: Ari Gesher > Attachments: driver-log-stderr.log, executor-2.log > > > We've come across a job that won't finish. Running on a six-node cluster, > each of the executors end up with 5-7 tasks that are never marked as > completed. > Here's an excerpt from the web UI: > ||Index ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch > Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result > Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read > Size / Records||Errors|| > |105 | 1131 | 0 | SUCCESS |PROCESS_LOCAL |4 / 172.31.24.171 | > 2017/02/27 22:51:36 | 1.9 min | 9 ms | 4 ms | 0.7 s | 2 ms| 6 ms| > 384.1 MB| 90.3 MB / 572 | | > |106| 1168| 0| RUNNING |ANY| 2 / 172.31.16.112| 2017/02/27 > 22:53:25|6.5 h |0 ms| 0 ms| 1 s |0 ms| 0 ms| |384.1 MB > |98.7 MB / 624 | | > However, the Executor reports the task as finished: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager) > {noformat} > As does the driver log: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > Full log from this executor and the {{stderr}} from > {{app-20170227223614-0001/2/stderr}} attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16102) Use Record API from Univocity rather than current data cast API.
[ https://issues.apache.org/jira/browse/SPARK-16102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889058#comment-15889058 ] Hyukjin Kwon commented on SPARK-16102: -- Yes, let me check out this API and other APIs too. Let me try to update this JIRA within this week. > Use Record API from Univocity rather than current data cast API. > > > Key: SPARK-16102 > URL: https://issues.apache.org/jira/browse/SPARK-16102 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > There is a Record API for the Univocity parser. > This API provides typed data. Spark currently tries to compare and cast each > value. > Using this library should reduce the code in Spark and maybe improve > performance. > A benchmark should be conducted first. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16103) Share a single Row for CSV data source rather than creating every time
[ https://issues.apache.org/jira/browse/SPARK-16103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-16103. -- Resolution: Duplicate yup, fixed in https://github.com/apache/spark/pull/16669 > Share a single Row for CSV data source rather than creating every time > -- > > Key: SPARK-16103 > URL: https://issues.apache.org/jira/browse/SPARK-16103 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > Currently, the CSV data source creates a new Row for each record. > A single row can be shared, just like in the other data sources. > It is somewhat unrelated, but it would be great if some improvements were made > in terms of performance. This was suggested in > https://github.com/apache/spark/pull/12268. > - https://github.com/apache/spark/pull/12268#discussion_r61265122 > - https://github.com/apache/spark/pull/12268#discussion_r61270381 > - https://github.com/apache/spark/pull/12268#discussion_r61271158 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18389) Disallow cyclic view reference
[ https://issues.apache.org/jira/browse/SPARK-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889047#comment-15889047 ] Wenchen Fan commented on SPARK-18389: - [~jiangxb1987] are you working on it? > Disallow cyclic view reference > -- > > Key: SPARK-18389 > URL: https://issues.apache.org/jira/browse/SPARK-18389 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > The following should not be allowed: > {code} > CREATE VIEW testView AS SELECT id FROM jt > CREATE VIEW testView2 AS SELECT id FROM testView > ALTER VIEW testView AS SELECT * FROM testView2 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
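A generic way to enforce this is to check, before applying the new definition, whether the altered view would become reachable from itself through the view-dependency graph. The sketch below is plain Python over a made-up dependency map, not Spark's analyzer code; all names are hypothetical.

```python
# Generic cycle check over a view-dependency map (view -> views it reads
# from). A sketch of the validation this issue asks for, not Spark's code.
def creates_cycle(deps, view, new_refs):
    """Would redefining `view` to read from `new_refs` introduce a cycle?"""
    proposed = dict(deps)
    proposed[view] = set(new_refs)

    def reachable(start, target, seen):
        if start == target:
            return True
        if start in seen:
            return False
        seen.add(start)
        return any(reachable(n, target, seen) for n in proposed.get(start, ()))

    # A cycle exists if the altered view can reach itself via its new refs.
    return any(reachable(r, view, set()) for r in new_refs)

deps = {"testView2": {"testView"}}  # CREATE VIEW testView2 AS ... FROM testView
# ALTER VIEW testView AS SELECT * FROM testView2 would close the loop, so
# creates_cycle(deps, "testView", {"testView2"}) should report True.
```

In the scenario from the issue, the ALTER VIEW statement is exactly the point where such a check would reject the cyclic definition.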
[jira] [Commented] (SPARK-19211) Explicitly prevent Insert into View or Create View As Insert
[ https://issues.apache.org/jira/browse/SPARK-19211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889048#comment-15889048 ] Wenchen Fan commented on SPARK-19211: - Hi [~jiangxb1987] are you working on it? > Explicitly prevent Insert into View or Create View As Insert > > > Key: SPARK-19211 > URL: https://issues.apache.org/jira/browse/SPARK-19211 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Jiang Xingbo > > Currently we don't explicitly forbid the following behaviors: > 1. The statement CREATE VIEW AS INSERT INTO throws the following exception > from SQLBuilder: > `java.lang.UnsupportedOperationException: unsupported plan InsertIntoTable > MetastoreRelation default, tbl, false, false`; > 2. The statement INSERT INTO view VALUES throws the following exception from > checkAnalysis: > `Error in query: Inserting into an RDD-based table is not allowed.;;` > We should check for these behaviors earlier and explicitly prevent them. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19775) Remove an obsolete `partitionBy().insertInto()` test case
[ https://issues.apache.org/jira/browse/SPARK-19775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-19775: -- Description: This issue removes [a test case|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L287-L298] which was introduced by [SPARK-14459|https://github.com/apache/spark/commit/652bbb1bf62722b08a062c7a2bf72019f85e179e] and was superseded by [SPARK-16033|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L365-L371]. Basically, we cannot use `partitionBy` and `insertInto` together. {code} test("Reject partitioning that does not match table") { withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) { sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)") val data = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" else "odd")) .toDF("id", "data", "part") intercept[AnalysisException] { // cannot partition by 2 fields when there is only one in the table definition data.write.partitionBy("part", "data").insertInto("partitioned") } } } {code} was: This issue removes [a test case|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L287-L298] which was introduced by [SPARK-14459|https://github.com/apache/spark/commit/10b671447bc04af250cbd8a7ea86f2769147a78a] and was superseded by [SPARK-16033|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L365-L371]. Basically, we cannot use `partitionBy` and `insertInto` together. 
{code} test("Reject partitioning that does not match table") { withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) { sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)") val data = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" else "odd")) .toDF("id", "data", "part") intercept[AnalysisException] { // cannot partition by 2 fields when there is only one in the table definition data.write.partitionBy("part", "data").insertInto("partitioned") } } } {code} > Remove an obsolete `partitionBy().insertInto()` test case > - > > Key: SPARK-19775 > URL: https://issues.apache.org/jira/browse/SPARK-19775 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.1.0 >Reporter: Dongjoon Hyun >Priority: Trivial > > This issue removes [a test > case|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L287-L298] > which was introduced by > [SPARK-14459|https://github.com/apache/spark/commit/652bbb1bf62722b08a062c7a2bf72019f85e179e] > and was superseded by > [SPARK-16033|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L365-L371]. > Basically, we cannot use `partitionBy` and `insertInto` together. 
> {code} > test("Reject partitioning that does not match table") { > withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) { > sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY > (part string)") > val data = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" > else "odd")) > .toDF("id", "data", "part") > intercept[AnalysisException] { > // cannot partition by 2 fields when there is only one in the table > definition > data.write.partitionBy("part", "data").insertInto("partitioned") > } > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16103) Share a single Row for CSV data source rather than creating every time
[ https://issues.apache.org/jira/browse/SPARK-16103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889042#comment-15889042 ] Wenchen Fan commented on SPARK-16103: - seems it's already fixed? > Share a single Row for CSV data source rather than creating every time > -- > > Key: SPARK-16103 > URL: https://issues.apache.org/jira/browse/SPARK-16103 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > Currently, the CSV data source creates a new Row for each record. > A single row can be shared, just like in the other data sources. > It is somewhat unrelated, but it would be great if some improvements were made > in terms of performance. This was suggested in > https://github.com/apache/spark/pull/12268. > - https://github.com/apache/spark/pull/12268#discussion_r61265122 > - https://github.com/apache/spark/pull/12268#discussion_r61270381 > - https://github.com/apache/spark/pull/12268#discussion_r61271158 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16103) Share a single Row for CSV data source rather than creating every time
[ https://issues.apache.org/jira/browse/SPARK-16103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889038#comment-15889038 ] Wenchen Fan commented on SPARK-16103: - Hi [~hyukjin.kwon], are you working on it? > Share a single Row for CSV data source rather than creating every time > -- > > Key: SPARK-16103 > URL: https://issues.apache.org/jira/browse/SPARK-16103 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > Currently, the CSV data source creates a new Row for each record. > A single row can be shared, just like in the other data sources. > It is somewhat unrelated, but it would be great if some improvements were made > in terms of performance. This was suggested in > https://github.com/apache/spark/pull/12268. > - https://github.com/apache/spark/pull/12268#discussion_r61265122 > - https://github.com/apache/spark/pull/12268#discussion_r61270381 > - https://github.com/apache/spark/pull/12268#discussion_r61271158 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16102) Use Record API from Univocity rather than current data cast API.
[ https://issues.apache.org/jira/browse/SPARK-16102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889036#comment-15889036 ] Wenchen Fan commented on SPARK-16102: - Hi [~hyukjin.kwon], are you working on it? > Use Record API from Univocity rather than current data cast API. > > > Key: SPARK-16102 > URL: https://issues.apache.org/jira/browse/SPARK-16102 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > There is a Record API for the Univocity parser. > This API provides typed data. Spark currently tries to compare and cast each > value. > Using this library should reduce the code in Spark and maybe improve > performance. > A benchmark should be conducted first. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19774) StreamExecution should call stop() on sources when a stream fails
[ https://issues.apache.org/jira/browse/SPARK-19774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19774: Assignee: Burak Yavuz (was: Apache Spark) > StreamExecution should call stop() on sources when a stream fails > - > > Key: SPARK-19774 > URL: https://issues.apache.org/jira/browse/SPARK-19774 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz > > We call stop() on a Structured Streaming Source only when the stream is > shut down by a user calling streamingQuery.stop(). We should actually stop all > sources when the stream fails as well, otherwise we may leak resources, e.g. > connections to Kafka. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19774) StreamExecution should call stop() on sources when a stream fails
[ https://issues.apache.org/jira/browse/SPARK-19774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19774: Assignee: Apache Spark (was: Burak Yavuz) > StreamExecution should call stop() on sources when a stream fails > - > > Key: SPARK-19774 > URL: https://issues.apache.org/jira/browse/SPARK-19774 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Burak Yavuz >Assignee: Apache Spark > > We call stop() on a Structured Streaming Source only when the stream is > shut down by a user calling streamingQuery.stop(). We should actually stop all > sources when the stream fails as well, otherwise we may leak resources, e.g. > connections to Kafka. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19775) Remove an obsolete `partitionBy().insertInto()` test case
[ https://issues.apache.org/jira/browse/SPARK-19775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-19775: -- Description: This issue removes [a test case|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L287-L298] which was introduced by [SPARK-14459|https://github.com/apache/spark/commit/10b671447bc04af250cbd8a7ea86f2769147a78a] and was superseded by [SPARK-16033|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L365-L371]. Basically, we cannot use `partitionBy` and `insertInto` together. {code} test("Reject partitioning that does not match table") { withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) { sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)") val data = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" else "odd")) .toDF("id", "data", "part") intercept[AnalysisException] { // cannot partition by 2 fields when there is only one in the table definition data.write.partitionBy("part", "data").insertInto("partitioned") } } } {code} was: This issue removes [a test case|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L287-L298] which was introduced by [SPARK-16033|https://github.com/apache/spark/commit/10b671447bc04af250cbd8a7ea86f2769147a78a] and was superseded by [SPARK-14459|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L365-L371]. Basically, we cannot use `partitionBy` and `insertInto` together. 
{code} test("Reject partitioning that does not match table") { withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) { sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)") val data = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" else "odd")) .toDF("id", "data", "part") intercept[AnalysisException] { // cannot partition by 2 fields when there is only one in the table definition data.write.partitionBy("part", "data").insertInto("partitioned") } } } {code} > Remove an obsolete `partitionBy().insertInto()` test case > - > > Key: SPARK-19775 > URL: https://issues.apache.org/jira/browse/SPARK-19775 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.1.0 >Reporter: Dongjoon Hyun >Priority: Trivial > > This issue removes [a test > case|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L287-L298] > which was introduced by > [SPARK-14459|https://github.com/apache/spark/commit/10b671447bc04af250cbd8a7ea86f2769147a78a] > and was superseded by > [SPARK-16033|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L365-L371]. > Basically, we cannot use `partitionBy` and `insertInto` together. 
> {code} > test("Reject partitioning that does not match table") { > withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) { > sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY > (part string)") > val data = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" > else "odd")) > .toDF("id", "data", "part") > intercept[AnalysisException] { > // cannot partition by 2 fields when there is only one in the table > definition > data.write.partitionBy("part", "data").insertInto("partitioned") > } > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19764) Executors hang with supposedly running tasks that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889026#comment-15889026 ] Shixiong Zhu commented on SPARK-19764: -- [~agesher] driver-log-stderr.log is actually the executor log. Did you upload a wrong file? > Executors hang with supposedly running task that are really finished. > - > > Key: SPARK-19764 > URL: https://issues.apache.org/jira/browse/SPARK-19764 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 LTS > OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > Spark 2.0.2 - Spark Cluster Manager >Reporter: Ari Gesher > Attachments: driver-log-stderr.log, executor-2.log > > > We've come across a job that won't finish. Running on a six-node cluster, > each of the executors end up with 5-7 tasks that are never marked as > completed. > Here's an excerpt from the web UI: > ||Index ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch > Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result > Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read > Size / Records||Errors|| > |105 | 1131 | 0 | SUCCESS |PROCESS_LOCAL |4 / 172.31.24.171 | > 2017/02/27 22:51:36 | 1.9 min | 9 ms | 4 ms | 0.7 s | 2 ms| 6 ms| > 384.1 MB| 90.3 MB / 572 | | > |106| 1168| 0| RUNNING |ANY| 2 / 172.31.16.112| 2017/02/27 > 22:53:25|6.5 h |0 ms| 0 ms| 1 s |0 ms| 0 ms| |384.1 MB > |98.7 MB / 624 | | > However, the Executor reports the task as finished: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > As does the driver log: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager) > {noformat} > Full log from this executor and the {{stderr}} from > {{app-20170227223614-0001/2/stderr}} attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19774) StreamExecution should call stop() on sources when a stream fails
[ https://issues.apache.org/jira/browse/SPARK-19774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889032#comment-15889032 ] Apache Spark commented on SPARK-19774: -- User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/17107 > StreamExecution should call stop() on sources when a stream fails > - > > Key: SPARK-19774 > URL: https://issues.apache.org/jira/browse/SPARK-19774 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz > > We call stop() on a Structured Streaming Source only when the stream is > shut down by a user calling streamingQuery.stop(). We should actually stop all > sources when the stream fails as well, otherwise we may leak resources, e.g. > connections to Kafka. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14480) Remove meaningless StringIteratorReader for CSV data source for better performance
[ https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889030#comment-15889030 ] Wenchen Fan commented on SPARK-14480: - The regression has been fixed in https://issues.apache.org/jira/browse/SPARK-19610 > Remove meaningless StringIteratorReader for CSV data source for better > performance > -- > > Key: SPARK-14480 > URL: https://issues.apache.org/jira/browse/SPARK-14480 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon > Fix For: 2.1.0 > > > Currently, the CSV data source reads and parses CSV data byte by byte (not line > by line). > In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think > it is made like this for better performance. However, it looks like there are two > problems. > Firstly, it was actually not faster than processing line by line with an > {{Iterator}}, due to the additional logic needed to wrap the {{Iterator}} in a > {{Reader}}. > Secondly, this brought a bit of complexity because additional logic is needed > to allow every line to be read byte by byte. So, it was pretty difficult to > figure out parsing issues (e.g. SPARK-14103). Actually, almost all the code > in {{CSVParser}} might not be needed. > I made a rough patch and tested this. The test results for the first problem > are below: > h4. Results > - Original code with {{Reader}} wrapping {{Iterator}} > ||End-to-end (ns)||Parse Time (ns)|| > | 14116265034 | 2008277960 | > - New code with {{Iterator}} > ||End-to-end (ns)||Parse Time (ns)|| > | 13451699644 | 1549050564 | > In more detail, > h4. Method > - The TPC-H lineitem table is being tested. > - The results are collected for only 100 rows. > - End-to-end tests and parsing time tests are performed 10 times and averages > are calculated for each. > h4. 
Dataset > - [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1 > ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen)]) > - Size : 724.66 MB > h4. Test Codes > - Function to measure time > {code} > def time[A](f: => A) = { > val s = System.nanoTime > val ret = f > println("time: "+(System.nanoTime-s)/1e6+"ms") > ret > } > {code} > - End-to-end test > {code} > val path = "lineitem.tbl" > val df = sqlContext > .read > .format("csv") > .option("header", "false") > .option("delimiter", "|") > .load(path) > time(df.take(100)) > {code} > - Parsing time test for original (in {{BulkCsvParser}}) > {code} > ... > // `reader` is a wrapper for an Iterator. > private val reader = new StringIteratorReader(iter) > parser.beginParsing(reader) > ... > time(parser.parseNext()) > ... > {code} > - Parsing time test for new (in {{BulkCsvParser}}) > {code} > ... > time(parser.parseLine(iter.next())) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
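For readers outside the Spark codebase, the design being removed above can be sketched with a toy analog: an adapter that turns an iterator of lines into a character-stream reader (the role {{StringIteratorReader}} played), compared with simply parsing each line off the iterator. This is a minimal Python illustration, not Spark code; the class and delimiter are stand-ins chosen for the example.

```python
import io

# Toy stand-in for StringIteratorReader: adapts an iterator of lines into a
# character stream, which the old CSV path then had to re-split into lines.
class StringIteratorReader(io.TextIOBase):
    def __init__(self, lines):
        self._lines = iter(lines)
        self._buf = ""

    def read(self, n=-1):
        # Pull whole lines into a buffer until n characters are available
        # (or the iterator is exhausted when n is negative).
        while n < 0 or len(self._buf) < n:
            try:
                self._buf += next(self._lines) + "\n"
            except StopIteration:
                break
        if n < 0:
            out, self._buf = self._buf, ""
        else:
            out, self._buf = self._buf[:n], self._buf[n:]
        return out

lines = ["a|0", "b|1"]

# Old approach: wrap the iterator in a reader, then split the stream back
# into lines before parsing each one.
reader = StringIteratorReader(lines)
parsed_old = [row.split("|") for row in reader.read().splitlines()]

# New approach: parse each line directly as it comes off the iterator.
parsed_new = [line.split("|") for line in lines]

assert parsed_old == parsed_new == [["a", "0"], ["b", "1"]]
```

Both paths produce the same rows; the adapter only adds buffering logic, which is the complexity the issue argues against.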
[jira] [Commented] (SPARK-19775) Remove an obsolete `partitionBy().insertInto()` test case
[ https://issues.apache.org/jira/browse/SPARK-19775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889028#comment-15889028 ]

Apache Spark commented on SPARK-19775:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/17106

> Remove an obsolete `partitionBy().insertInto()` test case
> ---------------------------------------------------------
>
>                 Key: SPARK-19775
>                 URL: https://issues.apache.org/jira/browse/SPARK-19775
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Tests
>    Affects Versions: 2.1.0
>            Reporter: Dongjoon Hyun
>            Priority: Trivial
>
> This issue removes [a test case|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L287-L298] which was introduced by [SPARK-16033|https://github.com/apache/spark/commit/10b671447bc04af250cbd8a7ea86f2769147a78a] and was superseded by [SPARK-14459|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L365-L371]. Basically, we cannot use `partitionBy` and `insertInto` together.
> {code}
> test("Reject partitioning that does not match table") {
>   withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
>     sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
>     val data = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" else "odd"))
>       .toDF("id", "data", "part")
>     intercept[AnalysisException] {
>       // cannot partition by 2 fields when there is only one in the table definition
>       data.write.partitionBy("part", "data").insertInto("partitioned")
>     }
>   }
> }
> {code}
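The check the quoted test exercises, rejecting a write whose declared partition columns do not match the table definition, can be illustrated with a toy validator. This is a hedged sketch in plain Python, not Spark's actual analysis code; the function and exception names are hypothetical stand-ins.

```python
# Toy illustration of rejecting mismatched partitioning at write time.
# Names (AnalysisException, insert_into) are hypothetical, not Spark's API.
class AnalysisException(Exception):
    pass

def insert_into(table_partitions, write_partition_by):
    # The table was declared with a fixed partition-column list; a write that
    # supplies a different list (e.g. two columns against a one-column table)
    # is rejected, mirroring the intercept[AnalysisException] in the test.
    if list(write_partition_by) != list(table_partitions):
        raise AnalysisException(
            f"partitioning {write_partition_by} does not match table {table_partitions}"
        )
    return "ok"

# A table partitioned by a single column `part` accepts a matching write:
assert insert_into(["part"], ["part"]) == "ok"

# Partitioning by two columns when the table declares one is rejected:
try:
    insert_into(["part"], ["part", "data"])
    raised = False
except AnalysisException:
    raised = True
assert raised
```

In current Spark the newer test (SPARK-14459) covers this behavior, which is why the older test case is obsolete.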
[jira] [Assigned] (SPARK-19775) Remove an obsolete `partitionBy().insertInto()` test case
[ https://issues.apache.org/jira/browse/SPARK-19775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19775:
------------------------------------

    Assignee:     (was: Apache Spark)
[jira] [Assigned] (SPARK-19775) Remove an obsolete `partitionBy().insertInto()` test case
[ https://issues.apache.org/jira/browse/SPARK-19775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19775:
------------------------------------

    Assignee: Apache Spark
[jira] [Resolved] (SPARK-14480) Remove meaningless StringIteratorReader for CSV data source for better performance
[ https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-14480.
---------------------------------
    Resolution: Fixed