[jira] [Commented] (SPARK-19778) alias cannot be used in group by

2017-02-28 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889700#comment-15889700
 ] 

Takeshi Yamamuro commented on SPARK-19778:
--

{code}
scala> Seq(("a", 0), ("b", 1)).toDF("key", "value").createOrReplaceTempView("t")
scala> sql("SELECT key AS key1 FROM t GROUP BY key1")
org.apache.spark.sql.AnalysisException: cannot resolve '`key1`' given input 
columns: [key, value]; line 1 pos 35;
'Aggregate ['key1], [key#15 AS key1#21]
+- SubqueryAlias t
   +- Project [_1#12 AS key#15, _2#13 AS value#16]
      +- LocalRelation [_1#12, _2#13]

  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:75)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:72)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at org.apache.spark.sql
{code}

Aha, this makes some sense to me because, for example, PostgreSQL accepts this 
query:
{code}

postgres=# create table t(key INT, value INT);
CREATE TABLE

postgres=# insert into t values(1, 0);
INSERT 0 1

postgres=# SELECT key AS key1 FROM t GROUP BY key1;
 key1 
------
    1
(1 row)
{code}
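
For reference, a couple of rewrites that do resolve today (a hedged sketch against the 
same temp view t created above, not a proposed fix): group by the underlying column, or 
make the alias visible through a subquery.
{code}
// Workaround sketch only; assumes the temp view "t" from the snippet above.
scala> sql("SELECT key AS key1 FROM t GROUP BY key").show()                        // group by the source column
scala> sql("SELECT key1 FROM (SELECT key AS key1 FROM t) s GROUP BY key1").show()  // alias resolved via subquery
{code}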

> alias cannot be used in group by
> 
>
> Key: SPARK-19778
> URL: https://issues.apache.org/jira/browse/SPARK-19778
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: xukun
>
> Does not support “select key as key1 from src group by key1”



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19768) Error for both aggregate and non-aggregate queries in Structured Streaming - "This query does not support recovering from checkpoint location"

2017-02-28 Thread Amit Baghel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889698#comment-15889698
 ] 

Amit Baghel commented on SPARK-19768:
-

I am using an aggregate query with format="parquet" and outputMode="append", but it 
throws: "Exception in thread "main" org.apache.spark.sql.AnalysisException: 
Append output mode not supported when there are streaming aggregations on 
streaming DataFrames/DataSets;;"
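
A minimal sketch of the general rule as I understand it (my assumption, not code from 
this thread): in "append" mode a streaming aggregation needs an event-time watermark so 
that finalized groups can be emitted; column names below are hypothetical.
{code}
// Sketch only: "events" stands for a streaming Dataset with an eventTime column.
import org.apache.spark.sql.functions.window
import spark.implicits._

val counts = events
  .withWatermark("eventTime", "10 minutes")              // bound lateness so windows can close
  .groupBy(window($"eventTime", "5 minutes"), $"value")
  .count()

val query = counts.writeStream
  .outputMode("append")                                  // only finalized windows are appended
  .format("parquet")
  .option("path", "/tmp/word-counts")                    // hypothetical output path
  .option("checkpointLocation", "/tmp/checkpoint-data")
  .start()
{code}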

> Error for both  aggregate  and  non-aggregate queries in Structured Streaming 
> - "This query does not support recovering from checkpoint location"
> -
>
> Key: SPARK-19768
> URL: https://issues.apache.org/jira/browse/SPARK-19768
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Amit Baghel
>
> I am running the JavaStructuredKafkaWordCount.java example with a 
> checkpointLocation. The output mode is "complete". Below is the relevant code.
> {code}
>  // Generate running word count
> Dataset<Row> wordCounts = lines.flatMap(new FlatMapFunction<String, String>() {
>   @Override
>   public Iterator<String> call(String x) {
>     return Arrays.asList(x.split(" ")).iterator();
>   }
> }, Encoders.STRING()).groupBy("value").count();
> // Start running the query that prints the running counts to the console
> StreamingQuery query = wordCounts.writeStream()
>   .outputMode("complete")
>   .format("console")
>   .option("checkpointLocation", "/tmp/checkpoint-data")
>   .start();
> {code}
> This example runs successfully and writes data to the checkpoint directory. When 
> I re-run the program it throws the exception below:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: This query 
> does not support recovering from checkpoint location. Delete 
> /tmp/checkpoint-data/offsets to start over.;
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:219)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
>   at 
> com.spark.JavaStructuredKafkaWordCount.main(JavaStructuredKafkaWordCount.java:85)
> {code}
> Then I modified JavaStructuredKafkaWordCount.java to use a non-aggregate query 
> with output mode "append". Please see the code below.
> {code}
> // no aggregations
> Dataset<Row> wordCounts = lines.flatMap(new FlatMapFunction<String, String>() {
>   @Override
>   public Iterator<String> call(String x) {
>     return Arrays.asList(x.split(" ")).iterator();
>   }
> }, Encoders.STRING()).select("value");
> // append mode with console
> StreamingQuery query = wordCounts.writeStream()
>   .outputMode("append")
>   .format("console")
>   .option("checkpointLocation", "/tmp/checkpoint-data")
>   .start();
> {code}
> This modified code runs successfully and writes data to the checkpoint directory. 
> When I re-run the program it throws the same exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: This query 
> does not support recovering from checkpoint location. Delete 
> /tmp/checkpoint-data/offsets to start over.;
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:219)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
>   at 
> com.spark.JavaStructuredKafkaWordCount.main(JavaStructuredKafkaWordCount.java:85)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true

2017-02-28 Thread Shridhar Ramachandran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889677#comment-15889677
 ] 

Shridhar Ramachandran commented on SPARK-5159:
--

I have faced this issue as well, on both 1.6 and 2.0. Some solutions have 
indicated setting hive.metastore.execute.setugi to true on the metastore as 
well as the thrift server, but this did not help.
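
A hypothetical way to check the effective user from the driver (my sketch, not from this 
thread): with doAs honored, operations should run as the connecting user rather than the 
service principal.
{code}
// Diagnostic sketch only; prints the Hadoop user the current JVM executes as.
import org.apache.hadoop.security.UserGroupInformation
println(UserGroupInformation.getCurrentUser.getShortUserName)
{code}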

> Thrift server does not respect hive.server2.enable.doAs=true
> 
>
> Key: SPARK-5159
> URL: https://issues.apache.org/jira/browse/SPARK-5159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Andrew Ray
> Attachments: spark_thrift_server_log.txt
>
>
> I'm currently testing the Spark SQL Thrift server on a Kerberos-secured 
> cluster in YARN mode. Currently any user can access any table regardless of 
> HDFS permissions as all data is read as the hive user. In HiveServer2 the 
> property hive.server2.enable.doAs=true causes all access to be done as the 
> submitting user. We should do the same.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true

2017-02-28 Thread Shridhar Ramachandran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889677#comment-15889677
 ] 

Shridhar Ramachandran edited comment on SPARK-5159 at 3/1/17 7:29 AM:
--

I am facing this issue as well, on both 1.6 and 2.0. Some solutions have 
indicated setting hive.metastore.execute.setugi to true on the metastore as 
well as the thrift server, but this did not help.


was (Author: shridharama):
I have faced this issue as well, on both 1.6 and 2.0. Some solutions have 
indicated setting hive.metastore.execute.setugi to true on the metastore as 
well as the thrift server, but this did not help.

> Thrift server does not respect hive.server2.enable.doAs=true
> 
>
> Key: SPARK-5159
> URL: https://issues.apache.org/jira/browse/SPARK-5159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Andrew Ray
> Attachments: spark_thrift_server_log.txt
>
>
> I'm currently testing the Spark SQL Thrift server on a Kerberos-secured 
> cluster in YARN mode. Currently any user can access any table regardless of 
> HDFS permissions as all data is read as the hive user. In HiveServer2 the 
> property hive.server2.enable.doAs=true causes all access to be done as the 
> submitting user. We should do the same.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19758) Casting string to timestamp in inline table definition fails with AnalysisException

2017-02-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889654#comment-15889654
 ] 

Apache Spark commented on SPARK-19758:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/17114

> Casting string to timestamp in inline table definition fails with 
> AnalysisException
> ---
>
> Key: SPARK-19758
> URL: https://issues.apache.org/jira/browse/SPARK-19758
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> The following query runs successfully on Spark 2.1.x but fails in the current 
> master:
> {code}
> sql("""CREATE TEMPORARY VIEW table_4(timestamp_col_3) AS VALUES 
> TIMESTAMP('1991-12-06 00:00:00.0')""")
> {code}
> Here's the error:
> {code}
> scala> sql("""CREATE TEMPORARY VIEW table_4(timestamp_col_3) AS VALUES 
> TIMESTAMP('1991-12-06 00:00:00.0')""")
> org.apache.spark.sql.AnalysisException: failed to evaluate expression 
> CAST('1991-12-06 00:00:00.0' AS TIMESTAMP): None.get; line 1 pos 50
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4$$anonfun$apply$4.apply(ResolveInlineTables.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4$$anonfun$apply$4.apply(ResolveInlineTables.scala:95)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4.apply(ResolveInlineTables.scala:95)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4.apply(ResolveInlineTables.scala:94)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$.convert(ResolveInlineTables.scala:94)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:36)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:32)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$.apply(ResolveInlineTables.scala:32)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$.apply(ResolveInlineTables.scala:31)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:65)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:63)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:51)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:128)
>   at 
> 

[jira] [Assigned] (SPARK-19758) Casting string to timestamp in inline table definition fails with AnalysisException

2017-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19758:


Assignee: (was: Apache Spark)

> Casting string to timestamp in inline table definition fails with 
> AnalysisException
> ---
>
> Key: SPARK-19758
> URL: https://issues.apache.org/jira/browse/SPARK-19758
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> The following query runs successfully on Spark 2.1.x but fails in the current 
> master:
> {code}
> sql("""CREATE TEMPORARY VIEW table_4(timestamp_col_3) AS VALUES 
> TIMESTAMP('1991-12-06 00:00:00.0')""")
> {code}
> Here's the error:
> {code}
> scala> sql("""CREATE TEMPORARY VIEW table_4(timestamp_col_3) AS VALUES 
> TIMESTAMP('1991-12-06 00:00:00.0')""")
> org.apache.spark.sql.AnalysisException: failed to evaluate expression 
> CAST('1991-12-06 00:00:00.0' AS TIMESTAMP): None.get; line 1 pos 50
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4$$anonfun$apply$4.apply(ResolveInlineTables.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4$$anonfun$apply$4.apply(ResolveInlineTables.scala:95)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4.apply(ResolveInlineTables.scala:95)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4.apply(ResolveInlineTables.scala:94)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$.convert(ResolveInlineTables.scala:94)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:36)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:32)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$.apply(ResolveInlineTables.scala:32)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$.apply(ResolveInlineTables.scala:31)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:65)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:63)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:51)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:128)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> 

[jira] [Assigned] (SPARK-19758) Casting string to timestamp in inline table definition fails with AnalysisException

2017-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19758:


Assignee: Apache Spark

> Casting string to timestamp in inline table definition fails with 
> AnalysisException
> ---
>
> Key: SPARK-19758
> URL: https://issues.apache.org/jira/browse/SPARK-19758
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Blocker
>
> The following query runs successfully on Spark 2.1.x but fails in the current 
> master:
> {code}
> sql("""CREATE TEMPORARY VIEW table_4(timestamp_col_3) AS VALUES 
> TIMESTAMP('1991-12-06 00:00:00.0')""")
> {code}
> Here's the error:
> {code}
> scala> sql("""CREATE TEMPORARY VIEW table_4(timestamp_col_3) AS VALUES 
> TIMESTAMP('1991-12-06 00:00:00.0')""")
> org.apache.spark.sql.AnalysisException: failed to evaluate expression 
> CAST('1991-12-06 00:00:00.0' AS TIMESTAMP): None.get; line 1 pos 50
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4$$anonfun$apply$4.apply(ResolveInlineTables.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4$$anonfun$apply$4.apply(ResolveInlineTables.scala:95)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4.apply(ResolveInlineTables.scala:95)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4.apply(ResolveInlineTables.scala:94)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$.convert(ResolveInlineTables.scala:94)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:36)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:32)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$.apply(ResolveInlineTables.scala:32)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$.apply(ResolveInlineTables.scala:31)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:65)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:63)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:51)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:128)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> 

[jira] [Resolved] (SPARK-19633) FileSource read from FileSink

2017-02-28 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19633.
--
   Resolution: Fixed
 Assignee: Liwei Lin
Fix Version/s: 2.2.0

> FileSource read from FileSink
> -
>
> Key: SPARK-19633
> URL: https://issues.apache.org/jira/browse/SPARK-19633
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Michael Armbrust
>Assignee: Liwei Lin
>Priority: Critical
> Fix For: 2.2.0
>
>
> Right now, you can't start a streaming query from a location that is being 
> written to by the file sink.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17495) Hive hash implementation

2017-02-28 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889645#comment-15889645
 ] 

Tejas Patil commented on SPARK-17495:
-

>> Is it possible to figure out the hashing function based on file names? 

The way datasource files are named is different from Hive's, so this will work. I was 
thinking about a simpler way: use hive-hash only when writing to Hive bucketed tables. 
Since Spark doesn't support Hive bucketing at the moment, any old data must have been 
generated by Hive, so this will not cause breakages for users.

>> 3. In general it'd be useful to allow users to configure which actual hash 
>> function "hash" maps to. This can be a dynamic config.

For any operations related to Hive bucketed tables, we should not let users change the 
hashing function; we should do the right thing underneath. Otherwise, users can shoot 
themselves in the foot (e.g. joining two Hive tables that are both bucketed but use 
different hashing functions). One option was to store the hashing function used to 
populate a table in the metastore, but this won't be compatible with Hive and would 
mess things up in environments where both Spark and Hive are used together.

As far as the plain `hash()` UDF / function is concerned, I am a bit conservative about 
adding a dynamic config, as I feel it might cause problems. Say you start off a session 
with the default murmur3 hash, compute some data, and cache it. If the user later 
switches to hive hash, reusing the cached data as-is is no longer the right thing to do. 
Keeping the setting static for a session would avoid such problems.
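
To illustrate the divergence (my sketch, not from this thread): Spark's built-in 
`hash()` is Murmur3-based, while Hive's bucketing hash for strings is a Java-style 
31-polynomial hash over the UTF-8 bytes, so the same key generally lands in different 
buckets on the two engines.
{code}
// Illustration only; hiveLikeStringHash approximates Hive's Text-based string hash.
def hiveLikeStringHash(s: String): Int =
  s.getBytes("UTF-8").foldLeft(0)((h, b) => 31 * h + b)

spark.range(1).selectExpr("hash('spark') AS murmur3_hash").show()  // Spark's murmur3-based hash()
println((hiveLikeStringHash("spark") & Int.MaxValue) % 8)          // bucket id under 8 Hive-style buckets
{code}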

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility so that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19768) Error for both aggregate and non-aggregate queries in Structured Streaming - "This query does not support recovering from checkpoint location"

2017-02-28 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889639#comment-15889639
 ] 

Shixiong Zhu commented on SPARK-19768:
--

It should work for both aggregate and non-aggregate queries, but it only 
supports "append" mode.

> Error for both  aggregate  and  non-aggregate queries in Structured Streaming 
> - "This query does not support recovering from checkpoint location"
> -
>
> Key: SPARK-19768
> URL: https://issues.apache.org/jira/browse/SPARK-19768
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Amit Baghel
>
> I am running the JavaStructuredKafkaWordCount.java example with a 
> checkpointLocation. The output mode is "complete". Below is the relevant code.
> {code}
>  // Generate running word count
> Dataset<Row> wordCounts = lines.flatMap(new FlatMapFunction<String, String>() {
>   @Override
>   public Iterator<String> call(String x) {
>     return Arrays.asList(x.split(" ")).iterator();
>   }
> }, Encoders.STRING()).groupBy("value").count();
> // Start running the query that prints the running counts to the console
> StreamingQuery query = wordCounts.writeStream()
>   .outputMode("complete")
>   .format("console")
>   .option("checkpointLocation", "/tmp/checkpoint-data")
>   .start();
> {code}
> This example runs successfully and writes data to the checkpoint directory. When 
> I re-run the program it throws the exception below:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: This query 
> does not support recovering from checkpoint location. Delete 
> /tmp/checkpoint-data/offsets to start over.;
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:219)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
>   at 
> com.spark.JavaStructuredKafkaWordCount.main(JavaStructuredKafkaWordCount.java:85)
> {code}
> Then I modified JavaStructuredKafkaWordCount.java to use a non-aggregate query 
> with output mode "append". Please see the code below.
> {code}
> // no aggregations
> Dataset<Row> wordCounts = lines.flatMap(new FlatMapFunction<String, String>() {
>   @Override
>   public Iterator<String> call(String x) {
>     return Arrays.asList(x.split(" ")).iterator();
>   }
> }, Encoders.STRING()).select("value");
> // append mode with console
> StreamingQuery query = wordCounts.writeStream()
>   .outputMode("append")
>   .format("console")
>   .option("checkpointLocation", "/tmp/checkpoint-data")
>   .start();
> {code}
> This modified code runs successfully and writes data to the checkpoint directory. 
> When I re-run the program it throws the same exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: This query 
> does not support recovering from checkpoint location. Delete 
> /tmp/checkpoint-data/offsets to start over.;
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:219)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
>   at 
> com.spark.JavaStructuredKafkaWordCount.main(JavaStructuredKafkaWordCount.java:85)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19752) OrcGetSplits fails with 0 size files

2017-02-28 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889616#comment-15889616
 ] 

Liang-Chi Hsieh commented on SPARK-19752:
-

From the log, it looks like it is a problem in Hive?

> OrcGetSplits fails with 0 size files
> 
>
> Key: SPARK-19752
> URL: https://issues.apache.org/jira/browse/SPARK-19752
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.1.0
>Reporter: Nick Orka
>
> There is a possibility that during some SQL queries a partition may end up with a 
> 0-size (empty) file. The next time I try to read from that file via a SQL 
> query, I get this error:
> 17/02/27 10:33:11 INFO PerfLogger:  start=1488191591570 end=1488191591599 duration=29 
> from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
> 17/02/27 10:33:11 ERROR ApplicationMaster: User class threw exception: 
> java.lang.reflect.InvocationTargetException
> java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> scala.reflect.runtime.JavaMirrors$JavaMirror$JavaVanillaMethodMirror1.jinvokeraw(JavaMirrors.scala:373)
>   at 
> scala.reflect.runtime.JavaMirrors$JavaMirror$JavaMethodMirror.jinvoke(JavaMirrors.scala:339)
>   at 
> scala.reflect.runtime.JavaMirrors$JavaMirror$JavaVanillaMethodMirror.apply(JavaMirrors.scala:355)
>   at com.sessionm.Datapipeline$.main(Datapipeline.scala:200)
>   at com.sessionm.Datapipeline.main(Datapipeline.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
> Caused by: java.lang.RuntimeException: serious problem
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
>   at 
> scala.collection.parallel.AugmentedIterableIterator$class.map2combiner(RemainsIterator.scala:115)
>   at 
> scala.collection.parallel.immutable.ParVector$ParVectorIterator.map2combiner(ParVector.scala:62)
>   at 
> scala.collection.parallel.ParIterableLike$Map.leaf(ParIterableLike.scala:1054)
>   at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
>   at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
>   at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
>   at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
>   at 
> scala.collection.parallel.ParIterableLike$Map.tryLeaf(ParIterableLike.scala:1051)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:169)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:443)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:149)
>   at 
> 

[jira] [Commented] (SPARK-19764) Executors hang with supposedly running tasks that are really finished.

2017-02-28 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889611#comment-15889611
 ] 

Shixiong Zhu commented on SPARK-19764:
--

These are the master and worker logs. From the master log, you are using pyspark in 
client mode, so the driver logs should just go to the console. Could you paste the 
output of the pyspark shell?

> Executors hang with supposedly running tasks that are really finished.
> -
>
> Key: SPARK-19764
> URL: https://issues.apache.org/jira/browse/SPARK-19764
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.0.2
> Environment: Ubuntu 16.04 LTS
> OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13)
> Spark 2.0.2 - Spark Cluster Manager
>Reporter: Ari Gesher
> Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, 
> SPARK-19764.tgz
>
>
> We've come across a job that won't finish.  Running on a six-node cluster, 
> each of the executors ends up with 5-7 tasks that are never marked as 
> completed.
> Here's an excerpt from the web UI:
> ||Index||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read Size / Records||Errors||
> |105|1131|0|SUCCESS|PROCESS_LOCAL|4 / 172.31.24.171|2017/02/27 22:51:36|1.9 min|9 ms|4 ms|0.7 s|2 ms|6 ms|384.1 MB|90.3 MB / 572| |
> |106|1168|0|RUNNING|ANY|2 / 172.31.16.112|2017/02/27 22:53:25|6.5 h|0 ms|0 ms|1 s|0 ms|0 ms| |384.1 MB|98.7 MB / 624| |
> However, the Executor reports the task as finished: 
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> As does the driver log:
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> Full log from this executor and the {{stderr}} from 
> {{app-20170227223614-0001/2/stderr}} attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19460) Update dataset used in R documentation, examples to reduce warning noise and confusions

2017-02-28 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-19460.
--
  Resolution: Fixed
Assignee: Miao Wang
   Fix Version/s: 2.2.0
Target Version/s: 2.2.0

> Update dataset used in R documentation, examples to reduce warning noise and 
> confusions
> ---
>
> Key: SPARK-19460
> URL: https://issues.apache.org/jira/browse/SPARK-19460
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Miao Wang
> Fix For: 2.2.0
>
>
> Running the build we get a bunch of warnings from using the `iris` dataset, for 
> example.
> Warning in FUN(X[[1L]], ...) :
> Use Sepal_Length instead of Sepal.Length as column name
> Warning in FUN(X[[2L]], ...) :
> Use Sepal_Width instead of Sepal.Width as column name
> Warning in FUN(X[[3L]], ...) :
> Use Petal_Length instead of Petal.Length as column name
> Warning in FUN(X[[4L]], ...) :
> Use Petal_Width instead of Petal.Width as column name
> Warning in FUN(X[[1L]], ...) :
> Use Sepal_Length instead of Sepal.Length as column name
> Warning in FUN(X[[2L]], ...) :
> Use Sepal_Width instead of Sepal.Width as column name
> Warning in FUN(X[[3L]], ...) :
> Use Petal_Length instead of Petal.Length as column name
> Warning in FUN(X[[4L]], ...) :
> Use Petal_Width instead of Petal.Width as column name
> Warning in FUN(X[[1L]], ...) :
> Use Sepal_Length instead of Sepal.Length as column name
> Warning in FUN(X[[2L]], ...) :
> Use Sepal_Width instead of Sepal.Width as column name
> Warning in FUN(X[[3L]], ...) :
> Use Petal_Length instead of Petal.Length as column name
> These are the results of having `.` in the column names. For reference, see 
> SPARK-12191 and SPARK-11976. Since supporting that would involve changing SQL, if we 
> can't support it there then we should strongly consider using another dataset 
> without `.`, e.g. `cars`.
> And we should update this in the API docs (roxygen2 doc strings), vignettes, 
> programming guide, and R code examples.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6951) History server slow startup if the event log directory is large

2017-02-28 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889588#comment-15889588
 ] 

Zheng Shao commented on SPARK-6951:
---

Did we consider using a distributed store to solve the scalability issue?  
A single-node LevelDB will hit scalability issues soon.


> History server slow startup if the event log directory is large
> ---
>
> Key: SPARK-6951
> URL: https://issues.apache.org/jira/browse/SPARK-6951
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Matt Cheah
>
> I started my history server, then navigated to the web UI where I expected to 
> be able to view some completed applications, but the webpage was not 
> available. It turned out that the History Server was not finished parsing all 
> of the event logs in the event log directory that I had specified. I had 
> accumulated a lot of event logs from months of running Spark, so it would 
> have taken a very long time for the History Server to crunch through them 
> all. I purged the event log directory and started from scratch, and the UI 
> loaded immediately.
> We should have a pagination strategy or parse the directory lazily to avoid 
> needing to wait after starting the history server.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19572) Allow to disable hive in sparkR shell

2017-02-28 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-19572.
--
  Resolution: Fixed
Assignee: Jeff Zhang
   Fix Version/s: 2.2.0
  2.1.1
Target Version/s: 2.1.1, 2.2.0

> Allow to disable hive in sparkR shell
> -
>
> Key: SPARK-19572
> URL: https://issues.apache.org/jira/browse/SPARK-19572
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>Priority: Minor
> Fix For: 2.1.1, 2.2.0
>
>
> SPARK-15236 does this for the Scala shell; this ticket is for the SparkR shell. This 
> is not only for SparkR itself, but can also benefit downstream projects like 
> Livy, which uses shell.R for its interactive sessions. For now, Livy has no 
> control over whether Hive is enabled or not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19764) Executors hang with supposedly running tasks that are really finished.

2017-02-28 Thread Ari Gesher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889566#comment-15889566
 ] 

Ari Gesher commented on SPARK-19764:


And here's the stuck Executor:

{noformat}
Full thread dump OpenJDK 64-Bit Server VM (25.121-b13 mixed mode):

"shuffle-server-1" #30 daemon prio=5 os_prio=0 tid=0x7fdf9400d800 nid=0xd22 
runnable [0x7fdfc4726000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0xc014b990> (a 
io.netty.channel.nio.SelectedSelectionKeySet)
- locked <0xc014da10> (a java.util.Collections$UnmodifiableSet)
- locked <0xc014b8f8> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:622)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:310)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)

"File appending thread for /home/ubuntu/app-20170228223629-/1/stderr" #56 
daemon prio=5 os_prio=0 tid=0x7fdf7803b000 nid=0xce7 runnable 
[0x7fdfc4827000]
   java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:255)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
- locked <0xf16a3010> (a 
java.lang.UNIXProcess$ProcessPipeInputStream)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at 
org.apache.spark.util.logging.FileAppender$$anonfun$appendStreamToFile$1.apply$mcV$sp(FileAppender.scala:68)
at 
org.apache.spark.util.logging.FileAppender$$anonfun$appendStreamToFile$1.apply(FileAppender.scala:62)
at 
org.apache.spark.util.logging.FileAppender$$anonfun$appendStreamToFile$1.apply(FileAppender.scala:62)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1310)
at 
org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:78)
at 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
at 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1953)
at 
org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)

"File appending thread for /home/ubuntu/app-20170228223629-/1/stdout" #55 
daemon prio=5 os_prio=0 tid=0x7fdf78036800 nid=0xcde runnable 
[0x7fdfc4e2b000]
   java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:255)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
- locked <0xf16a0f50> (a 
java.lang.UNIXProcess$ProcessPipeInputStream)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at 
org.apache.spark.util.logging.FileAppender$$anonfun$appendStreamToFile$1.apply$mcV$sp(FileAppender.scala:68)
at 
org.apache.spark.util.logging.FileAppender$$anonfun$appendStreamToFile$1.apply(FileAppender.scala:62)
at 
org.apache.spark.util.logging.FileAppender$$anonfun$appendStreamToFile$1.apply(FileAppender.scala:62)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1310)
at 
org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:78)
at 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
at 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1953)
at 
org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)

"process reaper" #54 daemon prio=10 os_prio=0 tid=0x7fdf78032800 nid=0xcdc 
runnable [0x7fdfc6de2000]
   java.lang.Thread.State: RUNNABLE
at java.lang.UNIXProcess.waitForProcessExit(Native Method)
at java.lang.UNIXProcess.lambda$initStreams$3(UNIXProcess.java:289)

[jira] [Comment Edited] (SPARK-19764) Executors hang with supposedly running tasks that are really finished.

2017-02-28 Thread Ari Gesher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889564#comment-15889564
 ] 

Ari Gesher edited comment on SPARK-19764 at 3/1/17 6:05 AM:


Nothing like that.  Full logs in the attached tarball.  Here's the stack trace 
from master:

{noformat}
Full thread dump OpenJDK 64-Bit Server VM (25.121-b13 mixed mode):

"shuffle-server-7" #36 daemon prio=5 os_prio=0 tid=0x7f0764019800 nid=0xcd7 
runnable [0x7f0720684000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0xc0014950> (a 
io.netty.channel.nio.SelectedSelectionKeySet)
- locked <0xc00169d0> (a java.util.Collections$UnmodifiableSet)
- locked <0xc00148a8> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:622)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:310)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)

"shuffle-server-6" #35 daemon prio=5 os_prio=0 tid=0x7f0764017800 nid=0xb6a 
runnable [0x7f07218f7000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0xc0017198> (a 
io.netty.channel.nio.SelectedSelectionKeySet)
- locked <0xc0019218> (a java.util.Collections$UnmodifiableSet)
- locked <0xc0017100> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:622)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:310)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)

"shuffle-server-5" #34 daemon prio=5 os_prio=0 tid=0x7f0764015800 nid=0xb69 
runnable [0x7f07219f8000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0xc00199e0> (a 
io.netty.channel.nio.SelectedSelectionKeySet)
- locked <0xc01eee80> (a java.util.Collections$UnmodifiableSet)
- locked <0xc0019948> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:622)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:310)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)

"shuffle-server-4" #33 daemon prio=5 os_prio=0 tid=0x7f0764014000 nid=0xb1f 
runnable [0x7f0721af9000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0xc01ef648> (a 
io.netty.channel.nio.SelectedSelectionKeySet)
- locked <0xc01f16c8> (a java.util.Collections$UnmodifiableSet)
- locked <0xc01ef5b0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:622)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:310)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)

"shuffle-server-3" #32 daemon prio=5 os_prio=0 tid=0x7f0764012000 nid=0xb0d 
runnable [0x7f0721786000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at 

[jira] [Commented] (SPARK-13669) Job will always fail in the external shuffle service unavailable situation

2017-02-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889562#comment-15889562
 ] 

Apache Spark commented on SPARK-13669:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/17113

> Job will always fail in the external shuffle service unavailable situation
> --
>
> Key: SPARK-13669
> URL: https://issues.apache.org/jira/browse/SPARK-13669
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Reporter: Saisai Shao
>
> Currently we are running into an issue with YARN work-preserving restart enabled + 
> the external shuffle service. 
> In the work-preserving scenario, the failure of an NM does not lead to the exit of 
> its executors, so the executors can still accept and run tasks. The problem is that 
> when the NM has failed, the external shuffle service is actually inaccessible, so 
> reduce tasks will always complain about a “Fetch failure”, and the failure of the 
> reduce stage makes the parent (map) stage rerun. The tricky thing is that the Spark 
> scheduler is not aware of the unavailability of the external shuffle service, and 
> will reschedule the map tasks on the executors whose NM has failed; the reduce stage 
> then fails again with “Fetch failure”, and after 4 retries the job fails.
> So the actual problem is that Spark’s scheduler is not aware of the unavailability 
> of the external shuffle service, and will still assign tasks to those nodes. The fix 
> is to avoid assigning tasks to those nodes.
> Currently in Spark, one related configuration is 
> “spark.scheduler.executorTaskBlacklistTime”, but I don’t think it will work in this 
> scenario. That configuration is used to avoid rescheduling the same reattempted task 
> on the same executor. Mechanisms like MapReduce’s blacklist may also not handle this 
> scenario, since all the reduce tasks will fail, so counting failed tasks would 
> equally mark all the executors as “bad”.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13669) Job will always fail in the external shuffle service unavailable situation

2017-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13669:


Assignee: Apache Spark

> Job will always fail in the external shuffle service unavailable situation
> --
>
> Key: SPARK-13669
> URL: https://issues.apache.org/jira/browse/SPARK-13669
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Reporter: Saisai Shao
>Assignee: Apache Spark
>
> Currently we are running into an issue with YARN work-preserving restart enabled + 
> the external shuffle service. 
> In the work-preserving scenario, the failure of an NM does not lead to the exit of 
> its executors, so the executors can still accept and run tasks. The problem is that 
> when the NM has failed, the external shuffle service is actually inaccessible, so 
> reduce tasks will always complain about a “Fetch failure”, and the failure of the 
> reduce stage makes the parent (map) stage rerun. The tricky thing is that the Spark 
> scheduler is not aware of the unavailability of the external shuffle service, and 
> will reschedule the map tasks on the executors whose NM has failed; the reduce stage 
> then fails again with “Fetch failure”, and after 4 retries the job fails.
> So the actual problem is that Spark’s scheduler is not aware of the unavailability 
> of the external shuffle service, and will still assign tasks to those nodes. The fix 
> is to avoid assigning tasks to those nodes.
> Currently in Spark, one related configuration is 
> “spark.scheduler.executorTaskBlacklistTime”, but I don’t think it will work in this 
> scenario. That configuration is used to avoid rescheduling the same reattempted task 
> on the same executor. Mechanisms like MapReduce’s blacklist may also not handle this 
> scenario, since all the reduce tasks will fail, so counting failed tasks would 
> equally mark all the executors as “bad”.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13669) Job will always fail in the external shuffle service unavailable situation

2017-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13669:


Assignee: (was: Apache Spark)

> Job will always fail in the external shuffle service unavailable situation
> --
>
> Key: SPARK-13669
> URL: https://issues.apache.org/jira/browse/SPARK-13669
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Reporter: Saisai Shao
>
> Currently we are running into an issue with YARN work-preserving restart 
> enabled plus the external shuffle service. 
> In the work-preserving scenario, the failure of an NM does not cause its 
> executors to exit, so those executors can still accept and run tasks. The 
> problem is that when the NM is down, the external shuffle service on that 
> node is inaccessible, so reduce tasks keep failing with “Fetch failure”, and 
> the failure of the reduce stage makes the parent (map) stage rerun. 
> The tricky part is that the Spark scheduler is not aware that the external 
> shuffle service is unavailable, so it reschedules the map tasks on the 
> executors whose NM has failed; the reduce stage then fails again with “Fetch 
> failure”, and after 4 retries the job fails.
> So the actual problem is that Spark’s scheduler does not know the external 
> shuffle service is unavailable and still assigns tasks to those nodes. The 
> fix is to avoid assigning tasks to those nodes.
> One related configuration in Spark is 
> “spark.scheduler.executorTaskBlacklistTime”, but I don’t think it helps in 
> this scenario: it only prevents a reattempt of the same task from running on 
> the same executor. Mechanisms like MapReduce’s blacklist may not handle this 
> scenario either, since all the reduce tasks fail, so counting failed tasks 
> would equally mark all the executors as “bad”.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19764) Executors hang with supposedly running task that are really finished.

2017-02-28 Thread Ari Gesher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889564#comment-15889564
 ] 

Ari Gesher commented on SPARK-19764:


Nothing like that.  Full logs in the attached tarball.  Here's the stack trace 
from a stuck executor:

{noformat}
Full thread dump OpenJDK 64-Bit Server VM (25.121-b13 mixed mode):

"shuffle-server-7" #36 daemon prio=5 os_prio=0 tid=0x7f0764019800 nid=0xcd7 
runnable [0x7f0720684000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0xc0014950> (a 
io.netty.channel.nio.SelectedSelectionKeySet)
- locked <0xc00169d0> (a java.util.Collections$UnmodifiableSet)
- locked <0xc00148a8> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:622)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:310)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)

"shuffle-server-6" #35 daemon prio=5 os_prio=0 tid=0x7f0764017800 nid=0xb6a 
runnable [0x7f07218f7000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0xc0017198> (a 
io.netty.channel.nio.SelectedSelectionKeySet)
- locked <0xc0019218> (a java.util.Collections$UnmodifiableSet)
- locked <0xc0017100> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:622)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:310)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)

"shuffle-server-5" #34 daemon prio=5 os_prio=0 tid=0x7f0764015800 nid=0xb69 
runnable [0x7f07219f8000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0xc00199e0> (a 
io.netty.channel.nio.SelectedSelectionKeySet)
- locked <0xc01eee80> (a java.util.Collections$UnmodifiableSet)
- locked <0xc0019948> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:622)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:310)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)

"shuffle-server-4" #33 daemon prio=5 os_prio=0 tid=0x7f0764014000 nid=0xb1f 
runnable [0x7f0721af9000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0xc01ef648> (a 
io.netty.channel.nio.SelectedSelectionKeySet)
- locked <0xc01f16c8> (a java.util.Collections$UnmodifiableSet)
- locked <0xc01ef5b0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:622)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:310)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)

"shuffle-server-3" #32 daemon prio=5 os_prio=0 tid=0x7f0764012000 nid=0xb0d 
runnable [0x7f0721786000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at 

[jira] [Updated] (SPARK-19764) Executors hang with supposedly running task that are really finished.

2017-02-28 Thread Ari Gesher (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ari Gesher updated SPARK-19764:
---
Attachment: SPARK-19764.tgz

> Executors hang with supposedly running task that are really finished.
> -
>
> Key: SPARK-19764
> URL: https://issues.apache.org/jira/browse/SPARK-19764
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.0.2
> Environment: Ubuntu 16.04 LTS
> OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13)
> Spark 2.0.2 - Spark Cluster Manager
>Reporter: Ari Gesher
> Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, 
> SPARK-19764.tgz
>
>
> We've come across a job that won't finish.  Running on a six-node cluster, 
> each of the executors ends up with 5-7 tasks that are never marked as 
> completed.
> Here's an excerpt from the web UI:
> ||Index  ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch 
> Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result 
> Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read 
> Size / Records||Errors||
> |105  | 1131  | 0 | SUCCESS   |PROCESS_LOCAL  |4 / 172.31.24.171 |
> 2017/02/27 22:51:36 |   1.9 min |   9 ms |  4 ms |  0.7 s | 2 ms|   6 ms| 
>   384.1 MB|   90.3 MB / 572   | |
> |106| 1168|   0|  RUNNING |ANY|   2 / 172.31.16.112|  2017/02/27 
> 22:53:25|6.5 h   |0 ms|  0 ms|   1 s |0 ms|  0 ms|   |384.1 MB   
> |98.7 MB / 624 | |  
> However, the Executor reports the task as finished: 
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> As does the driver log:
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> Full log from this executor and the {{stderr}} from 
> {{app-20170227223614-0001/2/stderr}} attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19779) structured streaming exist residual tmp file

2017-02-28 Thread Feng Gui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Gui updated SPARK-19779:
-
Description: The PR (https://github.com/apache/spark/pull/17012) fixes restarting 
a Structured Streaming application that uses HDFS as the file system, but a 
temporary delta file is still left behind in HDFS. Structured Streaming does not 
delete this temp file when the streaming job is restarted, so the temp file 
should be deleted after the restart.  
(was: The PR (https://github.com/apache/spark/pull/17012) can to fix restart a 
Structured Streaming application using hdfs as fileSystem, but that exist an 
problem that an tmp file of delta file is still reserved in hdfs. And 
Structured Streaming don't delete the tmp file which generated when restart 
streaming job.)

> structured streaming exist residual tmp file 
> -
>
> Key: SPARK-19779
> URL: https://issues.apache.org/jira/browse/SPARK-19779
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Feng Gui
>Priority: Minor
>
> The PR (https://github.com/apache/spark/pull/17012) fixes restarting a 
> Structured Streaming application that uses HDFS as the file system, but a 
> temporary delta file is still left behind in HDFS. Structured Streaming does 
> not delete this temp file when the streaming job is restarted, so the temp 
> file should be deleted after the restart.
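
For illustration, a minimal Scala sketch of the kind of cleanup being asked for, using the Hadoop FileSystem API; the assumption that leftover delta files can be identified by a ".tmp" suffix, and the directory layout, are hypothetical.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object TmpDeltaCleanup {
  /** Delete leftover temporary delta files under a state-store directory.
    * Assumes (hypothetically) that the temp files end with ".tmp". */
  def cleanup(stateDir: String): Unit = {
    val path = new Path(stateDir)
    val fs = FileSystem.get(path.toUri, new Configuration())
    if (fs.exists(path)) {
      fs.listStatus(path)
        .filter(status => status.isFile && status.getPath.getName.endsWith(".tmp"))
        .foreach(status => fs.delete(status.getPath, false))
    }
  }
}
{code}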



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19779) structured streaming exist residual tmp file

2017-02-28 Thread Feng Gui (JIRA)
Feng Gui created SPARK-19779:


 Summary: structured streaming exist residual tmp file 
 Key: SPARK-19779
 URL: https://issues.apache.org/jira/browse/SPARK-19779
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Feng Gui
Priority: Minor


The PR (https://github.com/apache/spark/pull/17012) fixes restarting a 
Structured Streaming application that uses HDFS as the file system, but a 
temporary delta file is still left behind in HDFS. Structured Streaming does 
not delete this temp file when the streaming job is restarted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19776) Is the JavaKafkaWordCount example correct for Spark version 2.1?

2017-02-28 Thread Russell Abedin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Russell Abedin updated SPARK-19776:
---
Description: 
My question is: is 
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
 correct?

I'm pretty new to both Spark and Java.  I wanted to find an example of Spark 
Streaming using Java, streaming from Kafka. The JavaKafkaWordCount at 
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
 looked to be perfect.

However, when I tried running it, I found a couple of issues that I needed to 
overcome.

1. This line was unnecessary:
{code}
StreamingExamples.setStreamingLogLevels();
{code}

Having this line in there (and the associated import) caused me to go looking 
for a dependency, spark-examples_2.10, which was of no real use to me.

2. After running it, this line: 
{code}
JavaPairReceiverInputDStream messages = 
KafkaUtils.createStream(jssc, args[0], args[1], topicMap);
{code}

Appeared to throw an error around logging:
{code}
Exception in thread "main" java.lang.NoClassDefFoundError: 
org/apache/spark/Logging
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at 
org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:91
at 
org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:66
at 
org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:11
at 
org.apache.spark.streaming.kafka.KafkaUtils.createStream(KafkaUtils.scala)
at main.java.com.cm.JavaKafkaWordCount.main(JavaKafkaWordCount.java:72)
{code}

To get around this, I found that the code sample in 
https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html 
helped me to come up with the right lines to see streaming from Kafka in 
action. Specifically this called createDirectStream instead of createStream.

So is the example at 
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
 incorrect, or is there something I could have done differently to get that 
example working?

  was:
My question is 
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
 correct?

I'm pretty new to both Spark and Java.  I wanted to find an example of Spark 
Streaming using Java, streaming from Kafka. The JavaKafkaWordCount at 
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
 looked to be perfect.

However, when I tried running it, I found a couple of issues that I needed to 
overcome.

1. This line was unnecessary:
{code}
StreamingExamples.setStreamingLogLevels();
{code}

Having this line in there (and the associated import) caused me to go looking 
for a dependency spark-examples_2.10 which of no real use to me.

2. After running it, I this line: 
{code}
JavaPairReceiverInputDStream messages = 
KafkaUtils.createStream(jssc, args[0], args[1], topicMap);
{code}

Appeared to throw an error around logging:
{code}
Exception in thread "main" java.lang.NoClassDefFoundError: 
org/apache/spark/Logging
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at 
org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:91
at 

[jira] [Updated] (SPARK-19776) Is the JavaKafkaWordCount example correct for Spark version 2.1?

2017-02-28 Thread Russell Abedin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Russell Abedin updated SPARK-19776:
---
Summary: Is the JavaKafkaWordCount example correct for Spark version 2.1?  
(was: Is the JavaKafkaWordCount correct on Spark version 2.1)

> Is the JavaKafkaWordCount example correct for Spark version 2.1?
> 
>
> Key: SPARK-19776
> URL: https://issues.apache.org/jira/browse/SPARK-19776
> Project: Spark
>  Issue Type: Question
>  Components: Examples, ML
>Affects Versions: 2.1.0
>Reporter: Russell Abedin
>
> My question is: is 
> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
>  correct?
> I'm pretty new to both Spark and Java.  I wanted to find an example of Spark 
> Streaming using Java, streaming from Kafka. The JavaKafkaWordCount at 
> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
>  looked to be perfect.
> However, when I tried running it, I found a couple of issues that I needed to 
> overcome.
> 1. This line was unnecessary:
> {code}
> StreamingExamples.setStreamingLogLevels();
> {code}
> Having this line in there (and the associated import) caused me to go looking 
> for a dependency, spark-examples_2.10, which was of no real use to me.
> 2. After running it, this line: 
> {code}
> JavaPairReceiverInputDStream messages = 
> KafkaUtils.createStream(jssc, args[0], args[1], topicMap);
> {code}
> Appeared to throw an error around logging:
> {code}
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/apache/spark/Logging
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
> at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at 
> org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:91
> at 
> org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:66
> at 
> org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:11
> at 
> org.apache.spark.streaming.kafka.KafkaUtils.createStream(KafkaUtils.scala)
> at 
> main.java.com.cm.JavaKafkaWordCount.main(JavaKafkaWordCount.java:72)
> {code}
> To get around this, I found that the code sample in 
> https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html 
> helped me to come up with the right lines to see streaming from Kafka in 
> action. Specifically this called createDirectStream instead of createStream.
> So is the example at 
> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
>  incorrect, or is there something I could have done differently to get that 
> example working?
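
For reference, the 0-10 direct-stream approach the reporter ended up using looks roughly like this in Scala (the Java form in the linked guide is analogous); the broker address, group id and topic name below are placeholders.
{code}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object DirectKafkaWordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectKafkaWordCountSketch")
    val ssc = new StreamingContext(conf, Seconds(2))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",            // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "word-count-example",                 // placeholder group id
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean))

    // createDirectStream replaces the old receiver-based createStream.
    val messages = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("topic1"), kafkaParams))

    messages.map(_.value)
      .flatMap(_.split(" "))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}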



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19768) Error for both aggregate and non-aggregate queries in Structured Streaming - "This query does not support recovering from checkpoint location"

2017-02-28 Thread Amit Baghel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889472#comment-15889472
 ] 

Amit Baghel edited comment on SPARK-19768 at 3/1/17 4:14 AM:
-

Thanks [~zsxwing] for the clarification. The documentation for structured 
streaming is missing this piece of information. The error thrown in the case of 
a console sink with a checkpoint should be more meaningful. I have one more 
question: does a file sink using "parquet" with a checkpoint work only for 
non-aggregate queries? I tried both aggregate and non-aggregate queries and I 
am getting an exception for the aggregate query. 


was (Author: baghelamit):
Thanks [~zsxwing] for clarification. Documentation for structured streaming 
missing this piece of information. The error thrown in case of console sink 
with checkpoint should be more meaningful. I have one more question. Does file 
sink using "parquet" and checkpoint work only for non-aggregate query? I tried 
this for both aggregate and non-aggregate queries and I am getting exception 
for aggregate query. 

> Error for both  aggregate  and  non-aggregate queries in Structured Streaming 
> - "This query does not support recovering from checkpoint location"
> -
>
> Key: SPARK-19768
> URL: https://issues.apache.org/jira/browse/SPARK-19768
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Amit Baghel
>
> I am running JavaStructuredKafkaWordCount.java example with 
> checkpointLocation. Output mode is "complete". Below is relevant code.
> {code}
>  // Generate running word count
> Dataset wordCounts = lines.flatMap(new FlatMapFunction String>() {
>   @Override
>   public Iterator call(String x) {
> return Arrays.asList(x.split(" ")).iterator();
>   }
> }, Encoders.STRING()).groupBy("value").count();
> // Start running the query that prints the running counts to the console
> StreamingQuery query = wordCounts.writeStream()
>   .outputMode("complete")
>   .format("console")
>   .option("checkpointLocation", "/tmp/checkpoint-data")
>   .start();
> {code}
> This example runs successfully and writes data in checkpoint directory. When 
> I re-run the program it throws below exception
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: This query 
> does not support recovering from checkpoint location. Delete 
> /tmp/checkpoint-data/offsets to start over.;
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:219)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
>   at 
> com.spark.JavaStructuredKafkaWordCount.main(JavaStructuredKafkaWordCount.java:85)
> {code}
> Then I modified JavaStructuredKafkaWordCount.java to have non aggregate query 
> with output mode as "append". Please see the code below.
> {code}
> // no aggregations
> Dataset wordCounts = lines.flatMap(new FlatMapFunction String>() {
>   @Override
>   public Iterator call(String x) {
> return Arrays.asList(x.split(" ")).iterator();
>   }
> }, Encoders.STRING()).select("value");
> // append mode with console
> StreamingQuery query = wordCounts.writeStream()
>   .outputMode("append")
>   .format("console")
>   .option("checkpointLocation", "/tmp/checkpoint-data")
>   .start();
> {code}
> This modified code runs successfully and writes data in checkpoint directory. 
> When I re-run the program it throws same exception
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: This query 
> does not support recovering from checkpoint location. Delete 
> /tmp/checkpoint-data/offsets to start over.;
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:219)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
>   at 
> com.spark.JavaStructuredKafkaWordCount.main(JavaStructuredKafkaWordCount.java:85)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19768) Error for both aggregate and non-aggregate queries in Structured Streaming - "This query does not support recovering from checkpoint location"

2017-02-28 Thread Amit Baghel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889472#comment-15889472
 ] 

Amit Baghel commented on SPARK-19768:
-

Thanks [~zsxwing] for the clarification. The documentation for structured 
streaming is missing this piece of information. The error thrown in the case of 
a console sink with a checkpoint should be more meaningful. I have one more 
question: does a file sink using "parquet" with a checkpoint work only for 
non-aggregate queries? I tried both aggregate and non-aggregate queries and I 
am getting an exception for the aggregate query. 
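
For reference, a sketch in Scala of the non-aggregate variant written to a parquet file sink with a checkpoint (broker, topic and paths are placeholders); per the report above, it is the aggregate variant that raises the exception.
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, split}

// Requires the spark-sql-kafka-0-10 artifact on the classpath.
val spark = SparkSession.builder().appName("ParquetSinkSketch").getOrCreate()

val lines = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")   // placeholder
  .option("subscribe", "topic1")                         // placeholder
  .load()
  .selectExpr("CAST(value AS STRING) AS line")

// No aggregation: just split the lines into words.
val words = lines.select(explode(split(col("line"), " ")).as("value"))

val query = words.writeStream
  .outputMode("append")
  .format("parquet")
  .option("path", "/tmp/words-parquet")                  // placeholder
  .option("checkpointLocation", "/tmp/checkpoint-data")  // placeholder
  .start()

query.awaitTermination()
{code}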

> Error for both  aggregate  and  non-aggregate queries in Structured Streaming 
> - "This query does not support recovering from checkpoint location"
> -
>
> Key: SPARK-19768
> URL: https://issues.apache.org/jira/browse/SPARK-19768
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Amit Baghel
>
> I am running JavaStructuredKafkaWordCount.java example with 
> checkpointLocation. Output mode is "complete". Below is relevant code.
> {code}
>  // Generate running word count
> Dataset wordCounts = lines.flatMap(new FlatMapFunction String>() {
>   @Override
>   public Iterator call(String x) {
> return Arrays.asList(x.split(" ")).iterator();
>   }
> }, Encoders.STRING()).groupBy("value").count();
> // Start running the query that prints the running counts to the console
> StreamingQuery query = wordCounts.writeStream()
>   .outputMode("complete")
>   .format("console")
>   .option("checkpointLocation", "/tmp/checkpoint-data")
>   .start();
> {code}
> This example runs successfully and writes data in checkpoint directory. When 
> I re-run the program it throws below exception
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: This query 
> does not support recovering from checkpoint location. Delete 
> /tmp/checkpoint-data/offsets to start over.;
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:219)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
>   at 
> com.spark.JavaStructuredKafkaWordCount.main(JavaStructuredKafkaWordCount.java:85)
> {code}
> Then I modified JavaStructuredKafkaWordCount.java to have non aggregate query 
> with output mode as "append". Please see the code below.
> {code}
> // no aggregations
> Dataset wordCounts = lines.flatMap(new FlatMapFunction String>() {
>   @Override
>   public Iterator call(String x) {
> return Arrays.asList(x.split(" ")).iterator();
>   }
> }, Encoders.STRING()).select("value");
> // append mode with console
> StreamingQuery query = wordCounts.writeStream()
>   .outputMode("append")
>   .format("console")
>   .option("checkpointLocation", "/tmp/checkpoint-data")
>   .start();
> {code}
> This modified code runs successfully and writes data in checkpoint directory. 
> When I re-run the program it throws same exception
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: This query 
> does not support recovering from checkpoint location. Delete 
> /tmp/checkpoint-data/offsets to start over.;
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:219)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
>   at 
> com.spark.JavaStructuredKafkaWordCount.main(JavaStructuredKafkaWordCount.java:85)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16929) Speculation-related synchronization bottleneck in checkSpeculatableTasks

2017-02-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889435#comment-15889435
 ] 

Apache Spark commented on SPARK-16929:
--

User 'jinxing64' has created a pull request for this issue:
https://github.com/apache/spark/pull/17112

> Speculation-related synchronization bottleneck in checkSpeculatableTasks
> 
>
> Key: SPARK-16929
> URL: https://issues.apache.org/jira/browse/SPARK-16929
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Nicholas Brown
>
> Our cluster has been running slowly since I got speculation working. I looked 
> into it and noticed that stderr said some tasks were taking almost an hour to 
> run, even though the application logs on the nodes showed those tasks only 
> took a minute or so.  Digging into the thread dump for the master node I 
> noticed that a number of threads are blocked, apparently by the speculation 
> thread.  At line 476 of TaskSchedulerImpl it grabs a lock on the TaskScheduler 
> while it looks through the tasks to see what needs to be rerun.  Unfortunately 
> that code loops through every task, so when you have even just a couple 
> hundred thousand tasks to run, that can be prohibitively slow inside a 
> synchronized block.  Once I disabled speculation, the job went back to having 
> acceptable performance.
> There are no comments around that lock indicating why it was added, and the 
> git history has a couple of refactorings, so it is hard to find where it came 
> from.  I'm tempted to believe it is the result of someone assuming that an 
> extra synchronized block never hurt anyone (in reality I've probably seen as 
> many bugs caused by over-synchronization as by too little), as it looks too 
> broad to actually be guarding any concurrency issue.  But since concurrency 
> issues can be tricky to reproduce (and yes, I understand that's an extreme 
> understatement), I'm not sure blindly removing it without being familiar with 
> the history is safe.  
> Can someone look into this?  Or at least make a note in the documentation 
> that speculation should not be used with large clusters?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19778) alais cannot use in group by

2017-02-28 Thread xukun (JIRA)
xukun created SPARK-19778:
-

 Summary: alais cannot use in group by
 Key: SPARK-19778
 URL: https://issues.apache.org/jira/browse/SPARK-19778
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: xukun


Spark SQL does not support “select key as key1 from src group by key1”.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17495) Hive hash implementation

2017-02-28 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889432#comment-15889432
 ] 

Reynold Xin commented on SPARK-17495:
-

Is it possible to figure out the hashing function based on file names? e.g. are 
the Spark files named differently from Hive files?


> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility so that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19754) Casting to int from a JSON-parsed float rounds instead of truncating

2017-02-28 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889419#comment-15889419
 ] 

Takeshi Yamamuro commented on SPARK-19754:
--

cc: [~hyukjin.kwon] What do you think of this?

> Casting to int from a JSON-parsed float rounds instead of truncating
> 
>
> Key: SPARK-19754
> URL: https://issues.apache.org/jira/browse/SPARK-19754
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Juan Pumarino
>Priority: Minor
>
> When retrieving a float value from a JSON document, and then casting it to an 
> integer, Hive simply truncates it, while Spark rounds up when the fractional 
> part is >= 0.5.
> In Hive, the following query returns {{1}}, whereas in a Spark shell the 
> result is {{2}}.
> {code}
> SELECT CAST(get_json_object('{"a": 1.6}', '$.a') AS INT)
> {code}
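
For context, the reported difference can be exercised from a Scala shell roughly as follows (where {{spark}} is the shell's SparkSession); the outputs in the comments restate the report above rather than an independent verification.
{code}
// Reported behavior: Hive truncates to 1, Spark returns 2 for the same expression.
val df = spark.sql("""SELECT CAST(get_json_object('{"a": 1.6}', '$.a') AS INT) AS a""")
df.show()
// Spark (per the report): prints 2
// Hive  (per the report): prints 1
{code}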



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14698) CREATE FUNCTION cloud not add function to hive metastore

2017-02-28 Thread Shawn Lavelle (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889416#comment-15889416
 ] 

Shawn Lavelle commented on SPARK-14698:
---

[~poseidon] Would you be willing (and still able) to upload the patch?

> CREATE FUNCTION cloud not add function to hive metastore
> 
>
> Key: SPARK-14698
> URL: https://issues.apache.org/jira/browse/SPARK-14698
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: spark1.6.1
>Reporter: poseidon
>  Labels: easyfix
>
> Build Spark 1.6.1 and run it with Hive version 1.2.1, with MySQL configured 
> as the metastore server. 
> Start a thrift server, then in beeline try to CREATE FUNCTION as a Hive SQL 
> UDF. 
> It turns out the FUNCTION cannot be added to the MySQL metastore, although 
> the function itself works fine.
> If you try to add it again, the thrift server throws an "already exists" 
> exception.
> [SPARK-10151][SQL] Support invocation of hive macro added an if condition to 
> runSqlHive, which executes CREATE FUNCTION in hiveexec; this caused the 
> problem.
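
To make the reproduction concrete, the sequence looks roughly like this in Scala (hypothetical UDF class and jar path; the same SQL can be issued from beeline against the thrift server, and {{sqlContext}} is assumed to be a HiveContext).
{code}
// Spark 1.6 / HiveContext; names and paths are placeholders.
sqlContext.sql(
  "CREATE FUNCTION my_lower AS 'com.example.hive.udf.MyLower' " +
  "USING JAR 'hdfs:///tmp/udfs/my-udfs.jar'")

sqlContext.sql("SELECT my_lower('ABC')").show()   // the function itself works

// Re-running the CREATE FUNCTION reportedly fails with an "already exists"
// error, even though the function never shows up in the MySQL metastore.
{code}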



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18769) Spark to be smarter about what the upper bound is and to restrict number of executor when dynamic allocation is enabled

2017-02-28 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889412#comment-15889412
 ] 

Marcelo Vanzin edited comment on SPARK-18769 at 3/1/17 3:04 AM:


bq. What do you mean by this?

I mean that, if I remember the code correctly, Spark's code will generate one 
container request per container it wants. So if the scheduler decides it wants 
50k containers, the Spark allocator will be sending 50k request objects per 
heartbeat. That's memory used in the Spark allocator code, and results in 
memory used in the RM that's receiving the message. And if the RM code is 
naive, it will look at all those requests too when trying to allocate resources 
(increasing latency for the reply).


was (Author: vanzin):
bq. What do you mean by this?

I mean that, if I remember the code correctly, Spark's code will generate one 
container request per container it wants. So if the schedulers decides it wants 
50k containers, the Spark allocator will be sending 50k request objects per 
heartbeat. That's memory used in the Spark allocator code, and results in 
memory used in the RM that's receiving the message. And if the RM code is 
naive, it will look at all those requests too.

>  Spark to be smarter about what the upper bound is and to restrict number of 
> executor when dynamic allocation is enabled
> 
>
> Key: SPARK-18769
> URL: https://issues.apache.org/jira/browse/SPARK-18769
> Project: Spark
>  Issue Type: New Feature
>Reporter: Neerja Khattar
>
> Currently, when dynamic allocation is enabled, the maximum number of 
> executors defaults to infinity, and Spark can create so many executors that 
> it exceeds the YARN NodeManager memory and vcore limits.
> There should be a check so that requests do not exceed the YARN resource 
> limits.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18769) Spark to be smarter about what the upper bound is and to restrict number of executor when dynamic allocation is enabled

2017-02-28 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889412#comment-15889412
 ] 

Marcelo Vanzin commented on SPARK-18769:


bq. What do you mean by this?

I mean that, if I remember the code correctly, Spark's code will generate one 
container request per container it wants. So if the scheduler decides it wants 
50k containers, the Spark allocator will be sending 50k request objects per 
heartbeat. That's memory used in the Spark allocator code, and results in 
memory used in the RM that's receiving the message. And if the RM code is 
naive, it will look at all those requests too.
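
As a practical mitigation until smarter bounds exist, the maximum can be capped explicitly; a sketch in Scala follows (the numbers are illustrative, not recommendations).
{code}
import org.apache.spark.SparkConf

// Cap dynamic allocation instead of relying on the unbounded default maximum.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")        // required by dynamic allocation
  .set("spark.dynamicAllocation.minExecutors", "2")    // illustrative
  .set("spark.dynamicAllocation.maxExecutors", "200")  // illustrative cap
{code}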

>  Spark to be smarter about what the upper bound is and to restrict number of 
> executor when dynamic allocation is enabled
> 
>
> Key: SPARK-18769
> URL: https://issues.apache.org/jira/browse/SPARK-18769
> Project: Spark
>  Issue Type: New Feature
>Reporter: Neerja Khattar
>
> Currently, when dynamic allocation is enabled, the maximum number of 
> executors defaults to infinity, and Spark can create so many executors that 
> it exceeds the YARN NodeManager memory and vcore limits.
> There should be a check so that requests do not exceed the YARN resource 
> limits.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19777) Scan runningTasksSet when check speculatable tasks in TaskSetManager.

2017-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19777:


Assignee: (was: Apache Spark)

> Scan runningTasksSet when check speculatable tasks in TaskSetManager.
> -
>
> Key: SPARK-19777
> URL: https://issues.apache.org/jira/browse/SPARK-19777
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: jin xing
>Priority: Minor
>
> When checking for speculatable tasks in TaskSetManager, only scan 
> runningTasksSet instead of scanning all taskInfos.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19777) Scan runningTasksSet when check speculatable tasks in TaskSetManager.

2017-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19777:


Assignee: Apache Spark

> Scan runningTasksSet when check speculatable tasks in TaskSetManager.
> -
>
> Key: SPARK-19777
> URL: https://issues.apache.org/jira/browse/SPARK-19777
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: jin xing
>Assignee: Apache Spark
>Priority: Minor
>
> When checking for speculatable tasks in TaskSetManager, only scan 
> runningTasksSet instead of scanning all taskInfos.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19777) Scan runningTasksSet when check speculatable tasks in TaskSetManager.

2017-02-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889401#comment-15889401
 ] 

Apache Spark commented on SPARK-19777:
--

User 'jinxing64' has created a pull request for this issue:
https://github.com/apache/spark/pull/17111

> Scan runningTasksSet when check speculatable tasks in TaskSetManager.
> -
>
> Key: SPARK-19777
> URL: https://issues.apache.org/jira/browse/SPARK-19777
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: jin xing
>Priority: Minor
>
> When checking for speculatable tasks in TaskSetManager, only scan 
> runningTasksSet instead of scanning all taskInfos.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19777) Scan runningTasksSet when check speculatable tasks in TaskSetManager.

2017-02-28 Thread jin xing (JIRA)
jin xing created SPARK-19777:


 Summary: Scan runningTasksSet when check speculatable tasks in 
TaskSetManager.
 Key: SPARK-19777
 URL: https://issues.apache.org/jira/browse/SPARK-19777
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.1.0
Reporter: jin xing
Priority: Minor


When checking for speculatable tasks in TaskSetManager, only scan 
runningTasksSet instead of scanning all taskInfos.
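
A toy sketch of the idea (not the real TaskSetManager code; names and fields are simplified): iterate only the running task IDs rather than every launched task when looking for speculation candidates.
{code}
// Simplified stand-ins for the scheduler's bookkeeping.
case class Info(taskId: Long, startTime: Long, finished: Boolean)

class SpeculationScan(taskInfos: Map[Long, Info], runningTasksSet: Set[Long]) {
  /** Running tasks that have exceeded `thresholdMs` at time `nowMs`. */
  def speculatableTasks(nowMs: Long, thresholdMs: Long): Seq[Long] =
    runningTasksSet.toSeq
      .flatMap(taskInfos.get)   // scan only the running tasks
      .collect { case i if !i.finished && nowMs - i.startTime > thresholdMs => i.taskId }
}
{code}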



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19635) Feature parity for Chi-square hypothesis testing in MLlib

2017-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19635:


Assignee: Apache Spark  (was: Joseph K. Bradley)

> Feature parity for Chi-square hypothesis testing in MLlib
> -
>
> Key: SPARK-19635
> URL: https://issues.apache.org/jira/browse/SPARK-19635
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Apache Spark
>
> This ticket tracks porting the functionality of 
> spark.mllib.Statistics.chiSqTest over to spark.ml.
> Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19635) Feature parity for Chi-square hypothesis testing in MLlib

2017-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19635:


Assignee: Joseph K. Bradley  (was: Apache Spark)

> Feature parity for Chi-square hypothesis testing in MLlib
> -
>
> Key: SPARK-19635
> URL: https://issues.apache.org/jira/browse/SPARK-19635
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Joseph K. Bradley
>
> This ticket tracks porting the functionality of 
> spark.mllib.Statistics.chiSqTest over to spark.ml.
> Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19635) Feature parity for Chi-square hypothesis testing in MLlib

2017-02-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889386#comment-15889386
 ] 

Apache Spark commented on SPARK-19635:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/17110

> Feature parity for Chi-square hypothesis testing in MLlib
> -
>
> Key: SPARK-19635
> URL: https://issues.apache.org/jira/browse/SPARK-19635
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Joseph K. Bradley
>
> This ticket tracks porting the functionality of 
> spark.mllib.Statistics.chiSqTest over to spark.ml.
> Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#
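
For reference, a quick sketch of the existing spark.mllib entry point being ported (the data are illustrative):
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Goodness-of-fit test of an observed frequency vector (against uniform by default).
val observed = Vectors.dense(4.0, 6.0, 5.0)
val result = Statistics.chiSqTest(observed)
println(result)   // statistic, degrees of freedom, p-value
{code}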



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18769) Spark to be smarter about what the upper bound is and to restrict number of executor when dynamic allocation is enabled

2017-02-28 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889373#comment-15889373
 ] 

Yuming Wang commented on SPARK-18769:
-

How about this approach:
https://github.com/apache/spark/pull/16819

>  Spark to be smarter about what the upper bound is and to restrict number of 
> executor when dynamic allocation is enabled
> 
>
> Key: SPARK-18769
> URL: https://issues.apache.org/jira/browse/SPARK-18769
> Project: Spark
>  Issue Type: New Feature
>Reporter: Neerja Khattar
>
> Currently, when dynamic allocation is enabled, the maximum number of 
> executors defaults to infinity, and Spark can create so many executors that 
> it exceeds the YARN NodeManager memory and vcore limits.
> There should be a check so that requests do not exceed the YARN resource 
> limits.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19764) Executors hang with supposedly running task that are really finished.

2017-02-28 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-19764:

Attachment: netty-6153.jpg

something like this:
!netty-6153.jpg!

> Executors hang with supposedly running task that are really finished.
> -
>
> Key: SPARK-19764
> URL: https://issues.apache.org/jira/browse/SPARK-19764
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.0.2
> Environment: Ubuntu 16.04 LTS
> OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13)
> Spark 2.0.2 - Spark Cluster Manager
>Reporter: Ari Gesher
> Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg
>
>
> We've come across a job that won't finish.  Running on a six-node cluster, 
> each of the executors ends up with 5-7 tasks that are never marked as 
> completed.
> Here's an excerpt from the web UI:
> ||Index  ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch 
> Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result 
> Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read 
> Size / Records||Errors||
> |105  | 1131  | 0 | SUCCESS   |PROCESS_LOCAL  |4 / 172.31.24.171 |
> 2017/02/27 22:51:36 |   1.9 min |   9 ms |  4 ms |  0.7 s | 2 ms|   6 ms| 
>   384.1 MB|   90.3 MB / 572   | |
> |106| 1168|   0|  RUNNING |ANY|   2 / 172.31.16.112|  2017/02/27 
> 22:53:25|6.5 h   |0 ms|  0 ms|   1 s |0 ms|  0 ms|   |384.1 MB   
> |98.7 MB / 624 | |  
> However, the Executor reports the task as finished: 
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> As does the driver log:
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> Full log from this executor and the {{stderr}} from 
> {{app-20170227223614-0001/2/stderr}} attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19740) Spark executor always runs as root when running on mesos

2017-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19740:


Assignee: (was: Apache Spark)

> Spark executor always runs as root when running on mesos
> 
>
> Key: SPARK-19740
> URL: https://issues.apache.org/jira/browse/SPARK-19740
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Ji Yan
>Priority: Minor
>
> When running Spark on Mesos with the docker containerizer, the Spark 
> executors are always launched with a 'docker run' command that does not 
> specify the --user option, so the executors always run as root. Mesos has a 
> way to support arbitrary parameters, and Spark could use that to expose 
> setting the user.
> Background on Mesos arbitrary parameter support: 
> https://issues.apache.org/jira/browse/MESOS-1816



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19740) Spark executor always runs as root when running on mesos

2017-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19740:


Assignee: Apache Spark

> Spark executor always runs as root when running on mesos
> 
>
> Key: SPARK-19740
> URL: https://issues.apache.org/jira/browse/SPARK-19740
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Ji Yan
>Assignee: Apache Spark
>Priority: Minor
>
> When running Spark on Mesos with the docker containerizer, the Spark 
> executors are always launched with a 'docker run' command that does not 
> specify the --user option, so the executors always run as root. Mesos has a 
> way to support arbitrary parameters, and Spark could use that to expose 
> setting the user.
> Background on Mesos arbitrary parameter support: 
> https://issues.apache.org/jira/browse/MESOS-1816



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19740) Spark executor always runs as root when running on mesos

2017-02-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889331#comment-15889331
 ] 

Apache Spark commented on SPARK-19740:
--

User 'yanji84' has created a pull request for this issue:
https://github.com/apache/spark/pull/17109

> Spark executor always runs as root when running on mesos
> 
>
> Key: SPARK-19740
> URL: https://issues.apache.org/jira/browse/SPARK-19740
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Ji Yan
>Priority: Minor
>
> When running Spark on Mesos with the docker containerizer, the Spark 
> executors are always launched with a 'docker run' command that does not 
> specify the --user option, so the executors always run as root. Mesos has a 
> way to support arbitrary parameters, and Spark could use that to expose 
> setting the user.
> Background on Mesos arbitrary parameter support: 
> https://issues.apache.org/jira/browse/MESOS-1816



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19211) Explicitly prevent Insert into View or Create View As Insert

2017-02-28 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889270#comment-15889270
 ] 

Jiang Xingbo commented on SPARK-19211:
--

I've been working on it this week and perhaps I'll submit a PR later today. :)

> Explicitly prevent Insert into View or Create View As Insert
> 
>
> Key: SPARK-19211
> URL: https://issues.apache.org/jira/browse/SPARK-19211
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Jiang Xingbo
>
> Currently we don't explicitly forbid the following behaviors:
> 1. The statement CREATE VIEW AS INSERT INTO throws the following exception 
> from SQLBuilder:
> `java.lang.UnsupportedOperationException: unsupported plan InsertIntoTable 
> MetastoreRelation default, tbl, false, false`;
> 2. The statement INSERT INTO view VALUES throws the following exception from 
> checkAnalysis:
> `Error in query: Inserting into an RDD-based table is not allowed.;;`
> We should check for these behaviors earlier and explicitly prevent them.
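
To make the two cases concrete, a sketch of the statements that should be rejected up front with a clear analysis-time error (Scala shell; table and view names are hypothetical, and {{spark}} is the session):
{code}
// Hypothetical names; the two statements flagged below are the ones the ticket
// wants rejected explicitly instead of failing with the internal errors quoted above.
spark.sql("CREATE TABLE tbl (id INT) USING parquet")
spark.sql("CREATE VIEW v AS SELECT id FROM tbl")

spark.sql("INSERT INTO v VALUES (1)")                      // insert into a view
spark.sql("CREATE VIEW v2 AS INSERT INTO tbl VALUES (2)")  // create view as insert
{code}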



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19636) Feature parity for correlation statistics in MLlib

2017-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19636:


Assignee: Apache Spark  (was: Tim Hunter)

> Feature parity for correlation statistics in MLlib
> --
>
> Key: SPARK-19636
> URL: https://issues.apache.org/jira/browse/SPARK-19636
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Apache Spark
>
> This ticket tracks porting the functionality of spark.mllib.Statistics.corr() 
> over to spark.ml.
> Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19636) Feature parity for correlation statistics in MLlib

2017-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19636:


Assignee: Tim Hunter  (was: Apache Spark)

> Feature parity for correlation statistics in MLlib
> --
>
> Key: SPARK-19636
> URL: https://issues.apache.org/jira/browse/SPARK-19636
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Tim Hunter
>
> This ticket tracks porting the functionality of spark.mllib.Statistics.corr() 
> over to spark.ml.
> Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19636) Feature parity for correlation statistics in MLlib

2017-02-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889220#comment-15889220
 ] 

Apache Spark commented on SPARK-19636:
--

User 'thunterdb' has created a pull request for this issue:
https://github.com/apache/spark/pull/17108
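
For context, a minimal sketch of the existing spark.mllib API that this ticket ports (data is made up; assumes a SparkContext named {{sc}}):
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Toy data: three observations of two features.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(2.0, 4.1),
  Vectors.dense(3.0, 6.2)))

// Pearson correlation matrix between the columns (use "spearman" for rank correlation).
val corrMatrix = Statistics.corr(rows, "pearson")
println(corrMatrix)
{code}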

> Feature parity for correlation statistics in MLlib
> --
>
> Key: SPARK-19636
> URL: https://issues.apache.org/jira/browse/SPARK-19636
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Tim Hunter
>
> This ticket tracks porting the functionality of spark.mllib.Statistics.corr() 
> over to spark.ml.
> Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19776) Is the JavaKafkaWordCount correct on Spark version 2.1

2017-02-28 Thread Russell Abedin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Russell Abedin updated SPARK-19776:
---
Summary: Is the JavaKafkaWordCount correct on Spark version 2.1  (was: 
JavaKafkaWordCount calls createStream on version 2.1)

> Is the JavaKafkaWordCount correct on Spark version 2.1
> --
>
> Key: SPARK-19776
> URL: https://issues.apache.org/jira/browse/SPARK-19776
> Project: Spark
>  Issue Type: Question
>  Components: Examples, ML
>Affects Versions: 2.1.0
>Reporter: Russell Abedin
>
> My question is: is 
> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
>  correct?
> I'm pretty new to both Spark and Java.  I wanted to find an example of Spark 
> Streaming using Java, streaming from Kafka. The JavaKafkaWordCount at 
> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
>  looked to be perfect.
> However, when I tried running it, I found a couple of issues that I needed to 
> overcome.
> 1. This line was unnecessary:
> {code}
> StreamingExamples.setStreamingLogLevels();
> {code}
> Having this line in there (and the associated import) caused me to go looking 
> for a dependency, spark-examples_2.10, which is of no real use to me.
> 2. After running it, this line: 
> {code}
> JavaPairReceiverInputDStream messages = 
> KafkaUtils.createStream(jssc, args[0], args[1], topicMap);
> {code}
> Appeared to throw an error around logging:
> {code}
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/apache/spark/Logging
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
> at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at 
> org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:91
> at 
> org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:66
> at 
> org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:11
> at 
> org.apache.spark.streaming.kafka.KafkaUtils.createStream(KafkaUtils.scala)
> at 
> main.java.com.cm.JavaKafkaWordCount.main(JavaKafkaWordCount.java:72)
> {code}
> To get around this, I found that the code sample in 
> https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html 
> helped me to come up with the right lines to see streaming from Kafka in 
> action. Specifically this called createDirectStream instead of createStream.
> So is the example in 
> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
>  broken, or is there something I could have done differently to get that example 
> working?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19776) JavaKafkaWordCount calls createStream on version 2.1

2017-02-28 Thread Russell Abedin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Russell Abedin updated SPARK-19776:
---
Affects Version/s: (was: 1.6.1)
   (was: 1.5.2)
   (was: 1.4.1)
   (was: 2.0.0)
   2.1.0

> JavaKafkaWordCount calls createStream on version 2.1
> 
>
> Key: SPARK-19776
> URL: https://issues.apache.org/jira/browse/SPARK-19776
> Project: Spark
>  Issue Type: Question
>  Components: Examples, ML
>Affects Versions: 2.1.0
>Reporter: Russell Abedin
>
> My question is: is 
> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
>  correct?
> I'm pretty new to both Spark and Java.  I wanted to find an example of Spark 
> Streaming using Java, streaming from Kafka. The JavaKafkaWordCount at 
> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
>  looked to be perfect.
> However, when I tried running it, I found a couple of issues that I needed to 
> overcome.
> 1. This line was unnecessary:
> {code}
> StreamingExamples.setStreamingLogLevels();
> {code}
> Having this line in there (and the associated import) caused me to go looking 
> for a dependency, spark-examples_2.10, which is of no real use to me.
> 2. After running it, this line: 
> {code}
> JavaPairReceiverInputDStream messages = 
> KafkaUtils.createStream(jssc, args[0], args[1], topicMap);
> {code}
> Appeared to throw an error around logging:
> {code}
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/apache/spark/Logging
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
> at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at 
> org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:91
> at 
> org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:66
> at 
> org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:11
> at 
> org.apache.spark.streaming.kafka.KafkaUtils.createStream(KafkaUtils.scala)
> at 
> main.java.com.cm.JavaKafkaWordCount.main(JavaKafkaWordCount.java:72)
> {code}
> To get around this, I found that the code sample in 
> https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html 
> helped me to come up with the right lines to see streaming from Kafka in 
> action. Specifically this called createDirectStream instead of createStream.
> So is the example in 
> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
>  broken, or is there something I could have done differently to get that example 
> working?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19776) JavaKafkaWordCount calls createStream on version 2.1

2017-02-28 Thread Russell Abedin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Russell Abedin updated SPARK-19776:
---
Description: 
My question is: is 
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
 correct?

I'm pretty new to both Spark and Java.  I wanted to find an example of Spark 
Streaming using Java, streaming from Kafka. The JavaKafkaWordCount at 
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
 looked to be perfect.

However, when I tried running it, I found a couple of issues that I needed to 
overcome.

1. This line was unnecessary:
{code}
StreamingExamples.setStreamingLogLevels();
{code}

Having this line in there (and the associated import) caused me to go looking 
for a dependency, spark-examples_2.10, which is of no real use to me.

2. After running it, this line: 
{code}
JavaPairReceiverInputDStream messages = 
KafkaUtils.createStream(jssc, args[0], args[1], topicMap);
{code}

Appeared to throw an error around logging:
{code}
Exception in thread "main" java.lang.NoClassDefFoundError: 
org/apache/spark/Logging
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at 
org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:91
at 
org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:66
at 
org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:11
at 
org.apache.spark.streaming.kafka.KafkaUtils.createStream(KafkaUtils.scala)
at main.java.com.cm.JavaKafkaWordCount.main(JavaKafkaWordCount.java:72)
{code}

To get around this, I found that the code sample in 
https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html 
helped me to come up with the right lines to see streaming from Kafka in 
action. Specifically this called createDirectStream instead of createStream.

So is the example in 
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
 broken, or is there something I could have done differently to get that example 
working?
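
For reference, this is roughly what I ended up with by following that guide, shown as a minimal Scala sketch (the broker address, group id and topic are placeholders; the guide's Java variant is equivalent):
{code}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// Placeholder connection settings; ssc is an existing StreamingContext.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "word-count-example",
  "auto.offset.reset" -> "latest")

// createDirectStream (0-10 integration) replaces the old receiver-based createStream.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Array("my-topic"), kafkaParams))

val words = stream.flatMap(_.value.split(" "))
{code}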

  was:
This bug is reported by
http://apache-spark-developers-list.1001551.n3.nabble.com/Creation-of-SparkML-Estimators-in-Java-broken-td17710.html
 .




> JavaKafkaWordCount calls createStream on version 2.1
> 
>
> Key: SPARK-19776
> URL: https://issues.apache.org/jira/browse/SPARK-19776
> Project: Spark
>  Issue Type: Question
>  Components: Examples, ML
>Affects Versions: 1.4.1, 1.5.2, 1.6.1, 2.0.0
>Reporter: Russell Abedin
>
> My question is: is 
> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
>  correct?
> I'm pretty new to both Spark and Java.  I wanted to find an example of Spark 
> Streaming using Java, streaming from Kafka. The JavaKafkaWordCount at 
> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
>  looked to be perfect.
> However, when I tried running it, I found a couple of issues that I needed to 
> overcome.
> 1. This line was unnecessary:
> {code}
> StreamingExamples.setStreamingLogLevels();
> {code}
> Having this line in there (and the associated import) caused me to go looking 
> for a dependency, spark-examples_2.10, which is of no real use to me.
> 2. After running it, this line: 
> {code}
> JavaPairReceiverInputDStream messages = 
> KafkaUtils.createStream(jssc, args[0], args[1], topicMap);
> {code}
> Appeared to throw an error around logging:
> {code}
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/apache/spark/Logging
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
> at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
> 

[jira] [Comment Edited] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)

2017-02-28 Thread Mingjie Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889180#comment-15889180
 ] 

Mingjie Tang edited comment on SPARK-19771 at 3/1/17 12:30 AM:
---

If we follow the AND-OR framework, one optimization is possible here.

Suppose we use BucketedRandomProjectionLSH with setNumHashTables(3) and 
setNumHashFunctions(5).

For one input vector in sparse format, e.g. [(6,[0,1,2],[1.0,1.0,1.0])], 
the computed hash vectors are
WrappedArray([-1.0,-1.0,-1.0,0.0,0.0], [-1.0,0.0,0.0,-1.0,-1.0], 
[-1.0,-1.0,-1.0,-1.0,0.0])

For the similarity join, we map this computed vector into
|datasetA|entry|   hashValue|
|[0,(6,[0,1,2],[1|0|[-1.0,-1.0,-1.0,0...|
|[0,(6,[0,1,2],[1|1|[-1.0,0.0,0.0,-1|
|[0,(6,[0,1,2],[1|2|[-1.0,-1.0,-1.0,-...|

Then, following the AND-OR principle, we use the entry and hashValue columns for 
an equi-join with the other table. Looking at the generated SQL, it needs to 
iterate through the hashValue vector for the equality join, so the computation 
and memory cost is very high. For example, if we use a nested loop to compare 
two vectors of length NumHashFunctions, the computation cost is 
(NumHashFunctions * NumHashFunctions) and the memory overhead is (N * 
NumHashFunctions). 

Therefore, we can apply one optimization here: we do not need to store the hash 
value vector for each hash table, like [-1.0,-1.0,-1.0,0.0,0.0]; instead, we can 
use a simple hash function to collapse it. 

Suppose h_l = {h_0, h_1, ..., h_k}, where k is the value of setNumHashFunctions. 
Then the combined hash function is 
 H(h_l) = sum_{i=0...k} (h_i * r_i) mod prime,
where each r_i is an integer and the prime can be set to 2^32-5 to reduce hash 
collisions.
Then we only need to store the single hash value H(h_l) for the AND operation.  

A similar approach can also be applied to approxNearestNeighbors search. 

If the multi-probe approach does not need to read the hash value of an individual 
hash function (e.g. the h_l mentioned above), we can apply this H(h_l) to the 
preprocessed data to save memory. I will double-check the multi-probe approach 
and see whether it works, and then I can submit a patch to improve the current 
approach. Note that this trick of hashing to avoid vector-to-vector comparison is 
widely used elsewhere, for example in the LSH implementation by Andoni. 
 
[~yunn] [~mlnick] 
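
A minimal sketch of the combining step in plain Scala (the r_i seeds and the prime below are illustrative only, not what MLlib uses):
{code}
// Collapse the k hash values of one hash table into a single bucket id:
// H(h_l) = sum_i (h_i * r_i) mod prime.
val prime = (1L << 32) - 5
val rands = Array(3L, 7L, 11L, 13L, 17L)                 // one illustrative r_i per hash function

def combine(hashes: Array[Double]): Long = {
  val s = hashes.zip(rands).map { case (h, r) => h.toLong * r }.sum
  ((s % prime) + prime) % prime                          // keep the result non-negative
}

combine(Array(-1.0, -1.0, -1.0, 0.0, 0.0))               // one Long instead of a length-5 vector
{code}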



was (Author: merlintang):
If we follow the AND-OR framework, one optimization is possible here.

Suppose we use BucketedRandomProjectionLSH with setNumHashTables(3) and 
setNumHashFunctions(5).

For one input vector in sparse format, e.g. [(6,[0,1,2],[1.0,1.0,1.0])], 
the computed hash vectors are
WrappedArray([-1.0,-1.0,-1.0,0.0,0.0], [-1.0,0.0,0.0,-1.0,-1.0], 
[-1.0,-1.0,-1.0,-1.0,0.0])

For the similarity join, we map this computed vector into
|datasetA|entry|   hashValue|
|[0,(6,[0,1,2],[1|0|[-1.0,-1.0,-1.0,0...|
|[0,(6,[0,1,2],[1|1|[-1.0,0.0,0.0,-1|
|[0,(6,[0,1,2],[1|2|[-1.0,-1.0,-1.0,-...|

Then, following the AND-OR principle, we use the entry and hashValue columns for 
an equi-join with the other table. Looking at the generated SQL, it needs to 
iterate through the hashValue vector for the equality join, so the computation 
and memory cost is very high. For example, if we use a nested loop to compare 
two vectors of length NumHashFunctions, the computation cost is 
(NumHashFunctions * NumHashFunctions) and the memory overhead is (N * 
NumHashFunctions). 

Therefore, we can apply one optimization here: we do not need to store the hash 
value vector for each hash table, like [-1.0,-1.0,-1.0,0.0,0.0]; instead, we can 
use a simple hash function to collapse it. 

Suppose h_l = {h_0, h_1, ..., h_k}, where k is the value of setNumHashFunctions. 
Then the combined hash function is 
 H(h_l) = sum_{i=0...k} (h_i * r_i) mod prime,
where each r_i is an integer and the prime can be set to 2^32-5 to reduce hash 
collisions.
Then we only need to store the single hash value H(h_l) for the AND operation.  

A similar approach can also be applied to approxNearestNeighbors search. 

If the multi-probe approach does not need to read the hash value of an individual 
hash function (e.g. the h_l mentioned above), we can apply this H(h_l) to the 
preprocessed data to save memory. I will double-check the multi-probe approach 
and see whether it works, and then I can submit a patch to improve the current 
approach. Note that this trick of hashing to avoid vector-to-vector comparison is 
widely used elsewhere, for example in the LSH implementation by Andoni. 
 
[~diefunction] [~mlnick] 


> Support OR-AND 

[jira] [Assigned] (SPARK-19635) Feature parity for Chi-square hypothesis testing in MLlib

2017-02-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-19635:
-

Assignee: Joseph K. Bradley

> Feature parity for Chi-square hypothesis testing in MLlib
> -
>
> Key: SPARK-19635
> URL: https://issues.apache.org/jira/browse/SPARK-19635
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Joseph K. Bradley
>
> This ticket tracks porting the functionality of 
> spark.mllib.Statistics.chiSqTest over to spark.ml.
> Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19635) Feature parity for Chi-square hypothesis testing in MLlib

2017-02-28 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889190#comment-15889190
 ] 

Joseph K. Bradley commented on SPARK-19635:
---

That PR for trees looks pretty different.  This task is just for wrapping 
chiSqTest with a DataFrame API.
I'm going to take this one.
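
For reference, a minimal sketch of the spark.mllib API being wrapped (toy counts):
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Goodness-of-fit test of observed counts against a uniform expected distribution.
val observed = Vectors.dense(4.0, 6.0, 5.0)
val result = Statistics.chiSqTest(observed)
println(result.pValue)
{code}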

> Feature parity for Chi-square hypothesis testing in MLlib
> -
>
> Key: SPARK-19635
> URL: https://issues.apache.org/jira/browse/SPARK-19635
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.Statistics.chiSqTest over to spark.ml.
> Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-02-28 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889181#comment-15889181
 ] 

Joseph K. Bradley commented on SPARK-19634:
---

I'll assign this to [~timhunter] given the time pressure for 2.2

> Feature parity for descriptive statistics in MLlib
> --
>
> Key: SPARK-19634
> URL: https://issues.apache.org/jira/browse/SPARK-19634
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208 . Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)

2017-02-28 Thread Mingjie Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889180#comment-15889180
 ] 

Mingjie Tang commented on SPARK-19771:
--

If we follow the AND-OR framework, one optimization is possible here.

Suppose we use BucketedRandomProjectionLSH with setNumHashTables(3) and 
setNumHashFunctions(5).

For one input vector in sparse format, e.g. [(6,[0,1,2],[1.0,1.0,1.0])], 
the computed hash vectors are
WrappedArray([-1.0,-1.0,-1.0,0.0,0.0], [-1.0,0.0,0.0,-1.0,-1.0], 
[-1.0,-1.0,-1.0,-1.0,0.0])

For the similarity join, we map this computed vector into
|datasetA|entry|   hashValue|
|[0,(6,[0,1,2],[1|0|[-1.0,-1.0,-1.0,0...|
|[0,(6,[0,1,2],[1|1|[-1.0,0.0,0.0,-1|
|[0,(6,[0,1,2],[1|2|[-1.0,-1.0,-1.0,-...|

Then, following the AND-OR principle, we use the entry and hashValue columns for 
an equi-join with the other table. Looking at the generated SQL, it needs to 
iterate through the hashValue vector for the equality join, so the computation 
and memory cost is very high. For example, if we use a nested loop to compare 
two vectors of length NumHashFunctions, the computation cost is 
(NumHashFunctions * NumHashFunctions) and the memory overhead is (N * 
NumHashFunctions). 

Therefore, we can apply one optimization here: we do not need to store the hash 
value vector for each hash table, like [-1.0,-1.0,-1.0,0.0,0.0]; instead, we can 
use a simple hash function to collapse it. 

Suppose h_l = {h_0, h_1, ..., h_k}, where k is the value of setNumHashFunctions. 
Then the combined hash function is 
 H(h_l) = sum_{i=0...k} (h_i * r_i) mod prime,
where each r_i is an integer and the prime can be set to 2^32-5 to reduce hash 
collisions.
Then we only need to store the single hash value H(h_l) for the AND operation.  

A similar approach can also be applied to approxNearestNeighbors search. 

If the multi-probe approach does not need to read the hash value of an individual 
hash function (e.g. the h_l mentioned above), we can apply this H(h_l) to the 
preprocessed data to save memory. I will double-check the multi-probe approach 
and see whether it works, and then I can submit a patch to improve the current 
approach. Note that this trick of hashing to avoid vector-to-vector comparison is 
widely used elsewhere, for example in the LSH implementation by Andoni. 
 
[~diefunction] [~mlnick] 


> Support OR-AND amplification in Locality Sensitive Hashing (LSH)
> 
>
> Key: SPARK-19771
> URL: https://issues.apache.org/jira/browse/SPARK-19771
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Yun Ni
>
> The current LSH implementation only supports AND-OR amplification. We need to 
> discuss the following questions before we goes to implementations:
> (1) Whether we should support OR-AND amplification
> (2) What API changes we need for OR-AND amplification
> (3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-02-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-19634:
-

Assignee: Timothy Hunter

> Feature parity for descriptive statistics in MLlib
> --
>
> Key: SPARK-19634
> URL: https://issues.apache.org/jira/browse/SPARK-19634
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208 . Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19776) JavaKafkaWordCount calls createStream on version 2.1

2017-02-28 Thread Russell Abedin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Russell Abedin updated SPARK-19776:
---
Issue Type: Question  (was: Bug)

> JavaKafkaWordCount calls createStream on version 2.1
> 
>
> Key: SPARK-19776
> URL: https://issues.apache.org/jira/browse/SPARK-19776
> Project: Spark
>  Issue Type: Question
>  Components: Examples, ML
>Affects Versions: 1.4.1, 1.5.2, 1.6.1, 2.0.0
>Reporter: Russell Abedin
>
> This bug is reported by
> http://apache-spark-developers-list.1001551.n3.nabble.com/Creation-of-SparkML-Estimators-in-Java-broken-td17710.html
>  .



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19776) JavaKafkaWordCount calls createStream on version 2.1

2017-02-28 Thread Russell Abedin (JIRA)
Russell Abedin created SPARK-19776:
--

 Summary: JavaKafkaWordCount calls createStream on version 2.1
 Key: SPARK-19776
 URL: https://issues.apache.org/jira/browse/SPARK-19776
 Project: Spark
  Issue Type: Bug
  Components: Examples, ML
Affects Versions: 1.4.1, 1.5.2, 1.6.1, 2.0.0
Reporter: Russell Abedin


This bug is reported by
http://apache-spark-developers-list.1001551.n3.nabble.com/Creation-of-SparkML-Estimators-in-Java-broken-td17710.html
 .





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19382) Test sparse vectors in LinearSVCSuite

2017-02-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19382:
--
Shepherd: Joseph K. Bradley

> Test sparse vectors in LinearSVCSuite
> -
>
> Key: SPARK-19382
> URL: https://issues.apache.org/jira/browse/SPARK-19382
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Currently, LinearSVCSuite does not test sparse vectors.  We should.  I 
> recommend that generateSVMInput be modified to create a mix of dense and 
> sparse vectors, rather than adding an additional test.
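
A small sketch of the kind of dense/sparse mix meant here (illustrative only; generateSVMInput itself is a helper in the test code):
{code}
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Alternate between dense and sparse encodings of the generated feature values.
def mixedVector(values: Array[Double], i: Int): Vector =
  if (i % 2 == 0) {
    Vectors.dense(values)
  } else {
    val nonZero = values.zipWithIndex.filter(_._1 != 0.0)
    Vectors.sparse(values.length, nonZero.map(_._2), nonZero.map(_._1))
  }
{code}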



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14503) spark.ml Scala API for FPGrowth

2017-02-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-14503.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 15415
[https://github.com/apache/spark/pull/15415]
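
For anyone landing here, a rough sketch of the resulting DataFrame-based API (parameter names as I understand the merged PR, so treat them as provisional; data is made up and assumes a SparkSession named {{spark}}):
{code}
import org.apache.spark.ml.fpm.FPGrowth
import spark.implicits._

// Toy transactions, one array of items per row.
val transactions = Seq(
  Array("a", "b", "c"),
  Array("a", "b"),
  Array("b", "c")).toDF("items")

val model = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.5)
  .setMinConfidence(0.6)
  .fit(transactions)

model.freqItemsets.show()       // frequent itemsets with their counts
model.associationRules.show()   // rules derived from the frequent itemsets
{code}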

> spark.ml Scala API for FPGrowth
> ---
>
> Key: SPARK-14503
> URL: https://issues.apache.org/jira/browse/SPARK-14503
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
> Fix For: 2.2.0
>
>
> This task is the first port of spark.mllib.fpm functionality to spark.ml 
> (Scala).
> This will require a brief design doc to confirm a reasonable DataFrame-based 
> API, with details for this class.  The doc could also look ahead to the other 
> fpm classes, especially if their API decisions will affect FPGrowth.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18579) spark-csv strips whitespace (pyspark)

2017-02-28 Thread Adrian Bridgett (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889120#comment-15889120
 ] 

Adrian Bridgett commented on SPARK-18579:
-

Yep, that's right. Sorry - not sure why I didn't reply to your initial comment!

> spark-csv strips whitespace (pyspark) 
> --
>
> Key: SPARK-18579
> URL: https://issues.apache.org/jira/browse/SPARK-18579
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.2
>Reporter: Adrian Bridgett
>Priority: Minor
>
> ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace are supported on the CSV 
> reader (and default to false).
> However, these are not supported options on the CSV writer, so the library 
> defaults take effect, which strips the whitespace.
> I think it would make the most sense if the writer semantics matched the 
> reader (and did not alter the data); however, this would be a change in 
> behaviour.  In any case it'd be great to have the _option_ to strip or not.
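
For reference, a minimal sketch of the asymmetry being described (the path is a placeholder; the stripping on write is the reported 2.0.x behaviour, not something introduced here):
{code}
import spark.implicits._

// A value with deliberate leading/trailing spaces.
val df = Seq(("  padded  ", 1)).toDF("text", "n")

// Writer side: no ignore*WhiteSpace options in 2.0.x, so the spaces are stripped on write.
df.write.csv("/tmp/whitespace-test")

// Reader side: the options exist here (and default to false), but by now the
// spaces are already gone from the files on disk.
spark.read
  .option("ignoreLeadingWhiteSpace", "false")
  .option("ignoreTrailingWhiteSpace", "false")
  .csv("/tmp/whitespace-test")
  .show(false)
{code}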



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18579) spark-csv strips whitespace (pyspark)

2017-02-28 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889117#comment-15889117
 ] 

Hyukjin Kwon commented on SPARK-18579:
--

Oh, I overlooked that. You mean the whitespace is always stripped when writing out 
via the CSV datasource?

> spark-csv strips whitespace (pyspark) 
> --
>
> Key: SPARK-18579
> URL: https://issues.apache.org/jira/browse/SPARK-18579
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.2
>Reporter: Adrian Bridgett
>Priority: Minor
>
> ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace are supported on CSV 
> reader (and defaults to false).
> However these are not supported options on the CSV writer and so the library 
> defaults take place which strips the whitespace.
> I think it would make the most sense if the writer semantics matched the 
> reader (and did not alter the data) however this would be a change in 
> behaviour.  In any case it'd be great to have the _option_ to strip or not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19713) saveAsTable

2017-02-28 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889114#comment-15889114
 ] 

Hyukjin Kwon commented on SPARK-19713:
--

Hi [~balaramraju] Could you update the title?

> saveAsTable
> ---
>
> Key: SPARK-19713
> URL: https://issues.apache.org/jira/browse/SPARK-19713
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Balaram R Gadiraju
>
> Hi,
> I just observed this with dataframe.saveAsTable("table") (in old versions) 
> and dataframe.write.saveAsTable("table") (in the newer versions).
> When using the method “df3.saveAsTable("brokentable")” in 
> Scala code, this creates a folder in HDFS but doesn't tell the Hive metastore 
> that it plans to create the table. So if anything goes wrong in between, the 
> folder still exists and Hive is not aware of the folder creation. This 
> blocks users from creating the table “brokentable” because the folder already 
> exists; we can remove the folder using “hadoop fs -rmr 
> /data/hive/databases/testdb.db/brokentable”.  So below is a workaround 
> that lets you continue development work.
> Current code:
> val df3 = sqlContext.sql("select * from testtable")
> df3.saveAsTable("brokentable")
> The workaround:
> Registering the DataFrame as a temporary table and then using a SQL command to 
> load the data resolves the issue, e.g.:
> sqlContext.sql("select * from testtable").registerTempTable("df3")
> sqlContext.sql("CREATE TABLE brokentable AS SELECT * FROM df3")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19729) Strange behaviour with reading csv with schema into dataframe

2017-02-28 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19729.
--
Resolution: Invalid

I am resolving this as {{Invalid}}. Please reopen this if I was wrong to update 
the JIRA description; I could not see what the problem is here.

> Strange behaviour with reading csv with schema into dataframe
> -
>
> Key: SPARK-19729
> URL: https://issues.apache.org/jira/browse/SPARK-19729
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.0.1
>Reporter: Mazen Melouk
>
> I have the following schema
> [{first,string_type,false}
> ,{second,string_type,false}
> ,{third,string_type,false}
> ,{fourth,string_type,false}]
> Example lines:
> var1,var2,,
> When accessing the row, I get the following:
> row.size =4
> row.fieldIndex(third_string)=2
> row.getAs(third_string)=var2
> row.get(2)=var2
> print(row)= var1,var2
> Any idea why the null values are missing?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16846) read.csv() option: "inferSchema" don't work

2017-02-28 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-16846.
--
Resolution: Not A Problem

If the schema is given, it does not infer the schema.
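
In other words, to let inference run, drop the explicit schema, e.g. (a minimal sketch reusing the reported path):
{code}
// With .schema(...) present, inferSchema is ignored; without it, types are inferred.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/Users/ss/Documents/traindata/traindataAllNumber.csv")
df.printSchema()
{code}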

> read.csv()  option:  "inferSchema" don't work
> -
>
> Key: SPARK-16846
> URL: https://issues.apache.org/jira/browse/SPARK-16846
> Project: Spark
>  Issue Type: Bug
>Reporter: hejie
>
> I use the code below to read a file and get a dataframe. When the column count is 
> only 20, the inferSchema parameter works well. But when the count is up to 
> 400, it doesn't work, and I have to tell it the schema manually. The code is:
>  val df = 
> spark.read.schema(schema).options(Map("header"->"true","quote"->",","inferSchema"->"true")).csv("/Users/ss/Documents/traindata/traindataAllNumber.csv")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17495) Hive hash implementation

2017-02-28 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889090#comment-15889090
 ] 

Tejas Patil commented on SPARK-17495:
-

[~rxin]: 

>> 1. On the read side we shouldn't care which hash function to use. All we 
>> need to know is that the data is hash partitioned by some hash function, and 
>> that should be sufficient to remove the shuffle needed in aggregation or 
>> join.

For joins, if one side is pre-hashed (due to bucketing) and the other is not, then 
the non-hashed side needs to be shuffled with the _same_ hashing function as the 
pre-hashed one; otherwise the output would be wrong. Alternatively we can shuffle 
both sides, but that won't take advantage of bucketing. 
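
A toy illustration of the point in plain Scala (made-up hash functions, not Spark's Murmur3 or Hive's implementation):
{code}
// Two different hash functions assign the same key to different partitions,
// so rows that should meet in one partition for the join may not.
val numPartitions = 8

def hashA(key: String): Int = key.hashCode                  // stand-in for the "bucketing" hash
def hashB(key: String): Int = key.foldLeft(17)(31 * _ + _)  // stand-in for a different hash

def partition(h: Int): Int = ((h % numPartitions) + numPartitions) % numPartitions

val key = "user_42"
println(partition(hashA(key)))  // partition chosen by the pre-hashed (bucketed) side
println(partition(hashB(key)))  // partition chosen if the other side is shuffled differently
{code}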

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility to that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19373) Mesos implementation of spark.scheduler.minRegisteredResourcesRatio looks at acquired cores rather than registerd cores

2017-02-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19373.
---
   Resolution: Fixed
 Assignee: Michael Gummelt
Fix Version/s: 2.2.0

> Mesos implementation of spark.scheduler.minRegisteredResourcesRatio looks at 
> acquired cores rather than registerd cores
> ---
>
> Key: SPARK-19373
> URL: https://issues.apache.org/jira/browse/SPARK-19373
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.3, 2.0.2, 2.1.0
>Reporter: Michael Gummelt
>Assignee: Michael Gummelt
> Fix For: 2.2.0
>
>
> We're currently using `totalCoresAcquired` to account for registered 
> resources, which is incorrect.  That variable measures the number of cores 
> the scheduler has accepted.  We should be using `totalCoreCount` like the 
> other schedulers do.
> Fixing this is important for locality, since users often want to wait for all 
> executors to come up before scheduling tasks to ensure they get a node-local 
> placement. 
> original PR to add support: https://github.com/apache/spark/pull/8672/files
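
For reference, the user-facing settings involved, as a sketch (values are examples):
{code}
import org.apache.spark.SparkConf

// Wait until all expected cores have registered before scheduling tasks,
// but give up waiting after 120s.
val conf = new SparkConf()
  .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")
  .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "120s")
{code}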



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19373) Mesos implementation of spark.scheduler.minRegisteredResourcesRatio looks at acquired cores rather than registerd cores

2017-02-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889084#comment-15889084
 ] 

Sean Owen commented on SPARK-19373:
---

Resolved by https://github.com/apache/spark/pull/17045

> Mesos implementation of spark.scheduler.minRegisteredResourcesRatio looks at 
> acquired cores rather than registerd cores
> ---
>
> Key: SPARK-19373
> URL: https://issues.apache.org/jira/browse/SPARK-19373
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.3, 2.0.2, 2.1.0
>Reporter: Michael Gummelt
>Assignee: Michael Gummelt
> Fix For: 2.2.0
>
>
> We're currently using `totalCoresAcquired` to account for registered 
> resources, which is incorrect.  That variable measures the number of cores 
> the scheduler has accepted.  We should be using `totalCoreCount` like the 
> other schedulers do.
> Fixing this is important for locality, since users often want to wait for all 
> executors to come up before scheduling tasks to ensure they get a node-local 
> placement. 
> original PR to add support: https://github.com/apache/spark/pull/8672/files



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14503) spark.ml Scala API for FPGrowth

2017-02-28 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889085#comment-15889085
 ] 

Joseph K. Bradley commented on SPARK-14503:
---

Sorry for the slow reply.  I actually haven't read enough about PrefixSpan to 
say, and the original paper doesn't seem to cover rule generation.

We're going with PrefixSpanModel currently.  The benefit from combining the 
models seems pretty small given the unknowns here, and if they can share 
implementations, we can do so internally in the future.

> spark.ml Scala API for FPGrowth
> ---
>
> Key: SPARK-14503
> URL: https://issues.apache.org/jira/browse/SPARK-14503
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>
> This task is the first port of spark.mllib.fpm functionality to spark.ml 
> (Scala).
> This will require a brief design doc to confirm a reasonable DataFrame-based 
> API, with details for this class.  The doc could also look ahead to the other 
> fpm classes, especially if their API decisions will affect FPGrowth.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16512) No way to load CSV data without dropping whole rows when some of data is not matched with given schema

2017-02-28 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-16512.
--
Resolution: Duplicate

> No way to load CSV data without dropping whole rows when some of data is not 
> matched with given schema
> --
>
> Key: SPARK-16512
> URL: https://issues.apache.org/jira/browse/SPARK-16512
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, there is no way to read CSV data without dropping whole rows when 
> some of the data does not match the given schema.
> It seems there are use cases like the one below:
> {code}
> a,b
> 1,c
> {code}
> Here, {{a}} can be dirty data in real use cases.
> But codes below:
> {code}
> import org.apache.spark.sql.types._
> val path = "/tmp/test.csv"
> val schema = StructType(
>   StructField("a", IntegerType, nullable = true) ::
>   StructField("b", StringType, nullable = true) :: Nil)
> val df = spark.read
>   .format("csv")
>   .option("mode", "PERMISSIVE")
>   .schema(schema)
>   .load(path)
> df.show()
> {code}
> emits the exception below:
> {code}
> java.lang.NumberFormatException: For input string: "a"
>   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Integer.parseInt(Integer.java:580)
>   at java.lang.Integer.parseInt(Integer.java:615)
>   at 
> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
>   at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:244)
> {code}
> With {{DROPMALFORMED}} and {{FAILFAST}}, the row is dropped or the read fails with an 
> exception.
> FYI, this is not the case for JSON because JSON data sources can handle this 
> with {{PERMISSIVE}} mode as below:
> {code}
> val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : \"a\"}"))
> val schema = StructType(StructField("a", IntegerType, nullable = true) :: Nil)
> spark.read.option("mode", "PERMISSIVE").schema(schema).json(rdd).show()
> {code}
> {code}
> ++
> |   a|
> ++
> |   1|
> |null|
> ++
> {code}
> Please refer https://github.com/databricks/spark-csv/pull/298



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19769) Quickstart self-contained application instructions do not work with current sbt

2017-02-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-19769:
-

Assignee: Michael McCune

> Quickstart self-contained application instructions do not work with current 
> sbt
> ---
>
> Key: SPARK-19769
> URL: https://issues.apache.org/jira/browse/SPARK-19769
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Michael McCune
>Assignee: Michael McCune
>Priority: Trivial
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> The current quickstart instructions for the "Self-Contained Applications" 
> section instruct the user to create a file named {{simple.sbt}} to drive the 
> build tooling. With current versions of sbt (i.e. 1.0), however, the tooling does 
> not recognize the {{simple.sbt}} file, as it looks for a file named 
> {{build.sbt}}. 
> When following the quickstart instructions, I see the following output:
> {noformat}
> $ find .
> .
> ./simple.sbt
> ./src
> ./src/main
> ./src/main/scala
> ./src/main/scala/SimpleApp.scala
> [mike@ultra] master ~/workspace/sandbox/SimpleApp
> $ sbt package
> /home/mike/workspace/sandbox/SimpleApp doesn't appear to be an sbt project.
> If you want to start sbt anyway, run:
>   /home/mike/bin/sbt -sbt-create
> {noformat}
> Changing the filename to {{build.sbt}} produces a valid build:
> {noformat}
> $ mv simple.sbt build.sbt
> [mike@ultra] master ~/workspace/sandbox/SimpleApp
> $ sbt package
> [info] Set current project to Simple Project (in build 
> file:/home/mike/workspace/sandbox/SimpleApp/)
> [info] Updating {file:/home/mike/workspace/sandbox/SimpleApp/}simpleapp...
> [info] Resolving jline#jline;2.12.1 ...
> [info] Done updating.
> [info] Compiling 1 Scala source to 
> /home/mike/workspace/sandbox/SimpleApp/target/scala-2.11/classes...
> [info] Packaging 
> /home/mike/workspace/sandbox/SimpleApp/target/scala-2.11/simple-project_2.11-1.0.jar
>  ...
> [info] Done packaging.
> [success] Total time: 10 s, completed Feb 28, 2017 10:01:58 AM
> {noformat}
> I think the documentation just needs to be changed to reflect the new 
> filename.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19521) Error with embedded line break (multi-line record) in csv file.

2017-02-28 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19521.
--
Resolution: Duplicate

I am resolving this as a duplicate of SPARK-19610 as that one has a PR merged.

> Error with embedded line break (multi-line record) in csv file.
> ---
>
> Key: SPARK-19521
> URL: https://issues.apache.org/jira/browse/SPARK-19521
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: $ uname -a
> Linux jupyter-test 3.19.0-25-generic #26~14.04.1-Ubuntu SMP Fri Jul 24 
> 21:16:20 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Ruslan Korniichuk
>
> Input csv file:
> id,name,amount,isActive,Remark
> 1,Barney & Company,0,,"Great to work with 
> and always pays with cash."
> Output json file with spark-2.0.2-bin-hadoop2.7:
> {"id":"1","name":"Barney & Company","amount":0,"Remark":"Great to work with 
> \nand always pays with cash."}
> Error with spark-2.1.0-bin-hadoop2.7:
> java.lang.RuntimeException: Malformed line in FAILFAST mode: and always pays 
> with cash."
> 17/02/08 22:53:02 ERROR Utils: Aborting task
> java.lang.RuntimeException: Malformed line in FAILFAST mode: and always pays 
> with cash."
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:106)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 17/02/08 22:53:02 ERROR FileFormatWriter: Job job_20170208225302_0003 aborted.
> 17/02/08 22:53:02 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 4)
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at 

[jira] [Resolved] (SPARK-17225) Support multiple null values in csv files

2017-02-28 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-17225.
--
Resolution: Duplicate

I am resolving this as a duplicate because that JIRA has a PR.

> Support multiple null values in csv files
> -
>
> Key: SPARK-17225
> URL: https://issues.apache.org/jira/browse/SPARK-17225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Since we're dealing with strings, it's useful to have multiple different 
> representations of null values, as the data might not be fully normalized.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14194) spark csv reader not working properly if CSV content contains CRLF character (newline) in the intermediate cell

2017-02-28 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-14194.
--
Resolution: Duplicate

I proposed to solve this via the {{wholeFile}} option and that change appears to be 
merged. I am resolving this as a duplicate of that issue, since that one has a PR.
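
For reference, a sketch of reading such multi-line records with that option (the option was named {{wholeFile}} in the PR; released versions may expose it under a different name such as {{multiLine}}, so treat the key as an assumption):
{code}
// Parse records whose quoted fields contain embedded newlines; path is a placeholder.
val df = spark.read
  .option("header", "true")
  .option("wholeFile", "true")   // see note above about the option name
  .csv("/path/to/data.csv")
{code}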

> spark csv reader not working properly if CSV content contains CRLF character 
> (newline) in the intermediate cell
> ---
>
> Key: SPARK-14194
> URL: https://issues.apache.org/jira/browse/SPARK-14194
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 2.1.0
>Reporter: Kumaresh C R
>
> We have CSV content like below,
> Sl.NO, Employee_Name, Company, Address, Country, ZIP_Code\n\r
> "1", "ABCD", "XYZ", "1234", "XZ Street \n\r(CRLF charater), 
> Municapality,","USA", "1234567"
> Since there is a '\n\r' character in the middle of the row (to be exact, in the 
> Address column), when we execute the Spark code below, it tries to create the 
> dataframe with two rows (excluding the header row), which is wrong. Since we have 
> specified the quote (") character, why does it treat the character in the 
> middle as a newline? This creates an issue while processing the 
> resulting dataframe.
>  DataFrame df = 
> sqlContextManager.getSqlContext().read().format("com.databricks.spark.csv")
> .option("header", "true")
> .option("inferSchema", "true")
> .option("delimiter", delim)
> .option("quote", quote)
> .option("escape", escape)
> .load(sourceFile);
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19769) Quickstart self-contained application instructions do not work with current sbt

2017-02-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19769.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.0.3
   2.1.1

Issue resolved by pull request 17101
[https://github.com/apache/spark/pull/17101]

> Quickstart self-contained application instructions do not work with current 
> sbt
> ---
>
> Key: SPARK-19769
> URL: https://issues.apache.org/jira/browse/SPARK-19769
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Michael McCune
>Priority: Trivial
> Fix For: 2.1.1, 2.0.3, 2.2.0
>
>
> The current quickstart instructions for the "Self-Contained Applications" 
> section instruct the user to create a file named {{simple.sbt}} to drive the 
> build tooling. With current versions of sbt (i.e. 1.0), however, the tooling does 
> not recognize the {{simple.sbt}} file, as it looks for a file named 
> {{build.sbt}}. 
> When following the quickstart instructions, I see the following output:
> {noformat}
> $ find .
> .
> ./simple.sbt
> ./src
> ./src/main
> ./src/main/scala
> ./src/main/scala/SimpleApp.scala
> [mike@ultra] master ~/workspace/sandbox/SimpleApp
> $ sbt package
> /home/mike/workspace/sandbox/SimpleApp doesn't appear to be an sbt project.
> If you want to start sbt anyway, run:
>   /home/mike/bin/sbt -sbt-create
> {noformat}
> Changing the filename to {{build.sbt}} produces a valid build:
> {noformat}
> $ mv simple.sbt build.sbt
> [mike@ultra] master ~/workspace/sandbox/SimpleApp
> $ sbt package
> [info] Set current project to Simple Project (in build 
> file:/home/mike/workspace/sandbox/SimpleApp/)
> [info] Updating {file:/home/mike/workspace/sandbox/SimpleApp/}simpleapp...
> [info] Resolving jline#jline;2.12.1 ...
> [info] Done updating.
> [info] Compiling 1 Scala source to 
> /home/mike/workspace/sandbox/SimpleApp/target/scala-2.11/classes...
> [info] Packaging 
> /home/mike/workspace/sandbox/SimpleApp/target/scala-2.11/simple-project_2.11-1.0.jar
>  ...
> [info] Done packaging.
> [success] Total time: 10 s, completed Feb 28, 2017 10:01:58 AM
> {noformat}
> I think the documentation just needs to be changed to reflect the new 
> filename.
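
For reference, a minimal {{build.sbt}} matching the output above (the Scala patch 
version and the Spark dependency line are illustrative; use whatever the quickstart 
lists):
{code}
name := "Simple Project"

version := "1.0"

scalaVersion := "2.11.8"   // any 2.11.x, to match the scala-2.11 output above

// illustrative dependency; use the artifact and version from the quickstart
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
{code}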



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19764) Executors hang with supposedly running task that are really finished.

2017-02-28 Thread Ari Gesher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889067#comment-15889067
 ] 

Ari Gesher edited comment on SPARK-19764 at 2/28/17 11:04 PM:
--

That was the log in the application directory on the driver machine.

The other log was from the SPARK_LOG_DIR on one of the workers (*.112, as 
referenced in the log snippets and the Web UI in the body of the comments).

We're working on a repro to get you a stack trace.


was (Author: agesher):
That was the log in the application directory on the driver machine.

We're working on a repro to get you a stack trace.

> Executors hang with supposedly running task that are really finished.
> -
>
> Key: SPARK-19764
> URL: https://issues.apache.org/jira/browse/SPARK-19764
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.0.2
> Environment: Ubuntu 16.04 LTS
> OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13)
> Spark 2.0.2 - Spark Cluster Manager
>Reporter: Ari Gesher
> Attachments: driver-log-stderr.log, executor-2.log
>
>
> We've come across a job that won't finish. Running on a six-node cluster, 
> each of the executors ends up with 5-7 tasks that are never marked as 
> completed.
> Here's an excerpt from the web UI:
> ||Index  ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch 
> Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result 
> Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read 
> Size / Records||Errors||
> |105  | 1131  | 0 | SUCCESS   |PROCESS_LOCAL  |4 / 172.31.24.171 |
> 2017/02/27 22:51:36 |   1.9 min |   9 ms |  4 ms |  0.7 s | 2 ms|   6 ms| 
>   384.1 MB|   90.3 MB / 572   | |
> |106| 1168|   0|  RUNNING |ANY|   2 / 172.31.16.112|  2017/02/27 
> 22:53:25|6.5 h   |0 ms|  0 ms|   1 s |0 ms|  0 ms|   |384.1 MB   
> |98.7 MB / 624 | |  
> However, the Executor reports the task as finished: 
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> As does the driver log:
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> Full log from this executor and the {{stderr}} from 
> {{app-20170227223614-0001/2/stderr}} attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17224) Support skipping multiple header rows in csv

2017-02-28 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-17224.
--
Resolution: Duplicate

Multiple header lines can now be dealt with via the {{wholeFile}} option. Let me 
resolve this as a duplicate for now.

> Support skipping multiple header rows in csv
> 
>
> Key: SPARK-17224
> URL: https://issues.apache.org/jira/browse/SPARK-17224
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Headers can be multi-line, and sometimes you want to skip multiple rows 
> because of the format you've been given.
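
For reference, one manual workaround sketch (not the {{wholeFile}} approach 
mentioned above; the path and the number of header rows are illustrative, and 
{{csv(Dataset[String])}} needs a Spark version that provides it):
{code}
import spark.implicits._

val headerRows = 3  // number of leading lines to skip (illustrative)

// Drop the first N physical lines before handing the data to the CSV parser.
val body = spark.sparkContext
  .textFile("report.csv")
  .zipWithIndex()
  .filter { case (_, idx) => idx >= headerRows }
  .map { case (line, _) => line }

val df = spark.read.option("inferSchema", "true").csv(body.toDS())
{code}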



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19764) Executors hang with supposedly running task that are really finished.

2017-02-28 Thread Ari Gesher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889067#comment-15889067
 ] 

Ari Gesher commented on SPARK-19764:


That was the log in the application directory on the driver machine.

We're working on a repro to get you a stack trace.

> Executors hang with supposedly running task that are really finished.
> -
>
> Key: SPARK-19764
> URL: https://issues.apache.org/jira/browse/SPARK-19764
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.0.2
> Environment: Ubuntu 16.04 LTS
> OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13)
> Spark 2.0.2 - Spark Cluster Manager
>Reporter: Ari Gesher
> Attachments: driver-log-stderr.log, executor-2.log
>
>
> We've come across a job that won't finish. Running on a six-node cluster, 
> each of the executors ends up with 5-7 tasks that are never marked as 
> completed.
> Here's an excerpt from the web UI:
> ||Index  ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch 
> Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result 
> Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read 
> Size / Records||Errors||
> |105  | 1131  | 0 | SUCCESS   |PROCESS_LOCAL  |4 / 172.31.24.171 |
> 2017/02/27 22:51:36 |   1.9 min |   9 ms |  4 ms |  0.7 s | 2 ms|   6 ms| 
>   384.1 MB|   90.3 MB / 572   | |
> |106| 1168|   0|  RUNNING |ANY|   2 / 172.31.16.112|  2017/02/27 
> 22:53:25|6.5 h   |0 ms|  0 ms|   1 s |0 ms|  0 ms|   |384.1 MB   
> |98.7 MB / 624 | |  
> However, the Executor reports the task as finished: 
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> As does the driver log:
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> Full log from this executor and the {{stderr}} from 
> {{app-20170227223614-0001/2/stderr}} attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16102) Use Record API from Univocity rather than current data cast API.

2017-02-28 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889058#comment-15889058
 ] 

Hyukjin Kwon commented on SPARK-16102:
--

Yes, let me check out this API and other APIs too. Let me try to update this 
JIRA within this week.



> Use Record API from Univocity rather than current data cast API.
> 
>
> Key: SPARK-16102
> URL: https://issues.apache.org/jira/browse/SPARK-16102
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> There is a Record API for the Univocity parser.
> This API provides typed data. Spark currently compares and casts each value 
> itself.
> Using this API should reduce the amount of code in Spark and maybe improve 
> performance.
> A benchmark should be run first.
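
For reference, a rough standalone sketch of the Record API (plain univocity usage 
outside Spark; the column names and data are made up, and the exact method names 
should be double-checked against the univocity version in use):
{code}
import java.io.StringReader
import scala.collection.JavaConverters._
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

val settings = new CsvParserSettings()
settings.setHeaderExtractionEnabled(true)   // use the header row as column names

val parser = new CsvParser(settings)
val records = parser.parseAllRecords(new StringReader("id,price\n1,2.5\n2,3.0"))

// Typed access, instead of getting strings back and casting them afterwards.
records.asScala.foreach { r =>
  val id    = r.getInt("id")
  val price = r.getDouble("price")
  println(s"$id -> $price")
}
{code}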



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16103) Share a single Row for CSV data source rather than creating every time

2017-02-28 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-16103.
--
Resolution: Duplicate

yup, fixed in https://github.com/apache/spark/pull/16669

> Share a single Row for CSV data source rather than creating every time
> --
>
> Key: SPARK-16103
> URL: https://issues.apache.org/jira/browse/SPARK-16103
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently, the CSV data source creates a new Row for each record.
> A single row can be shared, just like in the other data sources.
> This is somewhat unrelated, but it would also be great if some performance 
> improvements were made. This was suggested in 
> https://github.com/apache/spark/pull/12268.
> - https://github.com/apache/spark/pull/12268#discussion_r61265122
> - https://github.com/apache/spark/pull/12268#discussion_r61270381
> - https://github.com/apache/spark/pull/12268#discussion_r61271158
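
For reference, a conceptual sketch of the reuse pattern (not Spark's actual 
internal code; the iterator and field handling are simplified for illustration):
{code}
// One mutable buffer is updated in place and returned for every input record,
// instead of allocating a fresh row object per record.
class ReusingRowIterator(input: Iterator[Array[String]], numFields: Int)
  extends Iterator[Array[Any]] {

  private val row = new Array[Any](numFields)  // shared and mutated in place

  override def hasNext: Boolean = input.hasNext

  override def next(): Array[Any] = {
    val tokens = input.next()
    var i = 0
    while (i < numFields) {
      row(i) = if (i < tokens.length) tokens(i) else null
      i += 1
    }
    row  // callers must consume the row before calling next() again
  }
}
{code}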



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18389) Disallow cyclic view reference

2017-02-28 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889047#comment-15889047
 ] 

Wenchen Fan commented on SPARK-18389:
-

[~jiangxb1987] are you working on it?

> Disallow cyclic view reference
> --
>
> Key: SPARK-18389
> URL: https://issues.apache.org/jira/browse/SPARK-18389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The following should not be allowed:
> {code}
> CREATE VIEW testView AS SELECT id FROM jt
> CREATE VIEW testView2 AS SELECT id FROM testView
> ALTER VIEW testView AS SELECT * FROM testView2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19211) Explicitly prevent Insert into View or Create View As Insert

2017-02-28 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889048#comment-15889048
 ] 

Wenchen Fan commented on SPARK-19211:
-

Hi [~jiangxb1987] are you working on it?

> Explicitly prevent Insert into View or Create View As Insert
> 
>
> Key: SPARK-19211
> URL: https://issues.apache.org/jira/browse/SPARK-19211
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Jiang Xingbo
>
> Currently we don't explicitly forbid the following behaviors:
> 1. The statement CREATE VIEW AS INSERT INTO throws the following exception 
> from SQLBuilder:
> `java.lang.UnsupportedOperationException: unsupported plan InsertIntoTable 
> MetastoreRelation default, tbl, false, false`;
> 2. The statement INSERT INTO view VALUES throws the following exception from 
> checkAnalysis:
> `Error in query: Inserting into an RDD-based table is not allowed.;;`
> We should check for these behaviors earlier and explicitly prevent them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19775) Remove an obsolete `partitionBy().insertInto()` test case

2017-02-28 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-19775:
--
Description: 
This issue removes [a test 
case|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L287-L298]
 which was introduced by 
[SPARK-14459|https://github.com/apache/spark/commit/652bbb1bf62722b08a062c7a2bf72019f85e179e]
 and was superseded by 
[SPARK-16033|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L365-L371].
 Basically, we cannot use `partitionBy` and `insertInto` together.

{code}
  test("Reject partitioning that does not match table") {
withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
  sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY 
(part string)")
  val data = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" 
else "odd"))
  .toDF("id", "data", "part")

  intercept[AnalysisException] {
// cannot partition by 2 fields when there is only one in the table 
definition
data.write.partitionBy("part", "data").insertInto("partitioned")
  }
}
  }
{code}


  was:
This issue removes [a test 
case|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L287-L298]
 which was introduced by 
[SPARK-14459|https://github.com/apache/spark/commit/10b671447bc04af250cbd8a7ea86f2769147a78a]
 and was superseded by 
[SPARK-16033|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L365-L371].
 Basically, we cannot use `partitionBy` and `insertInto` together.

{code}
  test("Reject partitioning that does not match table") {
withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
  sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY 
(part string)")
  val data = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" 
else "odd"))
  .toDF("id", "data", "part")

  intercept[AnalysisException] {
// cannot partition by 2 fields when there is only one in the table 
definition
data.write.partitionBy("part", "data").insertInto("partitioned")
  }
}
  }
{code}



> Remove an obsolete `partitionBy().insertInto()` test case
> -
>
> Key: SPARK-19775
> URL: https://issues.apache.org/jira/browse/SPARK-19775
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> This issue removes [a test 
> case|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L287-L298]
>  which was introduced by 
> [SPARK-14459|https://github.com/apache/spark/commit/652bbb1bf62722b08a062c7a2bf72019f85e179e]
>  and was superseded by 
> [SPARK-16033|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L365-L371].
>  Basically, we cannot use `partitionBy` and `insertInto` together.
> {code}
>   test("Reject partitioning that does not match table") {
> withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
>   sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY 
> (part string)")
>   val data = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" 
> else "odd"))
>   .toDF("id", "data", "part")
>   intercept[AnalysisException] {
> // cannot partition by 2 fields when there is only one in the table 
> definition
> data.write.partitionBy("part", "data").insertInto("partitioned")
>   }
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16103) Share a single Row for CSV data source rather than creating every time

2017-02-28 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889042#comment-15889042
 ] 

Wenchen Fan commented on SPARK-16103:
-

seems it's already fixed?

> Share a single Row for CSV data source rather than creating every time
> --
>
> Key: SPARK-16103
> URL: https://issues.apache.org/jira/browse/SPARK-16103
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently, the CSV data source creates a new Row for each record.
> A single row can be shared, just like in the other data sources.
> This is somewhat unrelated, but it would also be great if some performance 
> improvements were made. This was suggested in 
> https://github.com/apache/spark/pull/12268.
> - https://github.com/apache/spark/pull/12268#discussion_r61265122
> - https://github.com/apache/spark/pull/12268#discussion_r61270381
> - https://github.com/apache/spark/pull/12268#discussion_r61271158



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16103) Share a single Row for CSV data source rather than creating every time

2017-02-28 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889038#comment-15889038
 ] 

Wenchen Fan commented on SPARK-16103:
-

Hi [~hyukjin.kwon] are you working on it?

> Share a single Row for CSV data source rather than creating every time
> --
>
> Key: SPARK-16103
> URL: https://issues.apache.org/jira/browse/SPARK-16103
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently, the CSV data source creates a new Row for each record.
> A single row can be shared, just like in the other data sources.
> This is somewhat unrelated, but it would also be great if some performance 
> improvements were made. This was suggested in 
> https://github.com/apache/spark/pull/12268.
> - https://github.com/apache/spark/pull/12268#discussion_r61265122
> - https://github.com/apache/spark/pull/12268#discussion_r61270381
> - https://github.com/apache/spark/pull/12268#discussion_r61271158



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16102) Use Record API from Univocity rather than current data cast API.

2017-02-28 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889036#comment-15889036
 ] 

Wenchen Fan commented on SPARK-16102:
-

Hi [~hyukjin.kwon] are you working on it?

> Use Record API from Univocity rather than current data cast API.
> 
>
> Key: SPARK-16102
> URL: https://issues.apache.org/jira/browse/SPARK-16102
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> There is a Record API for the Univocity parser.
> This API provides typed data. Spark currently compares and casts each value 
> itself.
> Using this API should reduce the amount of code in Spark and maybe improve 
> performance.
> A benchmark should be run first.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19774) StreamExecution should call stop() on sources when a stream fails

2017-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19774:


Assignee: Burak Yavuz  (was: Apache Spark)

> StreamExecution should call stop() on sources when a stream fails
> -
>
> Key: SPARK-19774
> URL: https://issues.apache.org/jira/browse/SPARK-19774
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>
> We call stop() on a Structured Streaming Source only when the stream is shut 
> down, i.e. when a user calls streamingQuery.stop(). We should also stop all 
> sources when the stream fails; otherwise we may leak resources, e.g. 
> connections to Kafka.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19774) StreamExecution should call stop() on sources when a stream fails

2017-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19774:


Assignee: Apache Spark  (was: Burak Yavuz)

> StreamExecution should call stop() on sources when a stream fails
> -
>
> Key: SPARK-19774
> URL: https://issues.apache.org/jira/browse/SPARK-19774
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> We call stop() on a Structured Streaming Source only when the stream is shut 
> down, i.e. when a user calls streamingQuery.stop(). We should also stop all 
> sources when the stream fails; otherwise we may leak resources, e.g. 
> connections to Kafka.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19775) Remove an obsolete `partitionBy().insertInto()` test case

2017-02-28 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-19775:
--
Description: 
This issue removes [a test 
case|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L287-L298]
 which was introduced by 
[SPARK-14459|https://github.com/apache/spark/commit/10b671447bc04af250cbd8a7ea86f2769147a78a]
 and was superseded by 
[SPARK-16033|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L365-L371].
 Basically, we cannot use `partitionBy` and `insertInto` together.

{code}
  test("Reject partitioning that does not match table") {
withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
  sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY 
(part string)")
  val data = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" 
else "odd"))
  .toDF("id", "data", "part")

  intercept[AnalysisException] {
// cannot partition by 2 fields when there is only one in the table 
definition
data.write.partitionBy("part", "data").insertInto("partitioned")
  }
}
  }
{code}


  was:
This issue removes [a test 
case|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L287-L298]
 which was introduced by 
[SPARK-16033|https://github.com/apache/spark/commit/10b671447bc04af250cbd8a7ea86f2769147a78a]
 and was superseded by 
[SPARK-14459|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L365-L371].
 Basically, we cannot use `partitionBy` and `insertInto` together.

{code}
  test("Reject partitioning that does not match table") {
withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
  sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY 
(part string)")
  val data = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" 
else "odd"))
  .toDF("id", "data", "part")

  intercept[AnalysisException] {
// cannot partition by 2 fields when there is only one in the table 
definition
data.write.partitionBy("part", "data").insertInto("partitioned")
  }
}
  }
{code}



> Remove an obsolete `partitionBy().insertInto()` test case
> -
>
> Key: SPARK-19775
> URL: https://issues.apache.org/jira/browse/SPARK-19775
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> This issue removes [a test 
> case|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L287-L298]
>  which was introduced by 
> [SPARK-14459|https://github.com/apache/spark/commit/10b671447bc04af250cbd8a7ea86f2769147a78a]
>  and was superseded by 
> [SPARK-16033|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L365-L371].
>  Basically, we cannot use `partitionBy` and `insertInto` together.
> {code}
>   test("Reject partitioning that does not match table") {
> withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
>   sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY 
> (part string)")
>   val data = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" 
> else "odd"))
>   .toDF("id", "data", "part")
>   intercept[AnalysisException] {
> // cannot partition by 2 fields when there is only one in the table 
> definition
> data.write.partitionBy("part", "data").insertInto("partitioned")
>   }
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19764) Executors hang with supposedly running task that are really finished.

2017-02-28 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889026#comment-15889026
 ] 

Shixiong Zhu commented on SPARK-19764:
--

[~agesher] driver-log-stderr.log is actually the executor log. Did you upload the 
wrong file?

> Executors hang with supposedly running task that are really finished.
> -
>
> Key: SPARK-19764
> URL: https://issues.apache.org/jira/browse/SPARK-19764
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.0.2
> Environment: Ubuntu 16.04 LTS
> OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13)
> Spark 2.0.2 - Spark Cluster Manager
>Reporter: Ari Gesher
> Attachments: driver-log-stderr.log, executor-2.log
>
>
> We've come across a job that won't finish. Running on a six-node cluster, 
> each of the executors ends up with 5-7 tasks that are never marked as 
> completed.
> Here's an excerpt from the web UI:
> ||Index  ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch 
> Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result 
> Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read 
> Size / Records||Errors||
> |105  | 1131  | 0 | SUCCESS   |PROCESS_LOCAL  |4 / 172.31.24.171 |
> 2017/02/27 22:51:36 |   1.9 min |   9 ms |  4 ms |  0.7 s | 2 ms|   6 ms| 
>   384.1 MB|   90.3 MB / 572   | |
> |106| 1168|   0|  RUNNING |ANY|   2 / 172.31.16.112|  2017/02/27 
> 22:53:25|6.5 h   |0 ms|  0 ms|   1 s |0 ms|  0 ms|   |384.1 MB   
> |98.7 MB / 624 | |  
> However, the Executor reports the task as finished: 
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> As does the driver log:
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> Full log from this executor and the {{stderr}} from 
> {{app-20170227223614-0001/2/stderr}} attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19774) StreamExecution should call stop() on sources when a stream fails

2017-02-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889032#comment-15889032
 ] 

Apache Spark commented on SPARK-19774:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/17107

> StreamExecution should call stop() on sources when a stream fails
> -
>
> Key: SPARK-19774
> URL: https://issues.apache.org/jira/browse/SPARK-19774
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>
> We call stop() on a Structured Streaming Source only when the stream is shut 
> down, i.e. when a user calls streamingQuery.stop(). We should also stop all 
> sources when the stream fails; otherwise we may leak resources, e.g. 
> connections to Kafka.
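
For reference, a conceptual sketch of the pattern (the names are illustrative and 
this is not the actual change in the PR):
{code}
trait Source { def stop(): Unit }

// Whatever happens in the streaming loop, attempt stop() on every source.
def runStream(sources: Seq[Source])(body: => Unit): Unit = {
  try {
    body                        // the streaming loop; may throw on failure
  } finally {
    sources.foreach { s =>
      try s.stop()
      catch { case e: Exception => println(s"Failed to stop $s: $e") }
    }
  }
}
{code}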



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14480) Remove meaningless StringIteratorReader for CSV data source for better performance

2017-02-28 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889030#comment-15889030
 ] 

Wenchen Fan commented on SPARK-14480:
-

The regression has been fixed in 
https://issues.apache.org/jira/browse/SPARK-19610

> Remove meaningless StringIteratorReader for CSV data source for better 
> performance
> --
>
> Key: SPARK-14480
> URL: https://issues.apache.org/jira/browse/SPARK-14480
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> Currently, the CSV data source reads and parses CSV data byte by byte (not line 
> by line).
> In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think 
> it was made like this for better performance. However, it looks like there are 
> two problems.
> Firstly, it was actually not faster than processing line by line with an 
> {{Iterator}}, due to the additional logic needed to wrap the {{Iterator}} in a 
> {{Reader}}.
> Secondly, this brought a bit of complexity because it needs additional logic to 
> allow every line to be read byte by byte. So it was pretty difficult to figure 
> out parsing issues (e.g. SPARK-14103). Actually, almost all of the code in 
> {{CSVParser}} might not be needed.
> I made a rough patch and tested this. The test results for the first problem 
> are below:
> h4. Results
> - Original codes with {{Reader}} wrapping {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 14116265034 | 2008277960 |
> - New codes with {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 13451699644 | 1549050564 |
> In more details,
> h4. Method
> - The TPC-H lineitem table is being tested.
> - Only the first 100 rows are collected (via {{take(100)}}).
> - End-to-end tests and parsing time tests are performed 10 times and averages 
> are calculated for each.
> h4. Environment
> - Machine: MacBook Pro Retina
> - CPU: 4
> - Memory: 8GB
> h4. Dataset
> - [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1 
> ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen)]) 
> - Size : 724.66 MB
> h4.  Test Codes
> - Function to measure time
> {code}
> def time[A](f: => A) = {
>   val s = System.nanoTime
>   val ret = f
>   println("time: "+(System.nanoTime-s)/1e6+"ms")
>   ret
> }
> {code}
> - End-to-end test
> {code}
> val path = "lineitem.tbl"
> val df = sqlContext
>   .read
>   .format("csv")
>   .option("header", "false")
>   .option("delimiter", "|")
>   .load(path)
> time(df.take(100))
> {code}
> - Parsing time test for original (in {{BulkCsvParser}})
> {code}
> ...
> // `reader` is a wrapper for an Iterator.
> private val reader = new StringIteratorReader(iter)
> parser.beginParsing(reader)
> ...
> time(parser.parseNext())
> ...
> {code}
> - Parsing time test for new (in {{BulkCsvParser}})
> {code}
> ...
> time(parser.parseLine(iter.next()))
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19775) Remove an obsolete `partitionBy().insertInto()` test case

2017-02-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889028#comment-15889028
 ] 

Apache Spark commented on SPARK-19775:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/17106

> Remove an obsolete `partitionBy().insertInto()` test case
> -
>
> Key: SPARK-19775
> URL: https://issues.apache.org/jira/browse/SPARK-19775
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> This issue removes [a test 
> case|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L287-L298]
>  which was introduced by 
> [SPARK-16033|https://github.com/apache/spark/commit/10b671447bc04af250cbd8a7ea86f2769147a78a]
>  and was superseded by 
> [SPARK-14459|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L365-L371].
>  Basically, we cannot use `partitionBy` and `insertInto` together.
> {code}
>   test("Reject partitioning that does not match table") {
> withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
>   sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY 
> (part string)")
>   val data = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" 
> else "odd"))
>   .toDF("id", "data", "part")
>   intercept[AnalysisException] {
> // cannot partition by 2 fields when there is only one in the table 
> definition
> data.write.partitionBy("part", "data").insertInto("partitioned")
>   }
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19775) Remove an obsolete `partitionBy().insertInto()` test case

2017-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19775:


Assignee: (was: Apache Spark)

> Remove an obsolete `partitionBy().insertInto()` test case
> -
>
> Key: SPARK-19775
> URL: https://issues.apache.org/jira/browse/SPARK-19775
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> This issue removes [a test 
> case|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L287-L298]
>  which was introduced by 
> [SPARK-16033|https://github.com/apache/spark/commit/10b671447bc04af250cbd8a7ea86f2769147a78a]
>  and was superseded by 
> [SPARK-14459|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L365-L371].
>  Basically, we cannot use `partitionBy` and `insertInto` together.
> {code}
>   test("Reject partitioning that does not match table") {
> withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
>   sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY 
> (part string)")
>   val data = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" 
> else "odd"))
>   .toDF("id", "data", "part")
>   intercept[AnalysisException] {
> // cannot partition by 2 fields when there is only one in the table 
> definition
> data.write.partitionBy("part", "data").insertInto("partitioned")
>   }
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19775) Remove an obsolete `partitionBy().insertInto()` test case

2017-02-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19775:


Assignee: Apache Spark

> Remove an obsolete `partitionBy().insertInto()` test case
> -
>
> Key: SPARK-19775
> URL: https://issues.apache.org/jira/browse/SPARK-19775
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Trivial
>
> This issue removes [a test 
> case|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L287-L298]
>  which was introduced by 
> [SPARK-16033|https://github.com/apache/spark/commit/10b671447bc04af250cbd8a7ea86f2769147a78a]
>  and was superseded by 
> [SPARK-14459|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L365-L371].
>  Basically, we cannot use `partitionBy` and `insertInto` together.
> {code}
>   test("Reject partitioning that does not match table") {
> withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
>   sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY 
> (part string)")
>   val data = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" 
> else "odd"))
>   .toDF("id", "data", "part")
>   intercept[AnalysisException] {
> // cannot partition by 2 fields when there is only one in the table 
> definition
> data.write.partitionBy("part", "data").insertInto("partitioned")
>   }
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14480) Remove meaningless StringIteratorReader for CSV data source for better performance

2017-02-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-14480.
-
Resolution: Fixed

> Remove meaningless StringIteratorReader for CSV data source for better 
> performance
> --
>
> Key: SPARK-14480
> URL: https://issues.apache.org/jira/browse/SPARK-14480
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> Currently, the CSV data source reads and parses CSV data byte by byte (not line 
> by line).
> In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think 
> it was made like this for better performance. However, it looks like there are 
> two problems.
> Firstly, it was actually not faster than processing line by line with an 
> {{Iterator}}, due to the additional logic needed to wrap the {{Iterator}} in a 
> {{Reader}}.
> Secondly, this brought a bit of complexity because it needs additional logic to 
> allow every line to be read byte by byte. So it was pretty difficult to figure 
> out parsing issues (e.g. SPARK-14103). Actually, almost all of the code in 
> {{CSVParser}} might not be needed.
> I made a rough patch and tested this. The test results for the first problem 
> are below:
> h4. Results
> - Original codes with {{Reader}} wrapping {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 14116265034 | 2008277960 |
> - New codes with {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 13451699644 | 1549050564 |
> In more details,
> h4. Method
> - The TPC-H lineitem table is being tested.
> - Only the first 100 rows are collected (via {{take(100)}}).
> - End-to-end tests and parsing time tests are performed 10 times and averages 
> are calculated for each.
> h4. Environment
> - Machine: MacBook Pro Retina
> - CPU: 4
> - Memory: 8GB
> h4. Dataset
> - [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1 
> ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen)]) 
> - Size : 724.66 MB
> h4.  Test Codes
> - Function to measure time
> {code}
> def time[A](f: => A) = {
>   val s = System.nanoTime
>   val ret = f
>   println("time: "+(System.nanoTime-s)/1e6+"ms")
>   ret
> }
> {code}
> - End-to-end test
> {code}
> val path = "lineitem.tbl"
> val df = sqlContext
>   .read
>   .format("csv")
>   .option("header", "false")
>   .option("delimiter", "|")
>   .load(path)
> time(df.take(100))
> {code}
> - Parsing time test for original (in {{BulkCsvParser}})
> {code}
> ...
> // `reader` is a wrapper for an Iterator.
> private val reader = new StringIteratorReader(iter)
> parser.beginParsing(reader)
> ...
> time(parser.parseNext())
> ...
> {code}
> - Parsing time test for new (in {{BulkCsvParser}})
> {code}
> ...
> time(parser.parseLine(iter.next()))
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


