[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-28 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15617506#comment-15617506
 ] 

Liwei Lin commented on SPARK-16845:
---

[~dondrake] Yeah, I would expect it to work as long as your .jar file does not 
contain any org.apache.spark binaries, i.e. you're not building a fat jar that 
includes the Spark dependencies.

The error message in your attached file indicates it's 
`GeneratedClass$SpecificUnsafeProjection` growing beyond 64 KB - I believe that is 
somewhat related to `explode` and `project` code generation, which is a 
different issue from the `order` code generation issue covered here. :-D

Would you mind opening a new JIRA for that issue?
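For reference, a minimal build.sbt sketch of that setup (versions and module list are illustrative, assuming sbt with sbt-assembly): marking the Spark modules as "provided" keeps org.apache.spark binaries out of the assembled jar, so only application dependencies are bundled.

{code}
// Hypothetical build.sbt fragment: Spark artifacts are "provided" by the cluster,
// so sbt-assembly will not package org.apache.spark classes into the fat jar.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.0.0" % "provided"
)
{code}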

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
> Attachments: error.txt.zip
>
>
> I have a wide table (400 columns); when I try fitting the training data on all 
> columns, the fatal error below occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
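For context, a minimal sketch (hypothetical schema and data, Spark 2.0.x local mode) of how ordering a very wide DataFrame can push the generated SpecificOrdering compare method past the JVM's 64 KB method-size limit; the exact column count needed to trigger it may vary.

{code}
// Sketch only: sorting on every column of a very wide DataFrame forces GenerateOrdering
// to emit one large compare(...) method in GeneratedClass$SpecificOrdering, which is
// the code path that can exceed the 64 KB JVM method limit reported above.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val spark = SparkSession.builder().appName("wide-ordering-sketch").master("local[*]").getOrCreate()

val numCols = 400  // matches the width mentioned in the report; more columns make the failure more likely
val schema = StructType((0 until numCols).map(i => StructField(s"c$i", DoubleType)))
val rows = spark.sparkContext.parallelize(Seq(
  Row.fromSeq(Seq.fill(numCols)(1.0)),
  Row.fromSeq(Seq.fill(numCols)(2.0))))
val wide = spark.createDataFrame(rows, schema)

// Ordering by all columns exercises the generated SpecificOrdering.
wide.sort(schema.fieldNames.map(col): _*).collect()
{code}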






[jira] [Updated] (SPARK-18164) ForeachSink should fail the Spark job if `process` throws exception

2016-10-28 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18164:
-
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-8360

> ForeachSink should fail the Spark job if `process` throws exception
> ---
>
> Key: SPARK-18164
> URL: https://issues.apache.org/jira/browse/SPARK-18164
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.3, 2.1.0
>
>







[jira] [Resolved] (SPARK-18164) ForeachSink should fail the Spark job if `process` throws exception

2016-10-28 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-18164.
--
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.3

> ForeachSink should fail the Spark job if `process` throws exception
> ---
>
> Key: SPARK-18164
> URL: https://issues.apache.org/jira/browse/SPARK-18164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.3, 2.1.0
>
>







[jira] [Updated] (SPARK-18108) Partition discovery fails with explicitly written long partitions

2016-10-28 Thread Richard Moorhead (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Moorhead updated SPARK-18108:
-
Issue Type: Bug  (was: Question)

> Partition discovery fails with explicitly written long partitions
> -
>
> Key: SPARK-18108
> URL: https://issues.apache.org/jira/browse/SPARK-18108
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Richard Moorhead
>Priority: Minor
> Attachments: stacktrace.out
>
>
> We have Parquet data written from Spark 1.6 that, when read from 2.0.1, 
> produces errors.
> {code}
> case class A(a: Long, b: Int)
> val as = Seq(A(1,2))
> //partition explicitly written
> spark.createDataFrame(as).write.parquet("/data/a=1/")
> spark.read.parquet("/data/").collect
> {code}
> The above code fails; the stack trace is attached. 
> If an integer is used, explicit partition discovery succeeds.
> {code}
> case class A(a: Int, b: Int)
> val as = Seq(A(1,2))
> //partition explicitly written
> spark.createDataFrame(as).write.parquet("/data/a=1/")
> spark.read.parquet("/data/").collect
> {code}
> The action succeeds. Additionally, if 'partitionBy' is used instead of 
> explicit writes, partition discovery succeeds (sketched below). 
> Question: Is the first example a reasonable use case? 
> [PartitioningUtils|https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L319]
>  seems to default to Integer types unless the partition value exceeds the 
> integer type's range.
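A minimal sketch of the partitionBy variant mentioned above (same case class, hypothetical path): letting Spark lay out the partition directories itself, instead of writing into a hand-built "/data/a=1/" path, makes discovery succeed.

{code}
// Sketch only: Spark writes the a=1 directory via partitionBy, and partition
// discovery on read then agrees with the written column type.
case class A(a: Long, b: Int)
val as = Seq(A(1L, 2))
spark.createDataFrame(as).write.partitionBy("a").parquet("/data/")
spark.read.parquet("/data/").collect
{code}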






[jira] [Commented] (SPARK-16312) Docs for Kafka 0.10 consumer integration

2016-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15617150#comment-15617150
 ] 

Apache Spark commented on SPARK-16312:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15679

> Docs for Kafka 0.10 consumer integration
> 
>
> Key: SPARK-16312
> URL: https://issues.apache.org/jira/browse/SPARK-16312
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Cody Koeninger
>Assignee: Cody Koeninger
> Fix For: 2.0.1, 2.1.0
>
>







[jira] [Commented] (SPARK-17963) Add examples (extend) in each function and improve documentation with arguments

2016-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15617070#comment-15617070
 ] 

Apache Spark commented on SPARK-17963:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/15677

> Add examples (extend) in each function and improve documentation with 
> arguments
> ---
>
> Key: SPARK-17963
> URL: https://issues.apache.org/jira/browse/SPARK-17963
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> Currently, it seems function documentation is inconsistent and does not have 
> many examples in {{Extended Usage}}.
> For example, some functions have bad indentation, as below:
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED approx_count_distinct;
> Function: approx_count_distinct
> Class: org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus
> Usage: approx_count_distinct(expr) - Returns the estimated cardinality by 
> HyperLogLog++.
> approx_count_distinct(expr, relativeSD=0.05) - Returns the estimated 
> cardinality by HyperLogLog++
>   with relativeSD, the maximum estimation error allowed.
> Extended Usage:
> No example for approx_count_distinct.
> {code}
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED count;
> Function: count
> Class: org.apache.spark.sql.catalyst.expressions.aggregate.Count
> Usage: count(*) - Returns the total number of retrieved rows, including rows 
> containing NULL values.
> count(expr) - Returns the number of rows for which the supplied 
> expression is non-NULL.
> count(DISTINCT expr[, expr...]) - Returns the number of rows for which 
> the supplied expression(s) are unique and non-NULL.
> Extended Usage:
> No example for count.
> {code}
> whereas some do have a pretty one
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED percentile_approx;
> Function: percentile_approx
> Class: 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
> Usage:
>   percentile_approx(col, percentage [, accuracy]) - Returns the 
> approximate percentile value of numeric
>   column `col` at the given percentage. The value of percentage must be 
> between 0.0
>   and 1.0. The `accuracy` parameter (default: 1) is a positive 
> integer literal which
>   controls approximation accuracy at the cost of memory. Higher value of 
> `accuracy` yields
>   better accuracy, `1.0/accuracy` is the relative error of the 
> approximation.
>   percentile_approx(col, array(percentage1 [, percentage2]...) [, 
> accuracy]) - Returns the approximate
>   percentile array of column `col` at the given percentage array. Each 
> value of the
>   percentage array must be between 0.0 and 1.0. The `accuracy` parameter 
> (default: 1) is
>a positive integer literal which controls approximation accuracy at 
> the cost of memory.
>Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is 
> the relative error of
>the approximation.
> Extended Usage:
> No example for percentile_approx.
> {code}
> Also, there is some inconsistent spacing, for example, 
> {{_FUNC_(a,b)}} and {{_FUNC_(a, b)}} (note the spacing between arguments).
> It'd be nicer if most of them had a good example with possible argument 
> types.
> Suggested format is as below for multiple line usage:
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED rand;
> Function: rand
> Class: org.apache.spark.sql.catalyst.expressions.Rand
> Usage:
>   rand() - Returns a random column with i.i.d. uniformly distributed 
> values in [0, 1].
> seed is given randomly.
>   rand(seed) - Returns a random column with i.i.d. uniformly distributed 
> values in [0, 1].
> seed should be an integer/long/NULL literal.
> Extended Usage:
> > SELECT rand();
>  0.9629742951434543
> > SELECT rand(0);
>  0.8446490682263027
> > SELECT rand(NULL);
>  0.8446490682263027
> {code}
> For single line usage:
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED date_add;
> Function: date_add
> Class: org.apache.spark.sql.catalyst.expressions.DateAdd
> Usage: date_add(start_date, num_days) - Returns the date that is num_days 
> after start_date.
> Extended Usage:
> > SELECT date_add('2016-07-30', 1);
>  '2016-07-31'
> {code}






[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-10-28 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15617004#comment-15617004
 ] 

Michael Allman commented on SPARK-17344:


We (at VideoAmp) would love to use Structured Streaming with Kafka. However, we 
use Kafka 0.8 and have no present desire or reason to upgrade. There's nothing 
valuable to us in the newer versions, and performing an upgrade would be a 
highly non-trivial undertaking given that so many of our production systems 
integrate with Kafka. We really don't want to mess with something like that.

I believe the new streaming API has great potential to simplify our streaming 
apps, and I'm eager to try a fresh approach to streaming in Spark. I cannot 
commit to making a contribution to this effort, but I wanted to voice my 
opinion and cast my vote.

Thanks.

> Kafka 0.8 support for Structured Streaming
> --
>
> Key: SPARK-17344
> URL: https://issues.apache.org/jira/browse/SPARK-17344
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Frederick Reiss
>
> Design and implement Kafka 0.8-based sources and sinks for Structured 
> Streaming.






[jira] [Commented] (SPARK-18168) Revert the change of SPARK-18167

2016-10-28 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616980#comment-15616980
 ] 

Yin Huai commented on SPARK-18168:
--

cc [~ekhliang]

> Revert the change of SPARK-18167
> 
>
> Key: SPARK-18168
> URL: https://issues.apache.org/jira/browse/SPARK-18168
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>
> We need to revert the change for 
> https://github.com/apache/spark/pull/15676/files before the release. That 
> jira is used to investigate a flaky test.






[jira] [Updated] (SPARK-18168) Revert the change of SPARK-18167

2016-10-28 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-18168:
-
Target Version/s: 2.1.0

> Revert the change of SPARK-18167
> 
>
> Key: SPARK-18168
> URL: https://issues.apache.org/jira/browse/SPARK-18168
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>
> We need to revert the change for 
> https://github.com/apache/spark/pull/15676/files before the release. That 
> jira is used to investigate a flaky test.






[jira] [Created] (SPARK-18168) Revert the change of SPARK-18167

2016-10-28 Thread Yin Huai (JIRA)
Yin Huai created SPARK-18168:


 Summary: Revert the change of SPARK-18167
 Key: SPARK-18168
 URL: https://issues.apache.org/jira/browse/SPARK-18168
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai


We need to revert the change for 
https://github.com/apache/spark/pull/15676/files before the release. That jira 
is used to investigate a flaky test.






[jira] [Updated] (SPARK-18168) Revert the change of SPARK-18167

2016-10-28 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-18168:
-
Priority: Blocker  (was: Major)

> Revert the change of SPARK-18167
> 
>
> Key: SPARK-18168
> URL: https://issues.apache.org/jira/browse/SPARK-18168
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>
> We need to revert the change for 
> https://github.com/apache/spark/pull/15676/files before the release. That 
> jira is used to investigate a flaky test.






[jira] [Assigned] (SPARK-18167) Flaky test when hive partition pruning is enabled

2016-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18167:


Assignee: (was: Apache Spark)

> Flaky test when hive partition pruning is enabled
> -
>
> Key: SPARK-18167
> URL: https://issues.apache.org/jira/browse/SPARK-18167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
>
> org.apache.spark.sql.hive.execution.SQLQuerySuite is flaking when hive 
> partition pruning is enabled.
> Based on the stack traces, it seems to be an old issue where Hive fails to 
> cast a numeric partition column ("Invalid character string format for type 
> DECIMAL"). There are two possibilities here: either we are somehow corrupting 
> the partition table to have non-decimal values in that column, or there is a 
> transient issue with Derby.
> {code}
> Error Message  java.lang.reflect.InvocationTargetException: null Stacktrace  
> sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null 
>at sun.reflect.GeneratedMethodAccessor263.invoke(Unknown Source)at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497) at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:588)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:544)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:542)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:542)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:702)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:686)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91)
>at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:686)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:769)
> at 
> org.apache.spark.sql.execution.datasources.TableFileCatalog.filterPartitions(TableFileCatalog.scala:67)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:59)
> at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:26)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
>at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:25)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
> at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
> at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)  
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>  at scala.collection.immutable.List.foreach(List.scala:381)  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:73)
>   at 
> org.apache.spark

[jira] [Assigned] (SPARK-18167) Flaky test when hive partition pruning is enabled

2016-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18167:


Assignee: Apache Spark

> Flaky test when hive partition pruning is enabled
> -
>
> Key: SPARK-18167
> URL: https://issues.apache.org/jira/browse/SPARK-18167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Apache Spark
>
> org.apache.spark.sql.hive.execution.SQLQuerySuite is flaking when hive 
> partition pruning is enabled.
> Based on the stack traces, it seems to be an old issue where Hive fails to 
> cast a numeric partition column ("Invalid character string format for type 
> DECIMAL"). There are two possibilities here: either we are somehow corrupting 
> the partition table to have non-decimal values in that column, or there is a 
> transient issue with Derby.
> {code}
> Error Message  java.lang.reflect.InvocationTargetException: null Stacktrace  
> sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null 
>at sun.reflect.GeneratedMethodAccessor263.invoke(Unknown Source)at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497) at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:588)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:544)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:542)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:542)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:702)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:686)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91)
>at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:686)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:769)
> at 
> org.apache.spark.sql.execution.datasources.TableFileCatalog.filterPartitions(TableFileCatalog.scala:67)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:59)
> at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:26)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
>at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:25)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
> at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
> at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)  
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>  at scala.collection.immutable.List.foreach(List.scala:381)  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:73)
>

[jira] [Commented] (SPARK-18167) Flaky test when hive partition pruning is enabled

2016-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616845#comment-15616845
 ] 

Apache Spark commented on SPARK-18167:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/15676

> Flaky test when hive partition pruning is enabled
> -
>
> Key: SPARK-18167
> URL: https://issues.apache.org/jira/browse/SPARK-18167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
>
> org.apache.spark.sql.hive.execution.SQLQuerySuite is flaking when hive 
> partition pruning is enabled.
> Based on the stack traces, it seems to be an old issue where Hive fails to 
> cast a numeric partition column ("Invalid character string format for type 
> DECIMAL"). There are two possibilities here: either we are somehow corrupting 
> the partition table to have non-decimal values in that column, or there is a 
> transient issue with Derby.
> {code}
> Error Message  java.lang.reflect.InvocationTargetException: null Stacktrace  
> sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null 
>at sun.reflect.GeneratedMethodAccessor263.invoke(Unknown Source)at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497) at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:588)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:544)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:542)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:542)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:702)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:686)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91)
>at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:686)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:769)
> at 
> org.apache.spark.sql.execution.datasources.TableFileCatalog.filterPartitions(TableFileCatalog.scala:67)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:59)
> at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:26)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
>at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:25)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
> at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
> at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)  
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>  at scala.collection.immutable.List.foreach(List.scala:381)  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> org.apache.spa

[jira] [Created] (SPARK-18167) Flaky test when hive partition pruning is enabled

2016-10-28 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18167:
--

 Summary: Flaky test when hive partition pruning is enabled
 Key: SPARK-18167
 URL: https://issues.apache.org/jira/browse/SPARK-18167
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Eric Liang


org.apache.spark.sql.hive.execution.SQLQuerySuite is flaking when hive 
partition pruning is enabled.

Based on the stack traces, it seems to be an old issue where Hive fails to cast 
a numeric partition column ("Invalid character string format for type 
DECIMAL"). There are two possibilities here: either we are somehow corrupting 
the partition table to have non-decimal values in that column, or there is a 
transient issue with Derby.

{code}
Error Message  java.lang.reflect.InvocationTargetException: null Stacktrace  
sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null  at 
sun.reflect.GeneratedMethodAccessor263.invoke(Unknown Source)at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497) at 
org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:588)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:544)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:542)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:542)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:702)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:686)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91)
   at 
org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:686)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:769)
at 
org.apache.spark.sql.execution.datasources.TableFileCatalog.filterPartitions(TableFileCatalog.scala:67)
  at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:59)
at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:26)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)  
 at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26)
  at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:25)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)  
 at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
 at scala.collection.immutable.List.foreach(List.scala:381)  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) 
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:73)
  at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:73)
 at 
org.apache.spark.sql.QueryTest.assertEmptyMissingInput(QueryTest.scala:234)  at 
org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:170)  at 
org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$76$$anonfun$apply$mcV$sp$24.apply$mcV$sp(SQLQuerySuite.scala:1559)
   

[jira] [Updated] (SPARK-18144) StreamingQueryListener.QueryStartedEvent is not written to event log

2016-10-28 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18144:
-
Priority: Minor  (was: Major)

> StreamingQueryListener.QueryStartedEvent is not written to event log
> 
>
> Key: SPARK-18144
> URL: https://issues.apache.org/jira/browse/SPARK-18144
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Priority: Minor
>







[jira] [Created] (SPARK-18166) GeneralizedLinearRegression Wrong Value Range for Poisson Distribution

2016-10-28 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-18166:
---

 Summary: GeneralizedLinearRegression Wrong Value Range for Poisson 
Distribution  
 Key: SPARK-18166
 URL: https://issues.apache.org/jira/browse/SPARK-18166
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.0.0
Reporter: Wayne Zhang


The current implementation of the Poisson GLM seems to allow only positive 
response values (see below). This is not correct, since the support of the 
Poisson distribution includes the origin. 

{code}
override def initialize(y: Double, weight: Double): Double = {
  require(y {color:red} > {color} 0.0, "The response variable of Poisson family " +
    s"should be positive, but got $y")
  y
}
{code}

The fix is easy, just change it to 

{code}
require(y {color:red} >= {color} 0.0, "The response variable of Poisson family " +
{code}






[jira] [Updated] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide

2016-10-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18081:
--
Target Version/s: 2.1.0

> Locality Sensitive Hashing (LSH) User Guide
> ---
>
> Key: SPARK-18081
> URL: https://issues.apache.org/jira/browse/SPARK-18081
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Yun Ni
>







[jira] [Resolved] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-10-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-5992.
--
  Resolution: Fixed
   Fix Version/s: 2.1.0
Target Version/s:   (was: )

Issue resolved by pull request 15148
[https://github.com/apache/spark/pull/15148]

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yun Ni
> Fix For: 2.1.0
>
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.






[jira] [Updated] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-10-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5992:
-
Component/s: (was: MLlib)
 ML

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yun Ni
> Fix For: 2.1.0
>
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.






[jira] [Updated] (SPARK-18143) History Server is broken because of the refactoring work in Structured Streaming

2016-10-28 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18143:
-
Priority: Blocker  (was: Major)

> History Server is broken because of the refactoring work in Structured 
> Streaming
> 
>
> Key: SPARK-18143
> URL: https://issues.apache.org/jira/browse/SPARK-18143
> Project: Spark
>  Issue Type: Sub-task
>Affects Versions: 2.0.2
>Reporter: Shixiong Zhu
>Priority: Blocker
>







[jira] [Updated] (SPARK-17791) Join reordering using star schema detection

2016-10-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17791:

Assignee: Ioana Delaney

> Join reordering using star schema detection
> ---
>
> Key: SPARK-17791
> URL: https://issues.apache.org/jira/browse/SPARK-17791
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Ioana Delaney
>Assignee: Ioana Delaney
>Priority: Critical
> Attachments: StarJoinReordering1005.doc
>
>
> This JIRA is a sub-task of SPARK-17626.
> The objective is to provide a consistent performance improvement for star 
> schema queries. Star schema consists of one or more fact tables referencing a 
> number of dimension tables. In general, queries against star schema are 
> expected to run fast  because of the established RI constraints among the 
> tables. This design proposes a join reordering based on natural, generally 
> accepted heuristics for star schema queries:
> * Finds the star join with the largest fact table and places it on the 
> driving arm of the left-deep join. This plan avoids large tables on the 
> inner, and thus favors hash joins. 
> * Applies the most selective dimensions early in the plan to reduce the 
> amount of data flow.
> The design description is included in the below attached document.
> \\






[jira] [Updated] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics

2016-10-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17626:

Target Version/s: 2.2.0

> TPC-DS performance improvements using star-schema heuristics
> 
>
> Key: SPARK-17626
> URL: https://issues.apache.org/jira/browse/SPARK-17626
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Ioana Delaney
>Priority: Critical
> Attachments: StarSchemaJoinReordering.pptx
>
>
> *TPC-DS performance improvements using star-schema heuristics*
> \\
> \\
> TPC-DS consists of multiple snowflake schemas, which are star schemas 
> with dimensions linking to further dimensions. A star schema consists of a fact table 
> referencing a number of dimension tables. The fact table holds the main data 
> about a business; a dimension table, usually smaller, describes data 
> reflecting a dimension/attribute of the business.
> \\
> \\
> As part of the benchmark performance investigation, we observed a pattern of 
> sub-optimal execution plans for large fact-table joins. Manually rewriting 
> some of the queries into selective fact-dimension joins resulted in 
> significant performance improvements. This prompted us to develop a simple 
> join reordering algorithm based on star schema detection. Performance 
> testing using a *1TB TPC-DS workload* shows an overall improvement of *19%*. 
> \\
> \\
> *Summary of the results:*
> {code}
> Passed 99
> Failed  0
> Total q time (s)   14,962
> Max time1,467
> Min time3
> Mean time 145
> Geomean44
> {code}
> *Compared to baseline* (Negative = improvement; Positive = Degradation):
> {code}
> End to end improved (%)  -19% 
> Mean time improved (%)   -19%
> Geomean improved (%) -24%
> End to end improved (seconds)  -3,603
> Number of queries improved (>10%)  45
> Number of queries degraded (>10%)   6
> Number of queries unchanged48
> Top 10 queries improved (%)  -20%
> {code}
> Cluster: 20-node cluster with each node having:
> * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 
> v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
> * Total memory for the cluster: 2.5TB
> * Total storage: 400TB
> * Total CPU cores: 480
> Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA
> Database info:
> * Schema: TPCDS 
> * Scale factor: 1TB total space
> * Storage format: Parquet with Snappy compression
> Our investigation and results are included in the attached document.
> There are two parts to this improvement:
> # Join reordering using star schema detection
> # New selectivity hint to specify the selectivity of the predicates over base 
> tables. Selectivity hint is optional and it was not used in the above TPC-DS 
> tests. 
> \\






[jira] [Updated] (SPARK-17791) Join reordering using star schema detection

2016-10-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17791:

Target Version/s: 2.2.0

> Join reordering using star schema detection
> ---
>
> Key: SPARK-17791
> URL: https://issues.apache.org/jira/browse/SPARK-17791
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Ioana Delaney
>Priority: Critical
> Attachments: StarJoinReordering1005.doc
>
>
> This JIRA is a sub-task of SPARK-17626.
> The objective is to provide a consistent performance improvement for star 
> schema queries. Star schema consists of one or more fact tables referencing a 
> number of dimension tables. In general, queries against star schema are 
> expected to run fast  because of the established RI constraints among the 
> tables. This design proposes a join reordering based on natural, generally 
> accepted heuristics for star schema queries:
> * Finds the star join with the largest fact table and places it on the 
> driving arm of the left-deep join. This plan avoids large tables on the 
> inner, and thus favors hash joins. 
> * Applies the most selective dimensions early in the plan to reduce the 
> amount of data flow.
> The design description is included in the below attached document.
> \\






[jira] [Commented] (SPARK-17791) Join reordering using star schema detection

2016-10-28 Thread Ron Hu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616590#comment-15616590
 ] 

Ron Hu commented on SPARK-17791:


This JIRA is indeed complementary to the Cost-Based Optimizer (CBO) project. 
But we need to be more careful, and as a result we should do this together with 
CBO. Let me explain below.

As I commented previously in SPARK-17626, we need detailed statistics, such as the 
number of distinct values for a join column and the number of records in a table, in 
order to decide which are fact tables and which are dimension tables. Today we are collecting 
statistics in the CBO project to reliably predict fact tables and dimension tables. 
In addition, we are estimating selectivity for every relational algebra 
operator in CBO so that we can produce reliable plan cardinalities.

Without CBO's support, the author is currently using a SQL hint to provide 
predicate selectivity. Note that a SQL hint does not work well, as it is not 
automated and it just shows the current weakness of the Spark SQL optimizer. The 
author also uses the initial table size to predict fact/dimension tables. As we 
know, after applying predicates, the records that actually participate in a join may 
be a small subset of the initial table. Hence initial table size is not a 
reliable way to detect a star schema.
 
While this PR shows promising performance gains on most TPC-DS benchmark 
queries, since TPC-DS has a well-known star schema, can this approach still work 
in real-world applications that do not clearly define a star schema? 
 
Therefore, I suggest that [SPARK-17791] and [SPARK-17626] be tightly 
integrated with the CBO project. We are set to release CBO in Spark 2.2. With the 
progress made so far, we can begin integrating star join reordering in December.

The same comment also applies to [SPARK-17626].

My two cents. Thanks.

> Join reordering using star schema detection
> ---
>
> Key: SPARK-17791
> URL: https://issues.apache.org/jira/browse/SPARK-17791
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Ioana Delaney
>Priority: Critical
> Attachments: StarJoinReordering1005.doc
>
>
> This JIRA is a sub-task of SPARK-17626.
> The objective is to provide a consistent performance improvement for star 
> schema queries. Star schema consists of one or more fact tables referencing a 
> number of dimension tables. In general, queries against star schema are 
> expected to run fast  because of the established RI constraints among the 
> tables. This design proposes a join reordering based on natural, generally 
> accepted heuristics for star schema queries:
> * Finds the star join with the largest fact table and places it on the 
> driving arm of the left-deep join. This plan avoids large tables on the 
> inner, and thus favors hash joins. 
> * Applies the most selective dimensions early in the plan to reduce the 
> amount of data flow.
> The design description is included in the below attached document.
> \\






[jira] [Commented] (SPARK-18123) org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable the case sensitivity issue

2016-10-28 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616584#comment-15616584
 ] 

Dongjoon Hyun commented on SPARK-18123:
---

Hi, [~zwu@gmail.com].
Do you think you could check whether the PR solves your problem?

> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable  the case 
> sensitivity issue
> --
>
> Key: SPARK-18123
> URL: https://issues.apache.org/jira/browse/SPARK-18123
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Paul Wu
>
> The issue is that every field name is blindly quoted when inserting (lines 110-119, 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala):
> {code}
> /**
>  * Returns a PreparedStatement that inserts a row into table via conn.
>  */
> def insertStatement(conn: Connection, table: String, rddSchema: StructType, dialect: JdbcDialect)
>     : PreparedStatement = {
>   val columns = rddSchema.fields.map(x => dialect.quoteIdentifier(x.name)).mkString(",")
>   val placeholders = rddSchema.fields.map(_ => "?").mkString(",")
>   val sql = s"INSERT INTO $table ($columns) VALUES ($placeholders)"
>   conn.prepareStatement(sql)
> }
> {code}
> This code causes the following issue (it does not happen in 1.6.x):
> I have an issue with the saveTable method in Spark 2.0/2.0.1. I tried to save a 
> dataset to an Oracle database, but the field names must be uppercase to succeed. This 
> is not the expected behavior: if only the table names were quoted, this utility 
> would handle case sensitivity correctly. The code below throws the exception 
> Caused by: java.sql.SQLSyntaxErrorException: ORA-00904: "DATETIME_gmt": invalid identifier. 
> {code}
> String detailSQL = "select CAST('2016-09-25 17:00:00' AS TIMESTAMP) DATETIME_gmt, '1' NODEB";
> hc.sql("set spark.sql.caseSensitive=false");
> Dataset ds = hc.sql(detailSQL);
> ds.show();
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable(ds, url, detailTable, p);
> {code}
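As an aside, a possible interim workaround sketch (not the fix under review in the PR; the dialect name and no-quoting behavior are this sketch's assumptions) is to register a custom JdbcDialect whose quoteIdentifier leaves column names unquoted, so Oracle resolves them case-insensitively:

{code}
// Hypothetical workaround: override identifier quoting for Oracle JDBC URLs.
// Registering a dialect is public Spark API; leaving names unquoted may break
// for reserved words or genuinely mixed-case columns.
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

object NoQuoteOracleDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")
  override def quoteIdentifier(colName: String): String = colName
}

JdbcDialects.registerDialect(NoQuoteOracleDialect)
{code}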






[jira] [Commented] (SPARK-17612) Support `DESCRIBE table PARTITION` SQL syntax

2016-10-28 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616575#comment-15616575
 ] 

Dongjoon Hyun commented on SPARK-17612:
---

Ah, I see. As for that question: no, I don't think this issue is related to that. :)

> Support `DESCRIBE table PARTITION` SQL syntax
> -
>
> Key: SPARK-17612
> URL: https://issues.apache.org/jira/browse/SPARK-17612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.0.2, 2.1.0
>
>
> This issue re-implements the `DESC PARTITION` SQL syntax, which had been dropped 
> since Spark 2.0.0.
> h4. Spark 2.0.0
> {code}
> scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY 
> (c STRING, d STRING)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
> res1: org.apache.spark.sql.DataFrame = []
> scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
> org.apache.spark.sql.catalyst.parser.ParseException:
> Unsupported SQL statement
> == SQL ==
> DESC partitioned_table PARTITION (c='Us', d=1)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:58)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:82)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573)
>   ... 48 elided
> {code}
> h4. Spark 1.6.2
> {code}
> scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY 
> (c STRING, d STRING)")
> res1: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
> res2: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
> 16/09/20 12:48:36 WARN LazyStruct: Extra bytes detected at the end of the 
> row! Ignoring similar problems.
> ++
> |result  |
> ++
> |a  string|
> |b  int   |
> |c  string|
> |d  string|
> ||
> |# Partition Information  
> |
> |# col_name data_type   comment |
> ||
> |c  string|
> |d  string|
> ++
> {code}






[jira] [Commented] (SPARK-17612) Support `DESCRIBE table PARTITION` SQL syntax

2016-10-28 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616542#comment-15616542
 ] 

Franck Tago commented on SPARK-17612:
-

Hi,
Thanks for the quick reply. It was a long shot; I wanted to know whether the 
join issue that I described was somehow related to this issue. I know it is not 
probable, but I just wanted to check.

> Support `DESCRIBE table PARTITION` SQL syntax
> -
>
> Key: SPARK-17612
> URL: https://issues.apache.org/jira/browse/SPARK-17612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.0.2, 2.1.0
>
>
> This issue re-implements the `DESC PARTITION` SQL syntax, which had been dropped 
> since Spark 2.0.0.
> h4. Spark 2.0.0
> {code}
> scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY 
> (c STRING, d STRING)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
> res1: org.apache.spark.sql.DataFrame = []
> scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
> org.apache.spark.sql.catalyst.parser.ParseException:
> Unsupported SQL statement
> == SQL ==
> DESC partitioned_table PARTITION (c='Us', d=1)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:58)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:82)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573)
>   ... 48 elided
> {code}
> h4. Spark 1.6.2
> {code}
> scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY 
> (c STRING, d STRING)")
> res1: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
> res2: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
> 16/09/20 12:48:36 WARN LazyStruct: Extra bytes detected at the end of the 
> row! Ignoring similar problems.
> ++
> |result  |
> ++
> |a  string|
> |b  int   |
> |c  string|
> |d  string|
> ||
> |# Partition Information  
> |
> |# col_name data_type   comment |
> ||
> |c  string|
> |d  string|
> ++
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18165) Kinesis support in Structured Streaming

2016-10-28 Thread Lauren Moos (JIRA)
Lauren Moos created SPARK-18165:
---

 Summary: Kinesis support in Structured Streaming
 Key: SPARK-18165
 URL: https://issues.apache.org/jira/browse/SPARK-18165
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Lauren Moos


Implement Kinesis based sources and sinks for Structured Streaming



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18144) StreamingQueryListener.QueryStartedEvent is not written to event log

2016-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616503#comment-15616503
 ] 

Apache Spark commented on SPARK-18144:
--

User 'CodingCat' has created a pull request for this issue:
https://github.com/apache/spark/pull/15675

> StreamingQueryListener.QueryStartedEvent is not written to event log
> 
>
> Key: SPARK-18144
> URL: https://issues.apache.org/jira/browse/SPARK-18144
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18144) StreamingQueryListener.QueryStartedEvent is not written to event log

2016-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18144:


Assignee: (was: Apache Spark)

> StreamingQueryListener.QueryStartedEvent is not written to event log
> 
>
> Key: SPARK-18144
> URL: https://issues.apache.org/jira/browse/SPARK-18144
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18144) StreamingQueryListener.QueryStartedEvent is not written to event log

2016-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18144:


Assignee: Apache Spark

> StreamingQueryListener.QueryStartedEvent is not written to event log
> 
>
> Key: SPARK-18144
> URL: https://issues.apache.org/jira/browse/SPARK-18144
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17612) Support `DESCRIBE table PARTITION` SQL syntax

2016-10-28 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616422#comment-15616422
 ] 

Dongjoon Hyun commented on SPARK-17612:
---

Hi, [~tafra...@gmail.com].
I think you are asking about a *join* problem. Is it related to this `DESC PARTITION` 
issue?
{code}
Hi 
Basically I have an issue where I am performing the following operations.
Partitioned Large Hive Table (hive table 1) -- filter --- join
                                                         /
Non Partitioned Large Hive Table
Basically I am join 2 large tables . Both table raw size exceed the broadcast 
join threshold.
The filter filter a specific partition . This partition is small enough so that 
its size is smaller than the broadcast join threshold.
With Spark 2.0 and Spark 2.0.1 , I do not see a broadcast join . I see a sort 
merge join. 
Which is really surprising to me given that this could be a really common case. 
You can imagine a user who has a large log table partitioned by date and he 
filters on a specific date. We should be able to do a broadcast join in that 
case.
The question now is the following .
I do not think this Spark Issue addresses the cited problem but I could be 
wrong . I tried incorporating the change in the spark 2.0 PR but I see the same 
behavior . That is no broadcast join.
Question : Is this spark issue supposed to address the problem that I mentioned 
?
If not , which i think is the case , do you know if spark currently has a fix 
for the cited issue.
I also tried the fix under SPARK-15616 but I hit a runtime failure .
There has got to be a solution to this problem somewhere.
{code}

> Support `DESCRIBE table PARTITION` SQL syntax
> -
>
> Key: SPARK-17612
> URL: https://issues.apache.org/jira/browse/SPARK-17612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.0.2, 2.1.0
>
>
> This issue implements `DESC PARTITION` SQL Syntax again. It was dropped since 
> Spark 2.0.0.
> h4. Spark 2.0.0
> {code}
> scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY 
> (c STRING, d STRING)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
> res1: org.apache.spark.sql.DataFrame = []
> scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
> org.apache.spark.sql.catalyst.parser.ParseException:
> Unsupported SQL statement
> == SQL ==
> DESC partitioned_table PARTITION (c='Us', d=1)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:58)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:82)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573)
>   ... 48 elided
> {code}
> h4. Spark 1.6.2
> {code}
> scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY 
> (c STRING, d STRING)")
> res1: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
> res2: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
> 16/09/20 12:48:36 WARN LazyStruct: Extra bytes detected at the end of the 
> row! Ignoring similar problems.
> ++
> |result  |
> ++
> |a  string|
> |b  int   |
> |c  string|
> |d  string|
> ||
> |# Partition Information  
> |
> |# col_name data_type   comment |
> ||
> |c  string|
> |d  string|
> ++
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsu

[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-10-28 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616385#comment-15616385
 ] 

Franck Tago commented on SPARK-15616:
-

So I tried the changes that were made for this issue (SPARK-15616), but I hit a runtime 
issue when trying to test them.

scala> spark.conf.set("spark.sql.statistics.fallBackToHdfs" , "true")

scala> spark.conf.set("spark.sql.statistics.partitionPruner" ,"true")

scala> val df1= spark.table("ft_p")
df1: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 1 more field]

scala> val df2=spark.table("ft_p_no")
df2: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 1 more field]

scala> df2.join((df1.select($"col1".as("acol1"), 
$"col3".as("acol3")).filter($"acol3"===5)) , $"col1"===$"acol1" ).explain
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: col3#43
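
A hedged side note: if the metastore simply has no statistics for these tables, computing 
them explicitly may let the planner see a small enough size without relying on the HDFS 
fallback at all. This is a sketch only, untested against this exact setup; the table names 
are the ones from the session above.

{code}
// Populate metastore statistics so the planner has a size estimate to compare
// against the broadcast threshold (sketch, untested against this report).
spark.sql("ANALYZE TABLE ft_p COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE ft_p_no COMPUTE STATISTICS")

// The current broadcast threshold in bytes; relations estimated below it are broadcast.
println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))
{code}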



> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently if some partitions of a partitioned table are used in join 
> operation we rely on Metastore returned size of table to calculate if we can 
> convert the operation to Broadcast join. 
> if Filter can prune some partitions, Hive can prune partition before 
> determining to use broadcast joins according to HDFS size of partitions that 
> are involved in Query.So sparkSQL needs it that can improve join's 
> performance for partitioned table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17612) Support `DESCRIBE table PARTITION` SQL syntax

2016-10-28 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616376#comment-15616376
 ] 

Franck Tago commented on SPARK-17612:
-

Hi,
Basically I have an issue where I am performing the following operations:

Partitioned Large Hive Table (hive table 1) -- filter --- join
                                                         /
Non Partitioned Large Hive Table

I am joining two large tables. Both tables' raw sizes exceed the broadcast join threshold.
The filter selects a specific partition. This partition is small enough that its size is 
below the broadcast join threshold.

With Spark 2.0 and Spark 2.0.1 I do not see a broadcast join; I see a sort merge join.
This is really surprising to me, given that this could be a really common case. You can 
imagine a user who has a large log table partitioned by date and filters on a specific 
date. We should be able to do a broadcast join in that case.

The question now is the following.

I do not think this Spark issue addresses the cited problem, but I could be wrong. I tried 
incorporating the change in the Spark 2.0 PR, but I see the same behavior, that is, no 
broadcast join.

Question: Is this Spark issue supposed to address the problem that I mentioned?

If not, which I think is the case, do you know whether Spark currently has a fix for the 
cited issue? I also tried the fix under SPARK-15616, but I hit a runtime failure.

There has got to be a solution to this problem somewhere.
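
One hedged workaround, independent of whether this issue covers the case: an explicit 
broadcast hint on the filtered (small) side forces a broadcast join regardless of the 
planner's size estimate for the whole table. The table and column names below are 
placeholders, not the actual ones from the pipeline described above.

{code}
import org.apache.spark.sql.functions.broadcast
import spark.implicits._

// Placeholder names: "large_partitioned", "large_non_partitioned", "part_col", "join_key".
val filteredSmallSide = spark.table("large_partitioned").filter($"part_col" === "2016-10-28")
val largeSide         = spark.table("large_non_partitioned")

// The hint marks the filtered side as broadcastable even if its estimated size is large.
largeSide.join(broadcast(filteredSmallSide), Seq("join_key")).explain()  // expect BroadcastHashJoin
{code}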





> Support `DESCRIBE table PARTITION` SQL syntax
> -
>
> Key: SPARK-17612
> URL: https://issues.apache.org/jira/browse/SPARK-17612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.0.2, 2.1.0
>
>
> This issue implements `DESC PARTITION` SQL Syntax again. It was dropped since 
> Spark 2.0.0.
> h4. Spark 2.0.0
> {code}
> scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY 
> (c STRING, d STRING)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
> res1: org.apache.spark.sql.DataFrame = []
> scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
> org.apache.spark.sql.catalyst.parser.ParseException:
> Unsupported SQL statement
> == SQL ==
> DESC partitioned_table PARTITION (c='Us', d=1)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:58)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:82)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573)
>   ... 48 elided
> {code}
> h4. Spark 1.6.2
> {code}
> scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY 
> (c STRING, d STRING)")
> res1: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
> res2: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
> 16/09/20 12:48:36 WARN LazyStruct: Extra bytes detected at the end of the 
> row! Ignoring similar problems.
> ++
> |result  |
> ++
> |a  string|
> |b  int   |
> |c  string|
> |d  string|
> ||
> |# Partition Information  
> |
> |# col_name data_type   comment |
> ||
> |c  string|
> |d  string|
> ++
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-882) Have link for feedback/suggestions in docs

2016-10-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616159#comment-15616159
 ] 

Sean Owen commented on SPARK-882:
-

You're right, I was thinking of the API docs rather than the general 
documentation. If there's an easy way to do that, OK, otherwise yeah this is 
effectively already done.

> Have link for feedback/suggestions in docs
> --
>
> Key: SPARK-882
> URL: https://issues.apache.org/jira/browse/SPARK-882
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Patrick Wendell
>Assignee: Patrick Cogan
>Priority: Minor
>
> It would be cool to have a link at the top of the docs for 
> feedback/suggestions/errors. I bet we'd get a lot of interesting stuff from 
> that and it could be a good way to crowdsource correctness checking, since a 
> lot of us that write them never have to use them.
> Something to the right of the main top nav might be good. [~andyk] [~matei] - 
> what do you guys think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18164) ForeachSink should fail the Spark job if `process` throws exception

2016-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18164:


Assignee: Apache Spark  (was: Shixiong Zhu)

> ForeachSink should fail the Spark job if `process` throws exception
> ---
>
> Key: SPARK-18164
> URL: https://issues.apache.org/jira/browse/SPARK-18164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18164) ForeachSink should fail the Spark job if `process` throws exception

2016-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18164:


Assignee: Shixiong Zhu  (was: Apache Spark)

> ForeachSink should fail the Spark job if `process` throws exception
> ---
>
> Key: SPARK-18164
> URL: https://issues.apache.org/jira/browse/SPARK-18164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18164) ForeachSink should fail the Spark job if `process` throws exception

2016-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616143#comment-15616143
 ] 

Apache Spark commented on SPARK-18164:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/15674

> ForeachSink should fail the Spark job if `process` throws exception
> ---
>
> Key: SPARK-18164
> URL: https://issues.apache.org/jira/browse/SPARK-18164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18164) ForeachSink should fail the Spark job if `process` throws exception

2016-10-28 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-18164:


 Summary: ForeachSink should fail the Spark job if `process` throws 
exception
 Key: SPARK-18164
 URL: https://issues.apache.org/jira/browse/SPARK-18164
 Project: Spark
  Issue Type: Bug
  Components: SQL, Streaming
Affects Versions: 2.0.1, 2.0.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
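
A minimal sketch of the scenario in the summary, assuming a streaming Dataset[String] 
named df exists (it is not defined here); this only illustrates the case, it is not the fix:

{code}
import org.apache.spark.sql.ForeachWriter

// A writer whose process() always throws; per this issue, the streaming query
// should fail instead of silently swallowing the error.
val failingWriter = new ForeachWriter[String] {
  override def open(partitionId: Long, version: Long): Boolean = true
  override def process(value: String): Unit = throw new RuntimeException("boom")
  override def close(errorCause: Throwable): Unit = ()
}

// df.writeStream.foreach(failingWriter).start()   // df: a streaming Dataset[String]
{code}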






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17992) HiveClient.getPartitionsByFilter throws an exception for some unsupported filters when hive.metastore.try.direct.sql=false

2016-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17992:


Assignee: (was: Apache Spark)

> HiveClient.getPartitionsByFilter throws an exception for some unsupported 
> filters when hive.metastore.try.direct.sql=false
> --
>
> Key: SPARK-17992
> URL: https://issues.apache.org/jira/browse/SPARK-17992
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Allman
>
> We recently added (and enabled by default) table partition pruning for 
> partitioned Hive tables converted to using {{TableFileCatalog}}. When the 
> Hive configuration option {{hive.metastore.try.direct.sql}} is set to 
> {{false}}, Hive will throw an exception for unsupported filter expressions. 
> For example, attempting to filter on an integer partition column will throw a 
> {{org.apache.hadoop.hive.metastore.api.MetaException}}.
> I discovered this behavior because VideoAmp uses the CDH version of Hive with 
> a Postgresql metastore DB. In this configuration, CDH sets 
> {{hive.metastore.try.direct.sql}} to {{false}} by default, and queries that 
> filter on a non-string partition column will fail. That would be a rather 
> rude surprise for these Spark 2.1 users...
> I'm not sure exactly what behavior we should expect, but I suggest that 
> {{HiveClientImpl.getPartitionsByFilter}} catch this metastore exception and 
> return all partitions instead. This is what Spark does for Hive 0.12 users, 
> which does not support this feature at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17992) HiveClient.getPartitionsByFilter throws an exception for some unsupported filters when hive.metastore.try.direct.sql=false

2016-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17992:


Assignee: Apache Spark

> HiveClient.getPartitionsByFilter throws an exception for some unsupported 
> filters when hive.metastore.try.direct.sql=false
> --
>
> Key: SPARK-17992
> URL: https://issues.apache.org/jira/browse/SPARK-17992
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Allman
>Assignee: Apache Spark
>
> We recently added (and enabled by default) table partition pruning for 
> partitioned Hive tables converted to using {{TableFileCatalog}}. When the 
> Hive configuration option {{hive.metastore.try.direct.sql}} is set to 
> {{false}}, Hive will throw an exception for unsupported filter expressions. 
> For example, attempting to filter on an integer partition column will throw a 
> {{org.apache.hadoop.hive.metastore.api.MetaException}}.
> I discovered this behavior because VideoAmp uses the CDH version of Hive with 
> a Postgresql metastore DB. In this configuration, CDH sets 
> {{hive.metastore.try.direct.sql}} to {{false}} by default, and queries that 
> filter on a non-string partition column will fail. That would be a rather 
> rude surprise for these Spark 2.1 users...
> I'm not sure exactly what behavior we should expect, but I suggest that 
> {{HiveClientImpl.getPartitionsByFilter}} catch this metastore exception and 
> return all partitions instead. This is what Spark does for Hive 0.12 users, 
> which does not support this feature at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17992) HiveClient.getPartitionsByFilter throws an exception for some unsupported filters when hive.metastore.try.direct.sql=false

2016-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616068#comment-15616068
 ] 

Apache Spark commented on SPARK-17992:
--

User 'mallman' has created a pull request for this issue:
https://github.com/apache/spark/pull/15673

> HiveClient.getPartitionsByFilter throws an exception for some unsupported 
> filters when hive.metastore.try.direct.sql=false
> --
>
> Key: SPARK-17992
> URL: https://issues.apache.org/jira/browse/SPARK-17992
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Allman
>
> We recently added (and enabled by default) table partition pruning for 
> partitioned Hive tables converted to using {{TableFileCatalog}}. When the 
> Hive configuration option {{hive.metastore.try.direct.sql}} is set to 
> {{false}}, Hive will throw an exception for unsupported filter expressions. 
> For example, attempting to filter on an integer partition column will throw a 
> {{org.apache.hadoop.hive.metastore.api.MetaException}}.
> I discovered this behavior because VideoAmp uses the CDH version of Hive with 
> a Postgresql metastore DB. In this configuration, CDH sets 
> {{hive.metastore.try.direct.sql}} to {{false}} by default, and queries that 
> filter on a non-string partition column will fail. That would be a rather 
> rude surprise for these Spark 2.1 users...
> I'm not sure exactly what behavior we should expect, but I suggest that 
> {{HiveClientImpl.getPartitionsByFilter}} catch this metastore exception and 
> return all partitions instead. This is what Spark does for Hive 0.12 users, 
> which does not support this feature at all.
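
A hedged, self-contained sketch of the "catch and fall back" shape suggested in the 
description; the names are illustrative only, not the actual HiveClientImpl change:

{code}
// Try metastore-side partition pruning; if the metastore rejects the pushed-down
// predicate (e.g. MetaException when hive.metastore.try.direct.sql=false),
// fall back to returning all partitions.
def partitionsByFilter[T](prunedByFilter: => Seq[T], allPartitions: => Seq[T]): Seq[T] =
  try {
    prunedByFilter
  } catch {
    case _: Exception => allPartitions
  }
{code}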



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-882) Have link for feedback/suggestions in docs

2016-10-28 Thread Deron Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615975#comment-15615975
 ] 

Deron Eriksson commented on SPARK-882:
--

It looks like the More menu in the docs (http://spark.apache.org/docs/2.0.1/) 
already contains a "Contributing to Spark" link which takes the user to 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark. Since 
the link already exists, perhaps this JIRA should be resolved and closed?

> Have link for feedback/suggestions in docs
> --
>
> Key: SPARK-882
> URL: https://issues.apache.org/jira/browse/SPARK-882
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Patrick Wendell
>Assignee: Patrick Cogan
>Priority: Minor
>
> It would be cool to have a link at the top of the docs for 
> feedback/suggestions/errors. I bet we'd get a lot of interesting stuff from 
> that and it could be a good way to crowdsource correctness checking, since a 
> lot of us that write them never have to use them.
> Something to the right of the main top nav might be good. [~andyk] [~matei] - 
> what do you guys think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18014) Filters are incorrectly being grouped together when there is processing in between

2016-10-28 Thread Michael Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Patterson updated SPARK-18014:
--
Environment: Pyspark 2.0.1, Ipython 4.2  (was: Pyspark 2.0.0, Ipython 4.2)

> Filters are incorrectly being grouped together when there is processing in 
> between
> --
>
> Key: SPARK-18014
> URL: https://issues.apache.org/jira/browse/SPARK-18014
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Pyspark 2.0.1, Ipython 4.2
>Reporter: Michael Patterson
>Priority: Minor
>
> I created a dataframe that needed to filter the data on columnA, create a new 
> columnB by applying a user defined function to columnA, and then filter on 
> columnB. However, the two filters were being grouped together in the 
> execution plan after the withColumn statement, which was causing errors due 
> to unexpected input to the withColumn statement.
> Example code to reproduce:
> {code}
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> from functools import partial
> data = [{'input':0}, {'input':1}, {'input':2}]
> input_df = sc.parallelize(data).toDF()
> my_dict = {1:'first', 2:'second'}
> def apply_dict(input_dict, value):
>     return input_dict[value]
> test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
> test_df = input_df.filter('input > 0').withColumn('output', 
> test_udf('input')).filter(F.col('output').rlike('^s'))
> test_df.explain(True)
> {code}
> Execution plan:
> {code}
> == Analyzed Logical Plan ==
> input: bigint, output: string
> Filter output#4 RLIKE ^s
> +- Project [input#0L, partial(input#0L) AS output#4]
>+- Filter (input#0L > cast(0 as bigint))
>   +- LogicalRDD [input#0L]
> == Optimized Logical Plan ==
> Project [input#0L, partial(input#0L) AS output#4]
> +- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE 
> ^s)
>+- LogicalRDD [input#0L]
> {code}
> Executing test_def.show() after the above code in pyspark 2.0.1 yields:
> KeyError: 0
> Executing test_def.show() in pyspark 1.6.2 yields:
> {code}
> +-+--+
> |input|output|
> +-+--+
> |2|second|
> +-+--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13331) AES support for over-the-wire encryption

2016-10-28 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615888#comment-15615888
 ] 

Marcelo Vanzin commented on SPARK-13331:


You need to be a little patient. People have things to do outside of reviewing 
your code.

> AES support for over-the-wire encryption
> 
>
> Key: SPARK-13331
> URL: https://issues.apache.org/jira/browse/SPARK-13331
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Dong Chen
>Priority: Minor
>
> In network/common, SASL with DIGEST-MD5 authentication is used for negotiating a 
> secure communication channel. When the SASL operation mode is "auth-conf", the data 
> transferred on the network is encrypted. The DIGEST-MD5 mechanism supports the 
> following encryption algorithms: 3DES, DES, and RC4. The negotiation procedure will 
> select one of them to encrypt / decrypt the data on the channel.
> However, 3DES and RC4 are relatively slow. We could add code to the negotiation to 
> support AES for better security and performance.
> The proposed solution is:
> When "auth-conf" is enabled, at the end of the original negotiation, authentication 
> succeeds and a secure channel is built. We could add one more negotiation step: the 
> client and server negotiate whether they both support AES. If yes, the key and IV used 
> by AES will be generated by the server and sent to the client through the already 
> secure channel. Then the encryption / decryption handlers are updated to AES at both 
> the client and server side. Subsequent data transfer will use AES instead of the 
> original encryption algorithm.
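
A hedged illustration of the key/IV step described above, using only standard 
javax.crypto APIs; this is not the actual network/common implementation:

{code}
import javax.crypto.{Cipher, KeyGenerator}
import javax.crypto.spec.IvParameterSpec
import java.security.SecureRandom

// Server side: generate an AES key and IV after the SASL channel is established.
val keyGen = KeyGenerator.getInstance("AES")
keyGen.init(128)                                  // 128-bit key
val key = keyGen.generateKey()

val iv = new Array[Byte](16)
new SecureRandom().nextBytes(iv)

// Both sides would then switch their stream handlers to an AES cipher.
val encryptCipher = Cipher.getInstance("AES/CTR/NoPadding")
encryptCipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv))
// key.getEncoded and iv are shipped to the client over the already-secured channel.
{code}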



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18162) SparkEnv.get.metricsSystem in spark-shell results in error: missing or invalid dependency detected while loading class file 'MetricsSystem.class'

2016-10-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615419#comment-15615419
 ] 

Sean Owen commented on SPARK-18162:
---

I don't observe this on master right now. Are you sure you did a clean build?
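
For reference, a full clean rebuild of master, e.g.:

{code}
./build/mvn -DskipTests clean package
# or, with sbt:
./build/sbt clean package
{code}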

> SparkEnv.get.metricsSystem in spark-shell results in error: missing or 
> invalid dependency detected while loading class file 'MetricsSystem.class'
> -
>
> Key: SPARK-18162
> URL: https://issues.apache.org/jira/browse/SPARK-18162
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> This is with the build today from master.
> {code}
> $ ./bin/spark-shell --version
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.1.0-SNAPSHOT
>   /_/
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_102
> Branch master
> Compiled by user jacek on 2016-10-28T04:05:11Z
> Revision ab5f938bc7c3c9b137d63e479fced2b7e9c9d75b
> Url https://github.com/apache/spark.git
> Type --help for more information.
> $ ./bin/spark-shell
> scala> SparkEnv.get.metricsSystem
> error: missing or invalid dependency detected while loading class file 
> 'MetricsSystem.class'.
> Could not access term eclipse in package org,
> because it (or its dependencies) are missing. Check your build definition for
> missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see 
> the problematic classpath.)
> A full rebuild may help if 'MetricsSystem.class' was compiled against an 
> incompatible version of org.
> error: missing or invalid dependency detected while loading class file 
> 'MetricsSystem.class'.
> Could not access term jetty in value org.eclipse,
> because it (or its dependencies) are missing. Check your build definition for
> missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see 
> the problematic classpath.)
> A full rebuild may help if 'MetricsSystem.class' was compiled against an 
> incompatible version of org.eclipse.
> scala> spark.version
> res3: String = 2.1.0-SNAPSHOT
> {code}
> I could not find any information about how to set it up in the [official 
> documentation|http://spark.apache.org/docs/latest/monitoring.html#metrics].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18163) Union unexpected behaviour when generating data frames programatically

2016-10-28 Thread Ulrich zink (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ulrich zink closed SPARK-18163.
---
Resolution: Invalid

> Union unexpected behaviour when generating data frames programatically
> --
>
> Key: SPARK-18163
> URL: https://issues.apache.org/jira/browse/SPARK-18163
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Ulrich zink
>
> //expected behaviour
> val df1 = Seq((1,2),(3,4)).toDF("a","b")
> val df2 = Seq((5,6)).toDF("a","b")
> df1.union(df2).show()
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  2|
> |  3|  4|
> |  5|  6|
> +---+---+
> // When generating the data frames programmatically
> val nInst = 2
> val fltr = 1
> case class Instrument(id: Long,  value: Double)
> def dataset(nst: Int, fltrVal: Int) =
>   sqlContext.range(0, nst)
>     .select($"id", round(abs(randn)).alias("value"))
>     .as[Instrument]
>     .filter('value > fltrVal)
> val df3 = dataset(nInst,fltr)
> val df4 = dataset(nInst,fltr)
> df3.show()
> df4.show()
> df3.union(df4).show()
> df3: org.apache.spark.sql.Dataset[Instrument] = [id: bigint, value: double]
> +---+-+
> | id|value|
> +---+-+
> |  0|  1.0|
> |  1|  1.0|
> +---+-+
> df4: org.apache.spark.sql.Dataset[Instrument] = [id: bigint, value: double]
> +---+-+
> | id|value|
> +---+-+
> |  0|  1.0|
> |  1|  0.0|
> +---+-+
> +---+-+
> | id|value|
> +---+-+
> |  0|  1.0|
> |  1|  1.0|
> |  0|  1.0|
> |  1|  2.0|
> +---+-+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18148) Misleading Error Message for Aggregation Without Window/GroupBy

2016-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18148:


Assignee: (was: Apache Spark)

> Misleading Error Message for Aggregation Without Window/GroupBy
> ---
>
> Key: SPARK-18148
> URL: https://issues.apache.org/jira/browse/SPARK-18148
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Databricks
>Reporter: Pat McDonough
>
> The following error message points to a random column I'm not actually using 
> in my query, making it hard to diagnose.
> {code}
> org.apache.spark.sql.AnalysisException: expression '`randomColumn`' is 
> neither present in the group by, nor is it an aggregate function. Add to 
> group by or wrap in first() (or first_value) if you don't care which value 
> you get.;
> {code}
> Note in the code below, I forgot to add {{.over(weeklyWindow)}} in the line 
> for {{withColumn("user_count"...}}
> {code}
> spark.read.load("/some-data")
>   .withColumn("date_dt", to_date($"date"))
>   .withColumn("year", year($"date_dt"))
>   .withColumn("week", weekofyear($"date_dt"))
>   .withColumn("user_count", count($"userId"))
>   .withColumn("daily_max_in_week", max($"user_count").over(weeklyWindow))
> )
> {code}
> CC: [~marmbrus]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18148) Misleading Error Message for Aggregation Without Window/GroupBy

2016-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18148:


Assignee: Apache Spark

> Misleading Error Message for Aggregation Without Window/GroupBy
> ---
>
> Key: SPARK-18148
> URL: https://issues.apache.org/jira/browse/SPARK-18148
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Databricks
>Reporter: Pat McDonough
>Assignee: Apache Spark
>
> The following error message points to a random column I'm not actually using 
> in my query, making it hard to diagnose.
> {code}
> org.apache.spark.sql.AnalysisException: expression '`randomColumn`' is 
> neither present in the group by, nor is it an aggregate function. Add to 
> group by or wrap in first() (or first_value) if you don't care which value 
> you get.;
> {code}
> Note in the code below, I forgot to add {{.over(weeklyWindow)}} in the line 
> for {{withColumn("user_count"...}}
> {code}
> spark.read.load("/some-data")
>   .withColumn("date_dt", to_date($"date"))
>   .withColumn("year", year($"date_dt"))
>   .withColumn("week", weekofyear($"date_dt"))
>   .withColumn("user_count", count($"userId"))
>   .withColumn("daily_max_in_week", max($"user_count").over(weeklyWindow))
> )
> {code}
> CC: [~marmbrus]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18148) Misleading Error Message for Aggregation Without Window/GroupBy

2016-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615121#comment-15615121
 ] 

Apache Spark commented on SPARK-18148:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/15672

> Misleading Error Message for Aggregation Without Window/GroupBy
> ---
>
> Key: SPARK-18148
> URL: https://issues.apache.org/jira/browse/SPARK-18148
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Databricks
>Reporter: Pat McDonough
>
> The following error message points to a random column I'm not actually using 
> in my query, making it hard to diagnose.
> {code}
> org.apache.spark.sql.AnalysisException: expression '`randomColumn`' is 
> neither present in the group by, nor is it an aggregate function. Add to 
> group by or wrap in first() (or first_value) if you don't care which value 
> you get.;
> {code}
> Note in the code below, I forgot to add {{.over(weeklyWindow)}} in the line 
> for {{withColumn("user_count"...}}
> {code}
> spark.read.load("/some-data")
>   .withColumn("date_dt", to_date($"date"))
>   .withColumn("year", year($"date_dt"))
>   .withColumn("week", weekofyear($"date_dt"))
>   .withColumn("user_count", count($"userId"))
>   .withColumn("daily_max_in_week", max($"user_count").over(weeklyWindow))
> )
> {code}
> CC: [~marmbrus]
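
A hedged sketch of the intended query; {{weeklyWindow}} is not shown in the report, so the 
partitioning below is an assumption:

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// Assumed definition of the window the report refers to.
val weeklyWindow = Window.partitionBy($"year", $"week")

spark.read.load("/some-data")
  .withColumn("date_dt", to_date($"date"))
  .withColumn("year", year($"date_dt"))
  .withColumn("week", weekofyear($"date_dt"))
  .withColumn("user_count", count($"userId").over(weeklyWindow))          // the missing .over(...)
  .withColumn("daily_max_in_week", max($"user_count").over(weeklyWindow))
{code}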



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18163) Union unexpected behaviour when generating data frames programatically

2016-10-28 Thread Ulrich zink (JIRA)
Ulrich zink created SPARK-18163:
---

 Summary: Union unexpected behaviour when generating data frames 
programatically
 Key: SPARK-18163
 URL: https://issues.apache.org/jira/browse/SPARK-18163
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
Reporter: Ulrich zink


//expected behaviour
val df1 = Seq((1,2),(3,4)).toDF("a","b")
val df2 = Seq((5,6)).toDF("a","b")
df1.union(df2).show()

+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  3|  4|
|  5|  6|
+---+---+

// When generating the data frames programmatically

val nInst = 2
val fltr = 1

case class Instrument(id: Long,  value: Double)
def dataset(nst: Int, fltrVal: Int) =
  sqlContext.range(0, nst)
    .select($"id", round(abs(randn)).alias("value"))
    .as[Instrument]
    .filter('value > fltrVal)
val df3 = dataset(nInst,fltr)
val df4 = dataset(nInst,fltr)

df3.show()
df4.show()
df3.union(df4).show()

df3: org.apache.spark.sql.Dataset[Instrument] = [id: bigint, value: double]
+---+-+
| id|value|
+---+-+
|  0|  1.0|
|  1|  1.0|
+---+-+
df4: org.apache.spark.sql.Dataset[Instrument] = [id: bigint, value: double]
+---+-+
| id|value|
+---+-+
|  0|  1.0|
|  1|  0.0|
+---+-+
+---+-+
| id|value|
+---+-+
|  0|  1.0|
|  1|  1.0|
|  0|  1.0|
|  1|  2.0|
+---+-+
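
A hedged note on the likely cause: {{randn}} is non-deterministic and the plan is 
re-evaluated on every action, so the union can legitimately show values different from the 
earlier {{show()}} calls. Materialising the datasets first pins the values:

{code}
// Cache and force evaluation once, so later actions (including the union) reuse
// the same random values instead of recomputing randn.
val df3 = dataset(nInst, fltr).cache()
val df4 = dataset(nInst, fltr).cache()
df3.count(); df4.count()

df3.union(df4).show()   // now consistent with df3.show() / df4.show()
{code}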




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18125) Spark generated code causes CompileException when groupByKey, reduceGroups and map(_._2) are used

2016-10-28 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615034#comment-15615034
 ] 

Kazuaki Ishizaki commented on SPARK-18125:
--

I confirmed that this code reproduces the problem on 2.0.1.
This problem occurs for a reason similar to SPARK-18147.

Calling {{ctx.splitExpression}} in {{createStruct.doGenCode}} makes *a variable* 
inaccessible by splitting the original single function into multiple functions.
SPARK-14793 seems to have introduced {{ctx.splitExpression}} here in April to fix other 
issues. Since Ray says this code works in 2.0.0, other changes may have introduced this 
issue.
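
A toy illustration of that scoping problem (plain Scala, not Spark's generated Java):

{code}
// While everything stays in one method, the local is visible...
def whole(i: Int): Int = {
  val value4 = i * 2
  value4 + 1            // fine: same scope
}

// ...but once the body is split into a helper method, the local is no longer in scope
// there, which is the shape of the Unknown variable or type "value4" error above.
def helper(i: Int): Int = {
  // value4 + 1         // would not compile: value4 only exists inside whole()
  i
}
{code}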

> Spark generated code causes CompileException when groupByKey, reduceGroups 
> and map(_._2) are used
> -
>
> Key: SPARK-18125
> URL: https://issues.apache.org/jira/browse/SPARK-18125
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
>Reporter: Ray Qiu
>Priority: Critical
>
> Code logic looks like this:
> {noformat}
> .groupByKey
> .reduceGroups
> .map(_._2)
> {noformat}
> Works fine with 2.0.0.
> 2.0.1 error Message: 
> {noformat}
> Caused by: java.util.concurrent.ExecutionException: java.lang.Exception: 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 206, Column 123: Unknown variable or type "value4"
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificMutableProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private MutableRow mutableRow;
> /* 009 */   private Object[] values;
> /* 010 */   private java.lang.String errMsg;
> /* 011 */   private java.lang.String errMsg1;
> /* 012 */   private boolean MapObjects_loopIsNull1;
> /* 013 */   private io.mistnet.analytics.lib.ConnLog MapObjects_loopValue0;
> /* 014 */   private java.lang.String errMsg2;
> /* 015 */   private Object[] values1;
> /* 016 */   private boolean MapObjects_loopIsNull3;
> /* 017 */   private java.lang.String MapObjects_loopValue2;
> /* 018 */   private boolean isNull_0;
> /* 019 */   private boolean value_0;
> /* 020 */   private boolean isNull_1;
> /* 021 */   private InternalRow value_1;
> /* 022 */
> /* 023 */   private void apply_4(InternalRow i) {
> /* 024 */
> /* 025 */ boolean isNull52 = MapObjects_loopIsNull1;
> /* 026 */ final double value52 = isNull52 ? -1.0 : 
> MapObjects_loopValue0.ts();
> /* 027 */ if (isNull52) {
> /* 028 */   values1[8] = null;
> /* 029 */ } else {
> /* 030 */   values1[8] = value52;
> /* 031 */ }
> /* 032 */ boolean isNull54 = MapObjects_loopIsNull1;
> /* 033 */ final java.lang.String value54 = isNull54 ? null : 
> (java.lang.String) MapObjects_loopValue0.uid();
> /* 034 */ isNull54 = value54 == null;
> /* 035 */ boolean isNull53 = isNull54;
> /* 036 */ final UTF8String value53 = isNull53 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value54);
> /* 037 */ isNull53 = value53 == null;
> /* 038 */ if (isNull53) {
> /* 039 */   values1[9] = null;
> /* 040 */ } else {
> /* 041 */   values1[9] = value53;
> /* 042 */ }
> /* 043 */ boolean isNull56 = MapObjects_loopIsNull1;
> /* 044 */ final java.lang.String value56 = isNull56 ? null : 
> (java.lang.String) MapObjects_loopValue0.src();
> /* 045 */ isNull56 = value56 == null;
> /* 046 */ boolean isNull55 = isNull56;
> /* 047 */ final UTF8String value55 = isNull55 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value56);
> /* 048 */ isNull55 = value55 == null;
> /* 049 */ if (isNull55) {
> /* 050 */   values1[10] = null;
> /* 051 */ } else {
> /* 052 */   values1[10] = value55;
> /* 053 */ }
> /* 054 */   }
> /* 055 */
> /* 056 */
> /* 057 */   private void apply_7(InternalRow i) {
> /* 058 */
> /* 059 */ boolean isNull69 = MapObjects_loopIsNull1;
> /* 060 */ final scala.Option value69 = isNull69 ? null : (scala.Option) 
> MapObjects_loopValue0.orig_bytes();
> /* 061 */ isNull69 = value69 == null;
> /* 062 */
> /* 063 */ final boolean isNull68 = isNull69 || value69.isEmpty();
> /* 064 */ long value68 = isNull68 ?
> /* 065 */ -1L : (Long) value69.get();
> /* 066 */ if (isNull68) {
> /* 067 */   values1[17] = null;
> /* 068 */ } else {
> /* 069 */   values1[17] = value68;
> /* 070 */ }
> /* 071 */ boolean isNull71 = MapObjects_loopIsNull1;
> /* 072 */ final scala.Option value71 = isNull71 ? null : (scala.Option) 
> MapObjects_loopValue0.resp_bytes();
> /* 0

[jira] [Commented] (SPARK-18159) Stand-alone cluster, supervised app: restart of worker hosting the driver causes app to run twice

2016-10-28 Thread Stephan Kepser (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615025#comment-15615025
 ] 

Stephan Kepser commented on SPARK-18159:


I saw the old executors kept running for several hours (more than 5h). 
And we have a Stand-alone Spark cluster without Yarn or Mesos. Thus using yarn 
to kill the old executors is unfortunately not an option. And killing the old 
executors via the REST API also failed. They are immediately re-started. 
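
For reference, the standalone-mode docs describe killing a driver explicitly (placeholders 
below for the master URL and driver id); with supervision enabled the master may relaunch 
it, which matches the behaviour described here:

{code}
./bin/spark-class org.apache.spark.deploy.Client kill spark://<master-host>:7077 <driverId>
{code}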

> Stand-alone cluster, supervised app: restart of worker hosting the driver 
> causes app to run twice
> -
>
> Key: SPARK-18159
> URL: https://issues.apache.org/jira/browse/SPARK-18159
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.2
>Reporter: Stephan Kepser
>Priority: Critical
>
> We use Spark in stand-alone cluster mode with HA with three master nodes. All 
> apps are submitted using
> > spark-submit --deploy-mode cluster --supervised --master ...
> We have many apps running. 
> The deploy-mode cluster is needed to prevent the drivers of the apps to be 
> all placed on the active master. 
> If a worker goes down that hosts a driver, the following happens:
> * the driver is started on another worker node
> * the new driver does not connect to the still running app
> * the new driver starts a new instance of the running app
> * there are now two instances of the app running, 
>   * one with an attached new driver,
>   * one without a driver.
> * the old instance of the app cannot effectively be stopped, i.e., it can be 
> killed via the UI, but it is immediately restarted.
> Iterating this process causes more and more instances of the app running.
> To get the effect both options --deploy-mode cluster and --supervised are 
> required.  
> The only remedy we know of is reboot all linux nodes the cluster runs on.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18159) Stand-alone cluster, supervised app: restart of worker hosting the driver causes app to run twice

2016-10-28 Thread Stephan Kepser (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615024#comment-15615024
 ] 

Stephan Kepser commented on SPARK-18159:


I saw the old executors kept running for several hours (more than 5h). 
And we have a Stand-alone Spark cluster without Yarn or Mesos. Thus using yarn 
to kill the old executors is unfortunately not an option. And killing the old 
executors via the REST API also failed. They are immediately re-started. 

> Stand-alone cluster, supervised app: restart of worker hosting the driver 
> causes app to run twice
> -
>
> Key: SPARK-18159
> URL: https://issues.apache.org/jira/browse/SPARK-18159
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.2
>Reporter: Stephan Kepser
>Priority: Critical
>
> We use Spark in stand-alone cluster mode with HA with three master nodes. All 
> apps are submitted using
> > spark-submit --deploy-mode cluster --supervised --master ...
> We have many apps running. 
> The deploy-mode cluster is needed to prevent the drivers of the apps to be 
> all placed on the active master. 
> If a worker goes down that hosts a driver, the following happens:
> * the driver is started on another worker node
> * the new driver does not connect to the still running app
> * the new driver starts a new instance of the running app
> * there are now two instances of the app running, 
>   * one with an attached new driver,
>   * one without a driver.
> * the old instance of the app cannot effectively be stopped, i.e., it can be 
> killed via the UI, but it is immediately restarted.
> Iterating this process causes more and more instances of the app running.
> To get the effect both options --deploy-mode cluster and --supervised are 
> required.  
> The only remedy we know of is reboot all linux nodes the cluster runs on.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-18159) Stand-alone cluster, supervised app: restart of worker hosting the driver causes app to run twice

2016-10-28 Thread Stephan Kepser (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Kepser updated SPARK-18159:
---
Comment: was deleted

(was: I saw the old executors kept running for several hours (more than 5h). 
And we have a Stand-alone Spark cluster without Yarn or Mesos. Thus using yarn 
to kill the old executors is unfortunately not an option. And killing the old 
executors via the REST API also failed. They are immediately re-started. )

> Stand-alone cluster, supervised app: restart of worker hosting the driver 
> causes app to run twice
> -
>
> Key: SPARK-18159
> URL: https://issues.apache.org/jira/browse/SPARK-18159
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.2
>Reporter: Stephan Kepser
>Priority: Critical
>
> We use Spark in stand-alone cluster mode with HA with three master nodes. All 
> apps are submitted using
> > spark-submit --deploy-mode cluster --supervised --master ...
> We have many apps running. 
> The deploy-mode cluster is needed to prevent the drivers of the apps to be 
> all placed on the active master. 
> If a worker goes down that hosts a driver, the following happens:
> * the driver is started on another worker node
> * the new driver does not connect to the still running app
> * the new driver starts a new instance of the running app
> * there are now two instances of the app running, 
>   * one with an attached new driver,
>   * one without a driver.
> * the old instance of the app cannot effectively be stopped, i.e., it can be 
> killed via the UI, but it is immediately restarted.
> Iterating this process causes more and more instances of the app running.
> To get the effect both options --deploy-mode cluster and --supervised are 
> required.  
> The only remedy we know of is reboot all linux nodes the cluster runs on.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14567) Add instrumentation logs to MLlib training algorithms

2016-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615017#comment-15615017
 ] 

Apache Spark commented on SPARK-14567:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/15671

> Add instrumentation logs to MLlib training algorithms
> -
>
> Key: SPARK-14567
> URL: https://issues.apache.org/jira/browse/SPARK-14567
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
>
> In order to debug performance issues when training mllib algorithms,
> it is useful to log some metrics about the training dataset, the training 
> parameters, etc.
> This ticket is an umbrella to add some simple logging messages to the most 
> common MLlib estimators. There should be no performance impact on the current 
> implementation, and the output is simply printed in the logs.
> Here are some values that are of interest when debugging training tasks:
> * number of features
> * number of instances
> * number of partitions
> * number of classes
> * input RDD/DF cache level
> * hyper-parameters
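
A hedged sketch of the kind of per-fit logging that list describes; the names and the 
plain println output are illustrative only, not the actual patch:

{code}
import org.apache.spark.sql.Dataset

// Log basic facts about the training input and the hyper-parameters used for a fit.
def logTrainingSummary(data: Dataset[_], params: Map[String, Any]): Unit = {
  println(s"numInstances=${data.count()}")
  println(s"numPartitions=${data.rdd.getNumPartitions}")
  println(s"numColumns=${data.columns.length}")
  params.foreach { case (k, v) => println(s"param $k=$v") }
}
{code}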



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14567) Add instrumentation logs to MLlib training algorithms

2016-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14567:


Assignee: Apache Spark  (was: Timothy Hunter)

> Add instrumentation logs to MLlib training algorithms
> -
>
> Key: SPARK-14567
> URL: https://issues.apache.org/jira/browse/SPARK-14567
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Timothy Hunter
>Assignee: Apache Spark
>
> In order to debug performance issues when training mllib algorithms,
> it is useful to log some metrics about the training dataset, the training 
> parameters, etc.
> This ticket is an umbrella to add some simple logging messages to the most 
> common MLlib estimators. There should be no performance impact on the current 
> implementation, and the output is simply printed in the logs.
> Here are some values that are of interest when debugging training tasks:
> * number of features
> * number of instances
> * number of partitions
> * number of classes
> * input RDD/DF cache level
> * hyper-parameters



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14567) Add instrumentation logs to MLlib training algorithms

2016-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14567:


Assignee: Timothy Hunter  (was: Apache Spark)

> Add instrumentation logs to MLlib training algorithms
> -
>
> Key: SPARK-14567
> URL: https://issues.apache.org/jira/browse/SPARK-14567
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
>
> In order to debug performance issues when training mllib algorithms,
> it is useful to log some metrics about the training dataset, the training 
> parameters, etc.
> This ticket is an umbrella to add some simple logging messages to the most 
> common MLlib estimators. There should be no performance impact on the current 
> implementation, and the output is simply printed in the logs.
> Here are some values that are of interest when debugging training tasks:
> * number of features
> * number of instances
> * number of partitions
> * number of classes
> * input RDD/DF cache level
> * hyper-parameters



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18150) Spark 2.* failes to create partitions for avro files

2016-10-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18150.
---
Resolution: Invalid

Please start on the mailing list with a more detailed question, after reviewing 
the contributing guide.

>  Spark 2.* failes to create partitions for avro files
> -
>
> Key: SPARK-18150
> URL: https://issues.apache.org/jira/browse/SPARK-18150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Reporter: Sunil Kumar
>Priority: Blocker
>
> I am using Apache Spark 2.0.1 for processing the Grid HDFS Avro file, however 
> I don't see spark distributing the job into different tasks instead it uses 
> single task and all the operations (read, load, filter, show ) are done in a 
> sequence using same task.
> This means I am not able to leverage distributed parallel processing.
> I tried the same operation on JSON file on HDFS, it works good, means the job 
> gets distributed into multiple tasks and partition. I see parallelism.
> I then tested the same on Spark 1.6, there it does the partitioning. Looks 
> like there is a bug in Spark 2.* version. If not can some one help me know 
> how to achieve the same on Avro file, do I need to do something special for 
> Avro files ?
> Note:
> I explored spark setting: "spark.default.parallelism",  
> "spark.sql.files.maxPartitionBytes", "--num-executors" and 
> "spark.sql.shuffle.partitions". These were not of much help. 
> "spark.default.parallelism", ensured to have multiple tasks however a single 
> task ended up performing all the operation.
> I am using com.databricks.spark.avro (3.0.1) for Spark 2.0.1.
> Thanks,
> Sunil
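
If it helps narrow this down, a quick way to see how many partitions the Avro load actually produces is to inspect the underlying RDD. This is a hedged pyspark sketch; the path and column name are hypothetical, and the spark-avro package is the one mentioned in the description above.

{code:python}
# Hypothetical HDFS path; adjust to the actual Avro data set.
df = spark.read.format("com.databricks.spark.avro").load("hdfs:///data/events.avro")

# Number of partitions the scan produced; 1 here would match the reported behaviour.
print(df.rdd.getNumPartitions())

# Partition count after a filter, for comparison (hypothetical column name).
print(df.filter(df["someColumn"].isNotNull()).rdd.getNumPartitions())
{code}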



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18151) CLONE - MetadataLog should support purging old logs

2016-10-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18151.
---
Resolution: Invalid

> CLONE - MetadataLog should support purging old logs
> ---
>
> Key: SPARK-18151
> URL: https://issues.apache.org/jira/browse/SPARK-18151
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Sunil Kumar
>Assignee: Peter Lee
> Fix For: 2.1.0, 2.0.1
>
>
> This is a useful primitive operation to have to support checkpointing and 
> forgetting old logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18157) CLONE - Support purging aged file entry for FileStreamSource metadata log

2016-10-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18157.
---
Resolution: Invalid

> CLONE - Support purging aged file entry for FileStreamSource metadata log
> -
>
> Key: SPARK-18157
> URL: https://issues.apache.org/jira/browse/SPARK-18157
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Sunil Kumar
>Priority: Minor
>
> Currently with SPARK-15698, FileStreamSource metadata log will be compacted 
> periodically (10 batches by default), this means compacted batch file will 
> contain whole file entries been processed. With the time passed, the 
> compacted batch file will be accumulated to a relative large file. 
> With SPARK-17165, now {{FileStreamSource}} doesn't track the aged file entry, 
> but in the log we still keep the full records,  this is not necessary and 
> quite time-consuming during recovery. So here propose to also add file entry 
> purging ability to {{FileStreamSource}} metadata log.
> This is pending on SPARK-15698.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18154) CLONE - Change Source API so that sources do not need to keep unbounded state

2016-10-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18154.
---
Resolution: Invalid

> CLONE - Change Source API so that sources do not need to keep unbounded state
> -
>
> Key: SPARK-18154
> URL: https://issues.apache.org/jira/browse/SPARK-18154
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Sunil Kumar
>Assignee: Frederick Reiss
> Fix For: 2.0.3, 2.1.0
>
>
> The version of the Source API in Spark 2.0.0 defines a single getBatch() 
> method for fetching records from the source, with the following Scaladoc 
> comments defining the semantics:
> {noformat}
> /**
>  * Returns the data that is between the offsets (`start`, `end`]. When 
> `start` is `None` then
>  * the batch should begin with the first available record. This method must 
> always return the
>  * same data for a particular `start` and `end` pair.
>  */
> def getBatch(start: Option[Offset], end: Offset): DataFrame
> {noformat}
> These semantics mean that a Source must retain all past history for the 
> stream that it backs. Further, a Source is also required to retain this data 
> across restarts of the process where the Source is instantiated, even when 
> the Source is restarted on a different machine.
> These restrictions make it difficult to implement the Source API, as any 
> implementation requires potentially unbounded amounts of distributed storage.
> See the mailing list thread at 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Source-API-requires-unbounded-distributed-storage-td18551.html]
>  for more information.
> This JIRA will cover augmenting the Source API with an additional callback 
> that will allow Structured Streaming scheduler to notify the source when it 
> is safe to discard buffered data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18155) CLONE - HDFSMetadataLog should not leak CRC files

2016-10-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18155.
---
Resolution: Invalid

> CLONE - HDFSMetadataLog should not leak CRC files
> -
>
> Key: SPARK-18155
> URL: https://issues.apache.org/jira/browse/SPARK-18155
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Sunil Kumar
>
> When HDFSMetadataLog uses a log directory on a filesystem other than HDFS 
> (i.e. NFS or the driver node's local filesystem), the class leaves orphan 
> checksum (CRC) files in the log directory. The files have names that follow 
> the pattern "..[long UUID hex string].tmp.crc". These files exist because 
> HDFSMetaDataLog renames other temporary files without renaming the 
> corresponding checksum files. There is one CRC file per batch, so the 
> directory fills up quite quickly.
> I'm not certain, but this problem might also occur on certain versions of the 
> HDFS APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18152) CLONE - FileStreamSource should not track the list of seen files indefinitely

2016-10-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18152.
---
Resolution: Invalid

> CLONE - FileStreamSource should not track the list of seen files indefinitely
> -
>
> Key: SPARK-18152
> URL: https://issues.apache.org/jira/browse/SPARK-18152
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Sunil Kumar
>Assignee: Peter Lee
> Fix For: 2.1.0, 2.0.1
>
>
> FileStreamSource currently tracks all the files seen indefinitely, which 
> means it can run out of memory or overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18153) CLONE - Ability to remove old metadata for structure streaming MetadataLog

2016-10-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18153.
---
Resolution: Invalid

> CLONE - Ability to remove old metadata for structure streaming MetadataLog
> --
>
> Key: SPARK-18153
> URL: https://issues.apache.org/jira/browse/SPARK-18153
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Sunil Kumar
>Assignee: Saisai Shao
> Fix For: 2.1.0, 2.0.1
>
>
> Current MetadataLog lacks the ability to remove old checkpoint file, we'd 
> better add this functionality to the MetadataLog and honor it in the place 
> where MetadataLog is used, that will reduce unnecessary small files in the 
> long running scenarios.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18156) CLONE - StreamExecution should discard unneeded metadata

2016-10-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18156.
---
Resolution: Invalid

> CLONE - StreamExecution should discard unneeded metadata
> 
>
> Key: SPARK-18156
> URL: https://issues.apache.org/jira/browse/SPARK-18156
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Sunil Kumar
>Assignee: Frederick Reiss
> Fix For: 2.1.0, 2.0.1
>
>
> The StreamExecution maintains a write-ahead log of batch metadata in order to 
> allow repeating previously in-flight batches if the driver is restarted. 
> StreamExecution does not garbage-collect or compact this log in any way.
> Since the log is implemented with HDFSMetadataLog, these files will consume 
> memory on the HDFS NameNode. Specifically, each log file will consume about 
> 300 bytes of NameNode memory (150 bytes for the inode and 150 bytes for the 
> block of file contents; see 
> [https://www.cloudera.com/documentation/enterprise/latest/topics/admin_nn_memory_config.html].
>  An application with a 100 msec batch interval will increase the NameNode's 
> heap usage by about 250MB per day.
> There is also the matter of recovery. StreamExecution reads its entire log 
> when restarting. This read operation will be very expensive if the log 
> contains millions of entries spread over millions of files.
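
As a quick sanity check of the heap figure above (taking the ~300 bytes per log file estimate at face value):

{code:python}
# One log file per batch at a 100 ms batch interval.
batches_per_day = 24 * 60 * 60 * 1000 // 100   # 864,000 batches/day
heap_bytes_per_day = batches_per_day * 300     # ~300 bytes of NameNode heap per file
print(heap_bytes_per_day / 1024 ** 2)          # ~247 MB, i.e. "about 250MB per day"
{code}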



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18147) Broken Spark SQL Codegen

2016-10-28 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15614959#comment-15614959
 ] 

Kazuaki Ishizaki commented on SPARK-18147:
--

This also causes the same exception.

{code:java}
class ComplexResultAgg[B: TypeTag, C: TypeTag](val zero: B, result: C) extends 
Aggregator[Row, B, C] {
  override def reduce(b: B, input: Row): B = b
  override def merge(b1: B, b2: B): B = b1
  override def finish(reduction: B): C = result
  override def bufferEncoder: Encoder[B] = ExpressionEncoder[B]()
  override def outputEncoder: Encoder[C] = ExpressionEncoder[C]()
}
case class Struct2(i: Int)
case class Struct3(a: Struct2, b: Struct2)
Seq(1).toDF("x").groupBy("x")
  .agg(new ComplexResultAgg("a", Struct3(null, null)).toColumn).show
{code}

In this case, there is an issue in the codegen for {{AssertNotNull}}: {{isNull1}} 
({{${childGen.isNull}}}) is out of scope when {{ctx.splitExpression}} is called 
by the caller of {{AssertNotNull.doGenCode}}. Here, the caller is 
{{CreateStruct.doGenCode}}, which calls {{ctx.splitExpression}}.


> Broken Spark SQL Codegen
> 
>
> Key: SPARK-18147
> URL: https://issues.apache.org/jira/browse/SPARK-18147
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: koert kuipers
>Priority: Critical
>
> this is me on purpose trying to break spark sql codegen to uncover potential 
> issues, by creating arbitrately complex data structures using primitives, 
> strings, basic collections (map, seq, option), tuples, and case classes.
> first example: nested case classes
> code:
> {noformat}
> class ComplexResultAgg[B: TypeTag, C: TypeTag](val zero: B, result: C) 
> extends Aggregator[Row, B, C] {
>   override def reduce(b: B, input: Row): B = b
>   override def merge(b1: B, b2: B): B = b1
>   override def finish(reduction: B): C = result
>   override def bufferEncoder: Encoder[B] = ExpressionEncoder[B]()
>   override def outputEncoder: Encoder[C] = ExpressionEncoder[C]()
> }
> case class Struct2(d: Double = 0.0, s1: Seq[Double] = Seq.empty, s2: 
> Seq[Long] = Seq.empty)
> case class Struct3(a: Struct2 = Struct2(), b: Struct2 = Struct2())
> val df1 = Seq(("a", "aa"), ("a", "aa"), ("b", "b"), ("b", null)).toDF("x", 
> "y").groupBy("x").agg(
>   new ComplexResultAgg("boo", Struct3()).toColumn
> )
> df1.printSchema
> df1.show
> {noformat}
> the result is:
> {noformat}
> [info]   Cause: java.util.concurrent.ExecutionException: java.lang.Exception: 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 33, Column 12: Expression "isNull1" is not an rvalue
> [info] /* 001 */ public java.lang.Object generate(Object[] references) {
> [info] /* 002 */   return new SpecificMutableProjection(references);
> [info] /* 003 */ }
> [info] /* 004 */
> [info] /* 005 */ class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
> [info] /* 006 */
> [info] /* 007 */   private Object[] references;
> [info] /* 008 */   private MutableRow mutableRow;
> [info] /* 009 */   private Object[] values;
> [info] /* 010 */   private java.lang.String errMsg;
> [info] /* 011 */   private Object[] values1;
> [info] /* 012 */   private java.lang.String errMsg1;
> [info] /* 013 */   private boolean[] argIsNulls;
> [info] /* 014 */   private scala.collection.Seq argValue;
> [info] /* 015 */   private java.lang.String errMsg2;
> [info] /* 016 */   private boolean[] argIsNulls1;
> [info] /* 017 */   private scala.collection.Seq argValue1;
> [info] /* 018 */   private java.lang.String errMsg3;
> [info] /* 019 */   private java.lang.String errMsg4;
> [info] /* 020 */   private Object[] values2;
> [info] /* 021 */   private java.lang.String errMsg5;
> [info] /* 022 */   private boolean[] argIsNulls2;
> [info] /* 023 */   private scala.collection.Seq argValue2;
> [info] /* 024 */   private java.lang.String errMsg6;
> [info] /* 025 */   private boolean[] argIsNulls3;
> [info] /* 026 */   private scala.collection.Seq argValue3;
> [info] /* 027 */   private java.lang.String errMsg7;
> [info] /* 028 */   private boolean isNull_0;
> [info] /* 029 */   private InternalRow value_0;
> [info] /* 030 */
> [info] /* 031 */   private void apply_1(InternalRow i) {
> [info] /* 032 */
> [info] /* 033 */ if (isNull1) {
> [info] /* 034 */   throw new RuntimeException(errMsg3);
> [info] /* 035 */ }
> [info] /* 036 */
> [info] /* 037 */ boolean isNull24 = false;
> [info] /* 038 */ final com.tresata.spark.sql.Struct2 value24 = isNull24 ? 
> null : (com.tresata.spark.sql.Struct2) value1.a();
> [info] /* 039 */ isNull24 = value24 == null;
> [info] /* 040 */
> [info] /* 041 */ boolean isNull23 = isNull24;
> [info] /* 042 */ final scala.collection.Seq value23 = isNull23 ? null : 
> (scala

[jira] [Commented] (SPARK-11278) PageRank fails with unified memory manager

2016-10-28 Thread Vivek Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15614954#comment-15614954
 ] 

Vivek Gupta commented on SPARK-11278:
-

We are facing an issue with Spark 1.6.0 where performance degrades severely 
(with extensive shuffle spill) when using the unified memory manager 
(spark.memory.fraction = 0.9 and spark.memory.storageFraction = 0.0).

However, the same application performs much better after switching to legacy 
mode (spark.memory.useLegacyMode=true, spark.shuffle.memoryFraction=0.9, 
spark.storage.memoryFraction=0.0).

Is this related to this issue?
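
For completeness, a minimal sketch of the two configurations being compared (pyspark shown for illustration; the settings and values are the ones quoted above):

{code:python}
from pyspark import SparkConf, SparkContext

# Unified memory manager settings reported to cause heavy shuffle spill.
unified = (SparkConf()
           .set("spark.memory.fraction", "0.9")
           .set("spark.memory.storageFraction", "0.0"))

# Legacy-mode settings reported to perform well for the same application.
legacy = (SparkConf()
          .set("spark.memory.useLegacyMode", "true")
          .set("spark.shuffle.memoryFraction", "0.9")
          .set("spark.storage.memoryFraction", "0.0"))

sc = SparkContext(conf=legacy)  # swap in `unified` to compare
{code}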

> PageRank fails with unified memory manager
> --
>
> Key: SPARK-11278
> URL: https://issues.apache.org/jira/browse/SPARK-11278
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, Spark Core
>Affects Versions: 1.5.1
>Reporter: Nishkam Ravi
>Assignee: Andrew Or
>Priority: Critical
> Attachments: executor_log_legacyModeTrue.html, 
> executor_logs_legacyModeFalse.html
>
>
> PageRank (6-nodes, 32GB input) runs very slow and eventually fails with 
> ExecutorLostFailure. Traced it back to the 'unified memory manager' commit 
> from Oct 13th. Took a quick look at the code and couldn't see the problem 
> (changes look pretty good). cc'ing [~andrewor14][~vanzin] who may be able to 
> spot the problem quickly. Can be reproduced by running PageRank on a large 
> enough input dataset if needed. Sorry for not being of much help here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18159) Stand-alone cluster, supervised app: restart of worker hosting the driver causes app to run twice

2016-10-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15614955#comment-15614955
 ] 

Sean Owen commented on SPARK-18159:
---

Yes, that is at least the intended behavior. The new driver can't attach to the 
existing executors. The old executors, however, should die (eventually). You can 
kill them manually with YARN.

> Stand-alone cluster, supervised app: restart of worker hosting the driver 
> causes app to run twice
> -
>
> Key: SPARK-18159
> URL: https://issues.apache.org/jira/browse/SPARK-18159
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.2
>Reporter: Stephan Kepser
>Priority: Critical
>
> We use Spark in stand-alone cluster mode with HA with three master nodes. All 
> aps are submitted using
> > spark-submit --deploy-mode cluster --supervised --master ...
> We have many apps running. 
> The deploy-mode cluster is needed to prevent the drivers of the apps to be 
> all placed on the active master. 
> If a worker goes down that hosts a driver, the following happens:
> * the driver is started on another worker node
> * the new driver does not connect to the still running app
> * the new driver starts a new instance of the running app
> * there are now two instances of the app running, 
>   * one with an attached new driver,
>   * one without a driver.
> * the old instance of the app cannot effectively be stop. I.e., it can be 
> kill via the UI, but is immediately restarted.
> Iterating this process causes more and more instances of the app running.
> To get the effect both options --deploy-mode cluster and --supervised are 
> required.  
> The only remedy we know of is reboot all linux nodes the cluster runs on.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17940) Typo in LAST function error message

2016-10-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17940.
---
Resolution: Duplicate

> Typo in LAST function error message
> ---
>
> Key: SPARK-17940
> URL: https://issues.apache.org/jira/browse/SPARK-17940
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shuai Lin
>Priority: Trivial
>
> https://github.com/apache/spark/blob/v2.0.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala#L40
> {code}
>   throw new AnalysisException("The second argument of First should be a 
> boolean literal.")
> {code} 
> "First" should be "Last".
> Also the usage string can be improved to match the FIRST function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18150) Spark 2.* failes to create partitions for avro files

2016-10-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15614937#comment-15614937
 ] 

Sean Owen commented on SPARK-18150:
---

Whoa, [~sunilsbjoshi], I don't understand why you copied a bunch of old 
issues; I'm going to close them. Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark. You 
should not set Blocker priority.

It's not even clear this is a Spark problem. It sounds like you're using a 
third-party package, and you haven't specified what your code does.

>  Spark 2.* failes to create partitions for avro files
> -
>
> Key: SPARK-18150
> URL: https://issues.apache.org/jira/browse/SPARK-18150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Reporter: Sunil Kumar
>Priority: Blocker
>
> I am using Apache Spark 2.0.1 for processing the Grid HDFS Avro file, however 
> I don't see spark distributing the job into different tasks instead it uses 
> single task and all the operations (read, load, filter, show ) are done in a 
> sequence using same task.
> This means I am not able to leverage distributed parallel processing.
> I tried the same operation on JSON file on HDFS, it works good, means the job 
> gets distributed into multiple tasks and partition. I see parallelism.
> I then tested the same on Spark 1.6, there it does the partitioning. Looks 
> like there is a bug in Spark 2.* version. If not can some one help me know 
> how to achieve the same on Avro file, do I need to do something special for 
> Avro files ?
> Note:
> I explored spark setting: "spark.default.parallelism",  
> "spark.sql.files.maxPartitionBytes", "--num-executors" and 
> "spark.sql.shuffle.partitions". These were not of much help. 
> "spark.default.parallelism", ensured to have multiple tasks however a single 
> task ended up performing all the operation.
> I am using com.databricks.spark.avro (3.0.1) for Spark 2.0.1.
> Thanks,
> Sunil



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18133) Python ML Pipeline Example has syntax errors

2016-10-28 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-18133.
-
   Resolution: Fixed
 Assignee: Jagadeesan A S
Fix Version/s: 2.1.0

> Python ML Pipeline Example has syntax errors
> 
>
> Key: SPARK-18133
> URL: https://issues.apache.org/jira/browse/SPARK-18133
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, ML
>Affects Versions: 2.0.1
> Environment: OS X
>Reporter: Nirmal Fernando
>Assignee: Jagadeesan A S
>Priority: Minor
>  Labels: easyfix
> Fix For: 2.1.0
>
>
> $ ./bin/spark-submit examples/src/main/python/ml/pipeline_example.py
>   File 
> "/spark-2.0.0-bin-hadoop2.7/examples/src/main/python/ml/pipeline_example.py", 
> line 38
> (0L, "a b c d e spark", 1.0),
>   ^
> SyntaxError: invalid syntax
> Removing 'L' from all occurrences resolves the issue.
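
A hedged sketch of the fix, assuming the example builds its training DataFrame from literal tuples (the exact rows in your copy of pipeline_example.py may differ slightly):

{code:python}
# Python 3 has no long-literal 'L' suffix, so the example data must use plain ints:
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),   # was: (0L, "a b c d e spark", 1.0)
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])
{code}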



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18162) SparkEnv.get.metricsSystem in spark-shell results in error: missing or invalid dependency detected while loading class file 'MetricsSystem.class'

2016-10-28 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-18162:
---

 Summary: SparkEnv.get.metricsSystem in spark-shell results in 
error: missing or invalid dependency detected while loading class file 
'MetricsSystem.class'
 Key: SPARK-18162
 URL: https://issues.apache.org/jira/browse/SPARK-18162
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Jacek Laskowski
Priority: Minor


This is with the build today from master.

{code}
$ ./bin/spark-shell --version
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0-SNAPSHOT
  /_/

Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_102
Branch master
Compiled by user jacek on 2016-10-28T04:05:11Z
Revision ab5f938bc7c3c9b137d63e479fced2b7e9c9d75b
Url https://github.com/apache/spark.git
Type --help for more information.

$ ./bin/spark-shell

scala> SparkEnv.get.metricsSystem
error: missing or invalid dependency detected while loading class file 
'MetricsSystem.class'.
Could not access term eclipse in package org,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the 
problematic classpath.)
A full rebuild may help if 'MetricsSystem.class' was compiled against an 
incompatible version of org.
error: missing or invalid dependency detected while loading class file 
'MetricsSystem.class'.
Could not access term jetty in value org.eclipse,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the 
problematic classpath.)
A full rebuild may help if 'MetricsSystem.class' was compiled against an 
incompatible version of org.eclipse.

scala> spark.version
res3: String = 2.1.0-SNAPSHOT
{code}

I could not find any information about how to set it up in the [official 
documentation|http://spark.apache.org/docs/latest/monitoring.html#metrics].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18148) Misleading Error Message for Aggregation Without Window/GroupBy

2016-10-28 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15614773#comment-15614773
 ] 

Jiang Xingbo commented on SPARK-18148:
--

[~pat.mcdono...@databricks.com] I've reproduced this bug, will submit a PR to 
resolve it. Thanks!

> Misleading Error Message for Aggregation Without Window/GroupBy
> ---
>
> Key: SPARK-18148
> URL: https://issues.apache.org/jira/browse/SPARK-18148
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Databricks
>Reporter: Pat McDonough
>
> The following error message points to a random column I'm not actually using 
> in my query, making it hard to diagnose.
> {code}
> org.apache.spark.sql.AnalysisException: expression '`randomColumn`' is 
> neither present in the group by, nor is it an aggregate function. Add to 
> group by or wrap in first() (or first_value) if you don't care which value 
> you get.;
> {code}
> Note in the code below, I forgot to add {{.over(weeklyWindow)}} in the line 
> for {{withColumn("user_count"...}}
> {code}
> spark.read.load("/some-data")
>   .withColumn("date_dt", to_date($"date"))
>   .withColumn("year", year($"date_dt"))
>   .withColumn("week", weekofyear($"date_dt"))
>   .withColumn("user_count", count($"userId"))
>   .withColumn("daily_max_in_week", max($"user_count").over(weeklyWindow))
> )
> {code}
> CC: [~marmbrus]
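
For comparison, a pyspark sketch of the same idea: an aggregate like {{count}} must either go through {{groupBy}}/{{agg}} or be evaluated with {{.over(window)}}. Column names follow the snippet above; the window definition is an assumption.

{code:python}
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Assumed definition of the weekly window referenced in the snippet above.
weekly_window = Window.partitionBy("year", "week")

df = (spark.read.load("/some-data")
      .withColumn("date_dt", F.to_date("date"))
      .withColumn("year", F.year("date_dt"))
      .withColumn("week", F.weekofyear("date_dt"))
      # the fix: the aggregate carries a window instead of being left bare
      .withColumn("user_count", F.count("userId").over(weekly_window))
      .withColumn("daily_max_in_week", F.max("user_count").over(weekly_window)))
{code}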



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18161) Default PickleSerializer pickle protocol doesn't handle > 4GB objects

2016-10-28 Thread Sloane Simmons (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15614752#comment-15614752
 ] 

Sloane Simmons commented on SPARK-18161:


I changed the importance from Minor to Major since there is a definite impact 
on PySpark users (programs serializing large objects crash without this 
change), but there is a workaround (create a SparkContext with a serializer 
that uses {{pickle.HIGHEST_PROTOCOL}} and monkey-patch Broadcast#dumps to do 
the same).
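
A hedged sketch of that workaround (without the Broadcast#dumps part), assuming Python 3.4+; the class name is made up and this is not a claim about pyspark internals:

{code:python}
import pickle

from pyspark import SparkContext
from pyspark.serializers import FramedSerializer

class Pickle4Serializer(FramedSerializer):
    """Pickle serializer that always uses the highest available protocol (4 on Python 3.4+)."""
    def dumps(self, obj):
        return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    def loads(self, obj):
        return pickle.loads(obj)

# Pass the serializer explicitly instead of relying on the default PickleSerializer.
sc = SparkContext(appName="large-broadcast", serializer=Pickle4Serializer())
{code}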

> Default PickleSerializer pickle protocol doesn't handle > 4GB objects
> -
>
> Key: SPARK-18161
> URL: https://issues.apache.org/jira/browse/SPARK-18161
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Sloane Simmons
>
> When broadcasting a fairly large numpy matrix in a Spark 2.0.1 program, there 
> is an error serializing the object with:
> {{OverflowError: cannot serialize a bytes object larger than 4 GiB}}
> in the stack trace.
> This is because Python's pickle serialization (with protocol <= 3) uses a 
> 32-bit integer for the object size, and so cannot handle objects larger than 
> 4 gigabytes.  This was changed in Protocol 4 of pickle 
> (https://www.python.org/dev/peps/pep-3154/#bit-opcodes-for-large-objects) and 
> is available in Python 3.4+.  
> I would like to use this protocol for broadcasting and in the default 
> PickleSerializer where available to make pyspark more robust to broadcasting 
> large variables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18161) Default PickleSerializer pickle protocol doesn't handle > 4GB objects

2016-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15614741#comment-15614741
 ] 

Apache Spark commented on SPARK-18161:
--

User 'singularperturbation' has created a pull request for this issue:
https://github.com/apache/spark/pull/15670

> Default PickleSerializer pickle protocol doesn't handle > 4GB objects
> -
>
> Key: SPARK-18161
> URL: https://issues.apache.org/jira/browse/SPARK-18161
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Sloane Simmons
>
> When broadcasting a fairly large numpy matrix in a Spark 2.0.1 program, there 
> is an error serializing the object with:
> {{OverflowError: cannot serialize a bytes object larger than 4 GiB}}
> in the stack trace.
> This is because Python's pickle serialization (with protocol <= 3) uses a 
> 32-bit integer for the object size, and so cannot handle objects larger than 
> 4 gigabytes.  This was changed in Protocol 4 of pickle 
> (https://www.python.org/dev/peps/pep-3154/#bit-opcodes-for-large-objects) and 
> is available in Python 3.4+.  
> I would like to use this protocol for broadcasting and in the default 
> PickleSerializer where available to make pyspark more robust to broadcasting 
> large variables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18161) Default PickleSerializer pickle protocol doesn't handle > 4GB objects

2016-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18161:


Assignee: Apache Spark

> Default PickleSerializer pickle protocol doesn't handle > 4GB objects
> -
>
> Key: SPARK-18161
> URL: https://issues.apache.org/jira/browse/SPARK-18161
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Sloane Simmons
>Assignee: Apache Spark
>
> When broadcasting a fairly large numpy matrix in a Spark 2.0.1 program, there 
> is an error serializing the object with:
> {{OverflowError: cannot serialize a bytes object larger than 4 GiB}}
> in the stack trace.
> This is because Python's pickle serialization (with protocol <= 3) uses a 
> 32-bit integer for the object size, and so cannot handle objects larger than 
> 4 gigabytes.  This was changed in Protocol 4 of pickle 
> (https://www.python.org/dev/peps/pep-3154/#bit-opcodes-for-large-objects) and 
> is available in Python 3.4+.  
> I would like to use this protocol for broadcasting and in the default 
> PickleSerializer where available to make pyspark more robust to broadcasting 
> large variables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18161) Default PickleSerializer pickle protocol doesn't handle > 4GB objects

2016-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18161:


Assignee: (was: Apache Spark)

> Default PickleSerializer pickle protocol doesn't handle > 4GB objects
> -
>
> Key: SPARK-18161
> URL: https://issues.apache.org/jira/browse/SPARK-18161
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Sloane Simmons
>
> When broadcasting a fairly large numpy matrix in a Spark 2.0.1 program, there 
> is an error serializing the object with:
> {{OverflowError: cannot serialize a bytes object larger than 4 GiB}}
> in the stack trace.
> This is because Python's pickle serialization (with protocol <= 3) uses a 
> 32-bit integer for the object size, and so cannot handle objects larger than 
> 4 gigabytes.  This was changed in Protocol 4 of pickle 
> (https://www.python.org/dev/peps/pep-3154/#bit-opcodes-for-large-objects) and 
> is available in Python 3.4+.  
> I would like to use this protocol for broadcasting and in the default 
> PickleSerializer where available to make pyspark more robust to broadcasting 
> large variables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18160) SparkContext.addFile doesn't work in yarn-cluster mode

2016-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18160:


Assignee: Apache Spark

> SparkContext.addFile doesn't work in yarn-cluster mode
> --
>
> Key: SPARK-18160
> URL: https://issues.apache.org/jira/browse/SPARK-18160
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Jeff Zhang
>Assignee: Apache Spark
>Priority: Critical
>
> {noformat}
> bin/spark-submit --class org.apache.spark.examples.SparkPi --master 
> yarn-cluster --files /usr/spark-client/conf/hive-site.xml 
> examples/target/original-spark-examples_2.11.jar
> {noformat}
> The above command can reproduce the error as following in a multiple node 
> cluster. To be noticed, this issue only happens in multiple node cluster. As 
> in the single node cluster, AM use the same local filesystem as the the 
> driver.
> {noformat}
> 16/10/28 07:21:42 ERROR SparkContext: Error initializing SparkContext.
> java.io.FileNotFoundException: File file:/usr/spark-client/conf/hive-site.xml 
> does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:537)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:750)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:527)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
>   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1443)
>   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1415)
>   at 
> org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462)
>   at 
> org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.apache.spark.SparkContext.(SparkContext.scala:462)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2296)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:843)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:835)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:835)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18160) SparkContext.addFile doesn't work in yarn-cluster mode

2016-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18160:


Assignee: (was: Apache Spark)

> SparkContext.addFile doesn't work in yarn-cluster mode
> --
>
> Key: SPARK-18160
> URL: https://issues.apache.org/jira/browse/SPARK-18160
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Jeff Zhang
>Priority: Critical
>
> {noformat}
> bin/spark-submit --class org.apache.spark.examples.SparkPi --master 
> yarn-cluster --files /usr/spark-client/conf/hive-site.xml 
> examples/target/original-spark-examples_2.11.jar
> {noformat}
> The above command can reproduce the error as following in a multiple node 
> cluster. To be noticed, this issue only happens in multiple node cluster. As 
> in the single node cluster, AM use the same local filesystem as the the 
> driver.
> {noformat}
> 16/10/28 07:21:42 ERROR SparkContext: Error initializing SparkContext.
> java.io.FileNotFoundException: File file:/usr/spark-client/conf/hive-site.xml 
> does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:537)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:750)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:527)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
>   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1443)
>   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1415)
>   at 
> org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462)
>   at 
> org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.apache.spark.SparkContext.(SparkContext.scala:462)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2296)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:843)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:835)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:835)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18160) SparkContext.addFile doesn't work in yarn-cluster mode

2016-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15614668#comment-15614668
 ] 

Apache Spark commented on SPARK-18160:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/15669

> SparkContext.addFile doesn't work in yarn-cluster mode
> --
>
> Key: SPARK-18160
> URL: https://issues.apache.org/jira/browse/SPARK-18160
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Jeff Zhang
>Priority: Critical
>
> {noformat}
> bin/spark-submit --class org.apache.spark.examples.SparkPi --master 
> yarn-cluster --files /usr/spark-client/conf/hive-site.xml 
> examples/target/original-spark-examples_2.11.jar
> {noformat}
> The above command can reproduce the error as following in a multiple node 
> cluster. To be noticed, this issue only happens in multiple node cluster. As 
> in the single node cluster, AM use the same local filesystem as the the 
> driver.
> {noformat}
> 16/10/28 07:21:42 ERROR SparkContext: Error initializing SparkContext.
> java.io.FileNotFoundException: File file:/usr/spark-client/conf/hive-site.xml 
> does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:537)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:750)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:527)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
>   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1443)
>   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1415)
>   at 
> org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462)
>   at 
> org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.apache.spark.SparkContext.(SparkContext.scala:462)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2296)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:843)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:835)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:835)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18161) Default PickleSerializer pickle protocol doesn't handle > 4GB objects

2016-10-28 Thread Sloane Simmons (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sloane Simmons updated SPARK-18161:
---
Priority: Major  (was: Minor)

> Default PickleSerializer pickle protocol doesn't handle > 4GB objects
> -
>
> Key: SPARK-18161
> URL: https://issues.apache.org/jira/browse/SPARK-18161
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Sloane Simmons
>
> When broadcasting a fairly large numpy matrix in a Spark 2.0.1 program, there 
> is an error serializing the object with:
> {{OverflowError: cannot serialize a bytes object larger than 4 GiB}}
> in the stack trace.
> This is because Python's pickle serialization (with protocol <= 3) uses a 
> 32-bit integer for the object size, and so cannot handle objects larger than 
> 4 gigabytes.  This was changed in Protocol 4 of pickle 
> (https://www.python.org/dev/peps/pep-3154/#bit-opcodes-for-large-objects) and 
> is available in Python 3.4+.  
> I would like to use this protocol for broadcasting and in the default 
> PickleSerializer where available to make pyspark more robust to broadcasting 
> large variables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18161) Default PickleSerializer pickle protocol doesn't handle > 4GB objects

2016-10-28 Thread Sloane Simmons (JIRA)
Sloane Simmons created SPARK-18161:
--

 Summary: Default PickleSerializer pickle protocol doesn't handle > 
4GB objects
 Key: SPARK-18161
 URL: https://issues.apache.org/jira/browse/SPARK-18161
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.0.1, 2.0.0
Reporter: Sloane Simmons
Priority: Minor


When broadcasting a fairly large numpy matrix in a Spark 2.0.1 program, there 
is an error serializing the object with:
{{OverflowError: cannot serialize a bytes object larger than 4 GiB}}
in the stack trace.

This is because Python's pickle serialization (with protocol <= 3) uses a 
32-bit integer for the object size, and so cannot handle objects larger than 4 
gigabytes.  This was changed in Protocol 4 of pickle 
(https://www.python.org/dev/peps/pep-3154/#bit-opcodes-for-large-objects) and 
is available in Python 3.4+.  

I would like to use this protocol for broadcasting and in the default 
PickleSerializer where available to make pyspark more robust to broadcasting 
large variables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18160) SparkContext.addFile doesn't work in yarn-cluster mode

2016-10-28 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-18160:
--

 Summary: SparkContext.addFile doesn't work in yarn-cluster mode
 Key: SPARK-18160
 URL: https://issues.apache.org/jira/browse/SPARK-18160
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 2.0.1, 1.6.2
Reporter: Jeff Zhang
Priority: Critical



{noformat}
bin/spark-submit --class org.apache.spark.examples.SparkPi --master 
yarn-cluster --files /usr/spark-client/conf/hive-site.xml 
examples/target/original-spark-examples_2.11.jar
{noformat}
The above command reproduces the following error on a multi-node cluster. Note 
that this issue only happens on a multi-node cluster; on a single-node cluster, 
the AM uses the same local filesystem as the driver.
{noformat}
16/10/28 07:21:42 ERROR SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: File file:/usr/spark-client/conf/hive-site.xml 
does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:537)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:750)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:527)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1443)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1415)
at 
org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462)
at 
org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.SparkContext.(SparkContext.scala:462)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2296)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:843)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:835)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:835)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
{noformat}
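
For context, the addFile/SparkFiles pattern that the reproduction relies on looks roughly like this (a hedged pyspark sketch; the path is the one from the command above):

{code:python}
from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="addFile-repro")

# In yarn-cluster mode this path is resolved on the node running the driver (the AM),
# which is why a file that only exists on the submitting host is not found.
sc.addFile("/usr/spark-client/conf/hive-site.xml")

def read_conf(_):
    # Executors read the distributed copy via SparkFiles.
    with open(SparkFiles.get("hive-site.xml")) as f:
        return len(f.read())

print(sc.parallelize([0]).map(read_conf).collect())
{code}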



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18109) Log instrumentation in GMM

2016-10-28 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-18109.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> Log instrumentation in GMM
> --
>
> Key: SPARK-18109
> URL: https://issues.apache.org/jira/browse/SPARK-18109
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: zhengruifeng
>Assignee: zhengruifeng
> Fix For: 2.1.0
>
>
> Add log instrumentation in GMM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18159) Stand-alone cluster, supervised app: restart of worker hosting the driver causes app to run twice

2016-10-28 Thread Stephan Kepser (JIRA)
Stephan Kepser created SPARK-18159:
--

 Summary: Stand-alone cluster, supervised app: restart of worker 
hosting the driver causes app to run twice
 Key: SPARK-18159
 URL: https://issues.apache.org/jira/browse/SPARK-18159
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.2
Reporter: Stephan Kepser
Priority: Critical


We use Spark in stand-alone cluster mode with HA and three master nodes. All 
apps are submitted using
> spark-submit --deploy-mode cluster --supervised --master ...
We have many apps running.
Deploy-mode cluster is needed to prevent the drivers of the apps from all being 
placed on the active master.

If a worker that hosts a driver goes down, the following happens:
* the driver is started on another worker node
* the new driver does not connect to the still-running app
* the new driver starts a new instance of the app
* there are now two instances of the app running, 
  * one with an attached new driver,
  * one without a driver.
* the old instance of the app cannot effectively be stopped; it can be killed 
via the UI, but it is immediately restarted.

Iterating this process leaves more and more instances of the app running.
Both options --deploy-mode cluster and --supervised are required to trigger the 
effect.
The only remedy we know of is to reboot all Linux nodes the cluster runs on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18150) Spark 2.* failes to create partitions for avro files

2016-10-28 Thread Sunil Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15614587#comment-15614587
 ] 

Sunil Kumar commented on SPARK-18150:
-

I am using Spark 2.0.0.24 and spark-avro 2.11-3.0.1.

>  Spark 2.* failes to create partitions for avro files
> -
>
> Key: SPARK-18150
> URL: https://issues.apache.org/jira/browse/SPARK-18150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Reporter: Sunil Kumar
>Priority: Blocker
>
> I am using Apache Spark 2.0.1 for processing the Grid HDFS Avro file, however 
> I don't see spark distributing the job into different tasks instead it uses 
> single task and all the operations (read, load, filter, show ) are done in a 
> sequence using same task.
> This means I am not able to leverage distributed parallel processing.
> I tried the same operation on JSON file on HDFS, it works good, means the job 
> gets distributed into multiple tasks and partition. I see parallelism.
> I then tested the same on Spark 1.6, there it does the partitioning. Looks 
> like there is a bug in Spark 2.* version. If not can some one help me know 
> how to achieve the same on Avro file, do I need to do something special for 
> Avro files ?
> Note:
> I explored spark setting: "spark.default.parallelism",  
> "spark.sql.files.maxPartitionBytes", "--num-executors" and 
> "spark.sql.shuffle.partitions". These were not of much help. 
> "spark.default.parallelism", ensured to have multiple tasks however a single 
> task ended up performing all the operation.
> I am using com.databricks.spark.avro (3.0.1) for Spark 2.0.1.
> Thanks,
> Sunil



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18150) Spark 2.* failes to create partitions for avro files

2016-10-28 Thread Sunil Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Kumar updated SPARK-18150:

Description: 
I am using Apache Spark 2.0.1 to process a Grid HDFS Avro file, but I don't see 
Spark distributing the job into different tasks; instead it uses a single task, 
and all the operations (read, load, filter, show) are done in sequence by that 
same task.

This means I am not able to leverage distributed parallel processing.

I tried the same operation on a JSON file on HDFS and it works well: the job 
gets distributed into multiple tasks and partitions, and I see parallelism.

I then tested the same on Spark 1.6, where it does do the partitioning. It looks 
like there is a bug in the Spark 2.* versions. If not, can someone help me 
understand how to achieve the same on an Avro file? Do I need to do something 
special for Avro files?


Note:
I explored the Spark settings "spark.default.parallelism", 
"spark.sql.files.maxPartitionBytes", "--num-executors" and 
"spark.sql.shuffle.partitions". These were not of much help. 
"spark.default.parallelism" ensured multiple tasks, but a single task still 
ended up performing all the operations.

I am using com.databricks.spark.avro (3.0.1) with Spark 2.0.1.

Thanks,
Sunil

  was:
This is an umbrella ticket to track various things that are required in order 
to have the engine for structured streaming run non-stop in production.



>  Spark 2.* fails to create partitions for avro files
> -
>
> Key: SPARK-18150
> URL: https://issues.apache.org/jira/browse/SPARK-18150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Reporter: Sunil Kumar
>Priority: Blocker
>
> I am using Apache Spark 2.0.1 to process an Avro file on our grid's HDFS; however,
> I don't see Spark distributing the job into different tasks. Instead it uses a
> single task, and all the operations (read, load, filter, show) are executed in
> sequence within that same task.
> This means I am not able to leverage distributed parallel processing.
> I tried the same operations on a JSON file on HDFS and they work well: the job
> gets distributed into multiple tasks and partitions, and I see parallelism.
> I then tested the same on Spark 1.6, where the partitioning does happen. It looks
> like there is a bug in the Spark 2.* versions. If not, can someone help me understand
> how to achieve the same on Avro files? Do I need to do something special for
> Avro files?
> Note:
> I explored the Spark settings "spark.default.parallelism",
> "spark.sql.files.maxPartitionBytes", "--num-executors" and
> "spark.sql.shuffle.partitions". These were not of much help.
> "spark.default.parallelism" ensured there were multiple tasks, but a single
> task still ended up performing all the operations.
> I am using com.databricks.spark.avro (3.0.1) with Spark 2.0.1.
> Thanks,
> Sunil



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18124) Implement watermarking for handling late data

2016-10-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-18124:


Assignee: Michael Armbrust  (was: Tathagata Das)

> Implement watermarking for handling late data
> -
>
> Key: SPARK-18124
> URL: https://issues.apache.org/jira/browse/SPARK-18124
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Michael Armbrust
>
> Whenever we aggregate data by event time, we want to account for data that is late 
> and out-of-order in terms of its event time. Since we keep aggregates keyed by 
> event time as state, the state will grow unbounded if we keep around all old 
> aggregates in an attempt to handle arbitrarily late data. Since the state is 
> stored in memory, we have to prevent this unbounded state from building up. 
> Hence, we need a watermarking mechanism by which we mark data that is 
> older than a threshold as “too late” and stop updating the aggregates with 
> it. This would allow us to remove old aggregates that are never going to be 
> updated, thus bounding the size of the state.
> Here is the design doc - 
> https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6ZLIS03xhkfCQ/edit?usp=sharing
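
A conceptual sketch of the mechanism described above (names and types are illustrative, not the eventual Spark API): the maximum observed event time minus a lateness threshold defines the watermark; rows behind the watermark stop updating aggregates, and window state that ends before the watermark can be evicted.

{noformat}
// Illustrative only: not Spark code, just the bookkeeping the description implies.
case class WindowState(windowEndMs: Long, count: Long)

class Watermarker(lateThresholdMs: Long) {
  private var maxEventTimeMs = Long.MinValue

  def observe(eventTimeMs: Long): Unit =
    maxEventTimeMs = math.max(maxEventTimeMs, eventTimeMs)

  // No watermark until at least one event has been observed.
  def watermarkMs: Long =
    if (maxEventTimeMs == Long.MinValue) Long.MinValue
    else maxEventTimeMs - lateThresholdMs

  // Data older than the watermark is "too late" and no longer updates aggregates.
  def isTooLate(eventTimeMs: Long): Boolean = eventTimeMs < watermarkMs

  // Windows that closed before the watermark can never receive updates again,
  // so their state is safe to drop; this is what bounds the state size.
  def evict(state: Seq[WindowState]): Seq[WindowState] =
    state.filterNot(_.windowEndMs < watermarkMs)
}
{noformat}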



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator

2016-10-28 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15614576#comment-15614576
 ] 

Rémi Delassus commented on SPARK-17055:
---

I had an issue that could be solved by this kind of technique: each sample has a 
timestamp, and the error can only be computed on a full month of data. Thus I 
need whole months of data to be distributed across folds, not each sample individually.

But in my opinion this is not the right way to solve it. Since there is an 
infinite number of ways to split the data, I think we should be able to pass 
the split method as an argument to the CrossValidator. The method described 
here could then be implemented, as well as any other.
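
For reference, a minimal sketch of label-aware fold assignment on a DataFrame (a hypothetical helper, not part of MLlib): every row sharing the same group label is routed to exactly one fold, so no label ends up in both the training and the validation split.

{noformat}
import org.apache.spark.sql.{DataFrame, functions => F}

// Hypothetical helper: hash each distinct label into one of k folds.
def labelKFoldSplits(df: DataFrame, labelCol: String, k: Int): Seq[(DataFrame, DataFrame)] = {
  val withFold = df.withColumn("fold", F.pmod(F.hash(F.col(labelCol)), F.lit(k)))
  (0 until k).map { i =>
    val training   = withFold.filter(F.col("fold") =!= i).drop("fold")
    val validation = withFold.filter(F.col("fold") === i).drop("fold")
    (training, validation)
  }
}
{noformat}

A pluggable split strategy, as suggested above, could accept exactly this kind of function in place of the current random k-fold logic.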

> add labelKFold to CrossValidator
> 
>
> Key: SPARK-17055
> URL: https://issues.apache.org/jira/browse/SPARK-17055
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Vincent
>Priority: Minor
>
> The current CrossValidator only supports k-fold, which randomly divides all the 
> samples into k groups. But when data is gathered from different subjects and we 
> want to avoid over-fitting, we want to hold out samples with certain labels from 
> the training data and put them into the validation fold, i.e. we want to ensure 
> that the same label is not in both the testing and training sets.
> Mainstream packages like scikit-learn already support such a cross-validation 
> method. 
> (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18158) Submit app in standalone cluster mode supervised with HA: all masters have to be up and running

2016-10-28 Thread Stephan Kepser (JIRA)
Stephan Kepser created SPARK-18158:
--

 Summary: Submit app in standalone cluster mode supervised with HA: 
all masters have to be up and running
 Key: SPARK-18158
 URL: https://issues.apache.org/jira/browse/SPARK-18158
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.2
Reporter: Stephan Kepser


We use Spark in standalone cluster mode with HA via three masters. If one 
submits an app with
> spark-submit --deploy-mode cluster --supervise --master node1,node2,node3
all masters (node1, node2, node3) must be up and running. If one standby master is 
down, spark-submit fails. If you remove the master that is down (say node2) from 
the list of masters, as in
> spark-submit --deploy-mode cluster --supervise --master node1,node3
the submit succeeds.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18151) CLONE - MetadataLog should support purging old logs

2016-10-28 Thread Sunil Kumar (JIRA)
Sunil Kumar created SPARK-18151:
---

 Summary: CLONE - MetadataLog should support purging old logs
 Key: SPARK-18151
 URL: https://issues.apache.org/jira/browse/SPARK-18151
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Streaming
Reporter: Sunil Kumar
Assignee: Peter Lee
 Fix For: 2.0.1, 2.1.0


This is a useful primitive operation to have to support checkpointing and 
forgetting old logs.
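
An illustrative sketch of what such a purge primitive could look like on the MetadataLog interface (the signature is an assumption for illustration, not the final API):

{noformat}
// Everything strictly older than thresholdBatchId is no longer needed for
// recovery and can be deleted from the log.
trait MetadataLog[T] {
  def add(batchId: Long, metadata: T): Boolean
  def get(batchId: Long): Option[T]
  def getLatest(): Option[(Long, T)]

  // Proposed: remove all log entries whose batchId is < thresholdBatchId.
  def purge(thresholdBatchId: Long): Unit
}
{noformat}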




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18154) CLONE - Change Source API so that sources do not need to keep unbounded state

2016-10-28 Thread Sunil Kumar (JIRA)
Sunil Kumar created SPARK-18154:
---

 Summary: CLONE - Change Source API so that sources do not need to 
keep unbounded state
 Key: SPARK-18154
 URL: https://issues.apache.org/jira/browse/SPARK-18154
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Affects Versions: 2.0.0, 2.0.1
Reporter: Sunil Kumar
Assignee: Frederick Reiss
 Fix For: 2.0.3, 2.1.0


The version of the Source API in Spark 2.0.0 defines a single getBatch() method 
for fetching records from the source, with the following Scaladoc comments 
defining the semantics:

{noformat}
/**
 * Returns the data that is between the offsets (`start`, `end`]. When `start` is
 * `None` then the batch should begin with the first available record. This method
 * must always return the same data for a particular `start` and `end` pair.
 */
def getBatch(start: Option[Offset], end: Offset): DataFrame
{noformat}
These semantics mean that a Source must retain all past history for the stream 
that it backs. Further, a Source is also required to retain this data across 
restarts of the process where the Source is instantiated, even when the Source 
is restarted on a different machine.
These restrictions make it difficult to implement the Source API, as any 
implementation requires potentially unbounded amounts of distributed storage.
See the mailing list thread at 
[http://apache-spark-developers-list.1001551.n3.nabble.com/Source-API-requires-unbounded-distributed-storage-td18551.html]
 for more information.
This JIRA will cover augmenting the Source API with an additional callback that 
will allow the Structured Streaming scheduler to notify the source when it is safe 
to discard buffered data.
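
A sketch of what such a callback could look like on the Source trait (the name and signature are illustrative, not the final API; `Offset` and `DataFrame` are the types already used by the existing getBatch() contract):

{noformat}
trait Source {
  // Existing contract: return the data in the offset range (start, end].
  def getBatch(start: Option[Offset], end: Offset): DataFrame

  // Proposed (illustrative): called by the engine once everything up to and
  // including `end` has been durably processed, so the source may discard
  // any buffered data at or before that offset.
  def commit(end: Offset): Unit
}
{noformat}

With such a callback, a source only has to retain data between the last committed offset and the latest available offset, rather than the entire history of the stream.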



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18155) CLONE - HDFSMetadataLog should not leak CRC files

2016-10-28 Thread Sunil Kumar (JIRA)
Sunil Kumar created SPARK-18155:
---

 Summary: CLONE - HDFSMetadataLog should not leak CRC files
 Key: SPARK-18155
 URL: https://issues.apache.org/jira/browse/SPARK-18155
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Sunil Kumar


When HDFSMetadataLog uses a log directory on a filesystem other than HDFS (e.g. 
NFS or the driver node's local filesystem), the class leaves orphan checksum 
(CRC) files in the log directory. The files have names that follow the pattern 
"..[long UUID hex string].tmp.crc". These files exist because HDFSMetaDataLog 
renames other temporary files without renaming the corresponding checksum 
files. There is one CRC file per batch, so the directory fills up quite quickly.

I'm not certain, but this problem might also occur on certain versions of the 
HDFS APIs.
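
A minimal sketch (illustrative, not the actual HDFSMetadataLog code) of renaming a temporary metadata file while also removing the checksum file that ChecksumFileSystem-backed filesystems (local disk, NFS) create for the temporary name:

{noformat}
import org.apache.hadoop.fs.{FileSystem, Path}

def renameWithoutLeakingCrc(fs: FileSystem, tmp: Path, dst: Path): Unit = {
  fs.rename(tmp, dst)
  // ChecksumFileSystem stores checksums as a hidden sibling file ".<name>.crc";
  // it is not renamed along with the data file, so remove it explicitly.
  val crc = new Path(tmp.getParent, s".${tmp.getName}.crc")
  if (fs.exists(crc)) fs.delete(crc, false)
}
{noformat}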



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18152) CLONE - FileStreamSource should not track the list of seen files indefinitely

2016-10-28 Thread Sunil Kumar (JIRA)
Sunil Kumar created SPARK-18152:
---

 Summary: CLONE - FileStreamSource should not track the list of 
seen files indefinitely
 Key: SPARK-18152
 URL: https://issues.apache.org/jira/browse/SPARK-18152
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Streaming
Reporter: Sunil Kumar
Assignee: Peter Lee
 Fix For: 2.0.1, 2.1.0


FileStreamSource currently tracks all the files seen indefinitely, which means 
it can run out of memory or overflow.
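
A conceptual sketch of bounding that state (illustrative, not the actual FileStreamSource code), assuming seen files can be aged out once they are older than some configurable threshold:

{noformat}
import scala.collection.mutable

class SeenFilesMap(maxAgeMs: Long) {
  // path -> timestamp of the file when it was first seen
  private val seen = mutable.HashMap[String, Long]()

  def add(path: String, timestampMs: Long): Unit = seen(path) = timestampMs

  def isNewFile(path: String): Boolean = !seen.contains(path)

  // Entries old enough that the source will never consider them again can be
  // dropped, keeping the map (and thus memory usage) bounded.
  def purge(latestTimestampMs: Long): Unit = {
    val threshold = latestTimestampMs - maxAgeMs
    seen.retain { case (_, ts) => ts >= threshold }
  }
}
{noformat}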




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18150) Spark 2.* fails to create partitions for avro files

2016-10-28 Thread Sunil Kumar (JIRA)
Sunil Kumar created SPARK-18150:
---

 Summary:  Spark 2.* fails to create partitions for avro files
 Key: SPARK-18150
 URL: https://issues.apache.org/jira/browse/SPARK-18150
 Project: Spark
  Issue Type: Bug
  Components: SQL, Streaming
Reporter: Sunil Kumar
Priority: Blocker


This is an umbrella ticket to track various things that are required in order 
to have the engine for structured streaming run non-stop in production.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18157) CLONE - Support purging aged file entry for FileStreamSource metadata log

2016-10-28 Thread Sunil Kumar (JIRA)
Sunil Kumar created SPARK-18157:
---

 Summary: CLONE - Support purging aged file entry for 
FileStreamSource metadata log
 Key: SPARK-18157
 URL: https://issues.apache.org/jira/browse/SPARK-18157
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Streaming
Reporter: Sunil Kumar
Priority: Minor


Currently, with SPARK-15698, the FileStreamSource metadata log is compacted 
periodically (every 10 batches by default), which means a compacted batch file 
contains every file entry processed so far. As time passes, the compacted batch 
file grows into a relatively large file.

With SPARK-17165, {{FileStreamSource}} no longer tracks aged file entries, but 
the log still keeps the full records. This is unnecessary and quite 
time-consuming during recovery, so this issue proposes to also add file-entry 
purging to the {{FileStreamSource}} metadata log.

This is pending on SPARK-15698.
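
A conceptual sketch of the proposed purging during compaction (illustrative, assuming each log entry carries the timestamp of the file it refers to):

{noformat}
case class FileEntry(path: String, timestampMs: Long)

// When writing a compacted batch, keep only entries that are still within the
// retention window; aged entries are no longer tracked by FileStreamSource anyway.
def compactWithPurge(entries: Seq[FileEntry],
                     latestTimestampMs: Long,
                     maxAgeMs: Long): Seq[FileEntry] =
  entries.filter(_.timestampMs >= latestTimestampMs - maxAgeMs)
{noformat}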



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18156) CLONE - StreamExecution should discard unneeded metadata

2016-10-28 Thread Sunil Kumar (JIRA)
Sunil Kumar created SPARK-18156:
---

 Summary: CLONE - StreamExecution should discard unneeded metadata
 Key: SPARK-18156
 URL: https://issues.apache.org/jira/browse/SPARK-18156
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Sunil Kumar
Assignee: Frederick Reiss
 Fix For: 2.0.1, 2.1.0


The StreamExecution maintains a write-ahead log of batch metadata in order to 
allow repeating previously in-flight batches if the driver is restarted. 
StreamExecution does not garbage-collect or compact this log in any way.

Since the log is implemented with HDFSMetadataLog, these files will consume 
memory on the HDFS NameNode. Specifically, each log file will consume about 300 
bytes of NameNode memory (150 bytes for the inode and 150 bytes for the block 
of file contents; see 
[https://www.cloudera.com/documentation/enterprise/latest/topics/admin_nn_memory_config.html]).
 An application with a 100 msec batch interval will increase the NameNode's 
heap usage by about 250 MB per day.
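
A quick back-of-the-envelope check of that figure (a sketch, using the ~300 bytes/file estimate above):

{noformat}
// 100 ms batches => 10 batches (and thus 10 log files) per second
val batchesPerDay = 10.0 * 60 * 60 * 24   // = 864,000 log files per day
val heapPerDay    = batchesPerDay * 300   // ~300 bytes of NameNode heap each
println(heapPerDay / 1e6)                 // ~259 MB per day, consistent with ~250 MB
{noformat}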

There is also the matter of recovery. StreamExecution reads its entire log when 
restarting. This read operation will be very expensive if the log contains 
millions of entries spread over millions of files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18153) CLONE - Ability to remove old metadata for structure streaming MetadataLog

2016-10-28 Thread Sunil Kumar (JIRA)
Sunil Kumar created SPARK-18153:
---

 Summary: CLONE - Ability to remove old metadata for structure 
streaming MetadataLog
 Key: SPARK-18153
 URL: https://issues.apache.org/jira/browse/SPARK-18153
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Streaming
Reporter: Sunil Kumar
Assignee: Saisai Shao
 Fix For: 2.0.1, 2.1.0


The current MetadataLog lacks the ability to remove old checkpoint files. We should 
add this functionality to the MetadataLog and honor it wherever MetadataLog is 
used; that will reduce the number of unnecessary small files in long-running 
scenarios.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


