[jira] [Assigned] (SPARK-10534) ORDER BY clause allows only columns that are present in SELECT statement

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10534:


Assignee: (was: Apache Spark)

> ORDER BY clause allows only columns that are present in SELECT statement
> 
>
> Key: SPARK-10534
> URL: https://issues.apache.org/jira/browse/SPARK-10534
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Michal Cwienczek
>
> When invoking the query SELECT EmployeeID from Employees order by YEAR(HireDate), 
> Spark 1.5 throws an exception:
> {code}
> cannot resolve 'MsSqlNorthwindJobServerTested_dbo_Employees.HireDate' given 
> input columns EmployeeID; line 2 pos 14 StackTrace: 
> org.apache.spark.sql.AnalysisException: cannot resolve 
> 'MsSqlNorthwindJobServerTested_dbo_Employees.HireDate' given input columns 
> EmployeeID; line 2 pos 14
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$7.apply(TreeNode.scala:268)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:266)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>  
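A minimal workaround sketch for the report above, using the Spark 1.5 DataFrame API (the table name follows the query; the helper column "hireYear" is illustrative): expose the sort key as a column, order by it, then project it away.

{code}
import org.apache.spark.sql.functions.{col, year}

val result = sqlContext.table("Employees")
  .withColumn("hireYear", year(col("HireDate")))  // make the ORDER BY expression an output column
  .orderBy("hireYear")                            // now the sort key resolves
  .select("EmployeeID")                           // drop the helper column again
{code}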

[jira] [Assigned] (SPARK-11078) Ensure spilling tests are actually spilling

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11078:


Assignee: Apache Spark  (was: Andrew Or)

> Ensure spilling tests are actually spilling
> ---
>
> Key: SPARK-11078
> URL: https://issues.apache.org/jira/browse/SPARK-11078
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Reporter: Andrew Or
>Assignee: Apache Spark
>
> The new unified memory management model in SPARK-10983 uncovered many brittle 
> tests that rely on arbitrary thresholds to detect spilling. Some tests don't 
> even assert that spilling did occur.
> We should go through all the places where we test spilling behavior and 
> correct the tests, a subset of which are definitely incorrect. Potential 
> suspects:
> - UnsafeShuffleSuite
> - ExternalAppendOnlyMapSuite
> - ExternalSorterSuite
> - SQLQuerySuite
> - DistributedSuite
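One way to make these suites assert spilling directly, instead of inferring it from size thresholds, is to accumulate the per-task spill metrics with a listener. A hedged sketch (the test wiring is illustrative, not an actual suite):

{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class SpillListener extends SparkListener {
  @volatile var spilledBytes = 0L
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      spilledBytes += metrics.memoryBytesSpilled + metrics.diskBytesSpilled
    }
  }
}

// Inside a test body:
//   val listener = new SpillListener
//   sc.addSparkListener(listener)
//   ... run the workload under a small memory budget ...
//   assert(listener.spilledBytes > 0, "expected the job to spill")
{code}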






[jira] [Assigned] (SPARK-11078) Ensure spilling tests are actually spilling

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11078:


Assignee: Andrew Or  (was: Apache Spark)

> Ensure spilling tests are actually spilling
> ---
>
> Key: SPARK-11078
> URL: https://issues.apache.org/jira/browse/SPARK-11078
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> The new unified memory management model in SPARK-10983 uncovered many brittle 
> tests that rely on arbitrary thresholds to detect spilling. Some tests don't 
> even assert that spilling did occur.
> We should go through all the places where we test spilling behavior and 
> correct the tests, a subset of which are definitely incorrect. Potential 
> suspects:
> - UnsafeShuffleSuite
> - ExternalAppendOnlyMapSuite
> - ExternalSorterSuite
> - SQLQuerySuite
> - DistributedSuite






[jira] [Created] (SPARK-11117) PhysicalRDD.outputsUnsafeRows should return true when the underlying data source produces UnsafeRows

2015-10-14 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-11117:
--

 Summary: PhysicalRDD.outputsUnsafeRows should return true when the 
underlying data source produces UnsafeRows
 Key: SPARK-11117
 URL: https://issues.apache.org/jira/browse/SPARK-11117
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.1
Reporter: Cheng Lian
Assignee: Cheng Lian


{{PhysicalRDD}} doesn't override {{SparkPlan.outputsUnsafeRows}}, and thus 
can't avoid {{ConvertToUnsafe}} when upper level operators only support 
{{UnsafeRow}} even if the underlying data source produces {{UnsafeRow}}.
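A hedged sketch of the direction described (Spark 1.5 internals; the extra constructor flag, its default, and the abbreviated parameter list are illustrative, not necessarily the shape of the final patch):

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.execution.LeafNode

private[sql] case class PhysicalRDD(
    output: Seq[Attribute],
    rdd: RDD[InternalRow],
    extraInformation: String,
    override val outputsUnsafeRows: Boolean = false)  // let the data source declare UnsafeRow output
  extends LeafNode {

  protected override def doExecute(): RDD[InternalRow] = rdd
}
{code}

With such a flag, the planner can skip inserting ConvertToUnsafe whenever the underlying relation already emits UnsafeRows.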






[jira] [Assigned] (SPARK-11117) PhysicalRDD.outputsUnsafeRows should return true when the underlying data source produces UnsafeRows

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11117:


Assignee: Apache Spark  (was: Cheng Lian)

> PhysicalRDD.outputsUnsafeRows should return true when the underlying data 
> source produces UnsafeRows
> 
>
> Key: SPARK-11117
> URL: https://issues.apache.org/jira/browse/SPARK-11117
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Cheng Lian
>Assignee: Apache Spark
>
> {{PhysicalRDD}} doesn't override {{SparkPlan.outputsUnsafeRows}}, and thus 
> can't avoid {{ConvertToUnsafe}} when upper level operators only support 
> {{UnsafeRow}} even if the underlying data source produces {{UnsafeRow}}.






[jira] [Assigned] (SPARK-11117) PhysicalRDD.outputsUnsafeRows should return true when the underlying data source produces UnsafeRows

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11117:


Assignee: Cheng Lian  (was: Apache Spark)

> PhysicalRDD.outputsUnsafeRows should return true when the underlying data 
> source produces UnsafeRows
> 
>
> Key: SPARK-11117
> URL: https://issues.apache.org/jira/browse/SPARK-11117
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> {{PhysicalRDD}} doesn't override {{SparkPlan.outputsUnsafeRows}}, and thus 
> can't avoid {{ConvertToUnsafe}} when upper level operators only support 
> {{UnsafeRow}} even if the underlying data source produces {{UnsafeRow}}.






[jira] [Commented] (SPARK-11117) PhysicalRDD.outputsUnsafeRows should return true when the underlying data source produces UnsafeRows

2015-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957951#comment-14957951
 ] 

Apache Spark commented on SPARK-11117:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/9125

> PhysicalRDD.outputsUnsafeRows should return true when the underlying data 
> source produces UnsafeRows
> 
>
> Key: SPARK-11117
> URL: https://issues.apache.org/jira/browse/SPARK-11117
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> {{PhysicalRDD}} doesn't override {{SparkPlan.outputsUnsafeRows}}, and thus 
> can't avoid {{ConvertToUnsafe}} when upper level operators only support 
> {{UnsafeRow}} even if the underlying data source produces {{UnsafeRow}}.






[jira] [Commented] (SPARK-11067) Spark SQL thrift server fails to handle decimal value

2015-10-14 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957806#comment-14957806
 ] 

Alex Liu commented on SPARK-11067:
--

Let's keep this ticket focused on simply fixing the exception. We could implement 
further improvements to decimal handling in another ticket if it involves too many 
changes.

> Spark SQL thrift server fails to handle decimal value
> -
>
> Key: SPARK-11067
> URL: https://issues.apache.org/jira/browse/SPARK-11067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Alex Liu
> Attachments: SPARK-11067.1.patch.txt
>
>
> When executing the following query through beeline connecting to Spark sql 
> thrift server, it errors out for decimal column
> {code}
> Select decimal_column from table
> WARN  2015-10-09 15:04:00 
> org.apache.hive.service.cli.thrift.ThriftCLIService: Error fetching results: 
> java.lang.ClassCastException: java.math.BigDecimal cannot be cast to 
> org.apache.hadoop.hive.common.type.HiveDecimal
>   at 
> org.apache.hive.service.cli.ColumnValue.toTColumnValue(ColumnValue.java:174) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(Shim13.scala:144)
>  ~[spark-hive-thriftserver_2.10-1.4.1.1.jar:1.4.1.1]
>   at 
> org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:192)
>  ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:471)
>  ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:405) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:530)
>  ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
>  [hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
>  [hive-service-0.13.1a.jar:0.13.1a]
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) 
> [libthrift-0.9.2.jar:0.9.2]
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) 
> [libthrift-0.9.2.jar:0.9.2]
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
>  [hive-service-0.13.1a.jar:4.8.1-SNAPSHOT]
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
>  [libthrift-0.9.2.jar:0.9.2]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  [na:1.7.0_55]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  [na:1.7.0_55]
>   at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> {code}
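A hedged sketch of the narrow fix being discussed here (not the attached patch): convert java.math.BigDecimal values to HiveDecimal before the row reaches hive-service, since ColumnValue expects the latter.

{code}
import java.math.{BigDecimal => JBigDecimal}
import org.apache.hadoop.hive.common.type.HiveDecimal

// Applied to each field of a result row before it is passed to ColumnValue.toTColumnValue.
def toHiveCompatible(value: Any): Any = value match {
  case d: JBigDecimal => HiveDecimal.create(d)
  case other          => other
}
{code}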






[jira] [Created] (SPARK-11115) IPv6 regression

2015-10-14 Thread Thomas Dudziak (JIRA)
Thomas Dudziak created SPARK-11115:
--

 Summary: IPv6 regression
 Key: SPARK-11115
 URL: https://issues.apache.org/jira/browse/SPARK-11115
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.1
 Environment: CentOS 6.7, Java 1.8.0_25, dual stack IPv4 + IPv6
Reporter: Thomas Dudziak
Priority: Critical


When running Spark with -Djava.net.preferIPv6Addresses=true, I get this error:

15/10/14 14:36:01 ERROR SparkContext: Error initializing SparkContext.
java.lang.AssertionError: assertion failed: Expected hostname
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.util.Utils$.checkHost(Utils.scala:805)
at 
org.apache.spark.storage.BlockManagerId.<init>(BlockManagerId.scala:48)
at 
org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:107)
at 
org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:190)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:528)
at 
org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)

Looking at the code in question, it seems that the code will only work for IPv4 
as it assumes ':' can't be part of the hostname (which it clearly can for IPv6 
addresses).
Instead, the code should probably use Guava's HostAndPort class, i.e.:

  def checkHost(host: String, message: String = "") {
assert(!HostAndPort.fromString(host).hasPort, message)
  }

  def checkHostPort(hostPort: String, message: String = "") {
assert(HostAndPort.fromString(hostPort).hasPort, message)
  }
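For reference, a quick illustration of how Guava's HostAndPort behaves on the inputs in question (bracketed IPv6 literals are parsed explicitly, so a colon in the address is not mistaken for a port separator):

{code}
import com.google.common.net.HostAndPort

HostAndPort.fromString("example.com:7077").hasPort    // true
HostAndPort.fromString("192.168.0.1").hasPort         // false
HostAndPort.fromString("[2001:db8::1]").hasPort       // false -- bare IPv6 host
HostAndPort.fromString("[2001:db8::1]:7077").hasPort  // true
{code}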







[jira] [Assigned] (SPARK-11027) Better group distinct columns in query compilation

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11027:


Assignee: Apache Spark

> Better group distinct columns in query compilation
> --
>
> Key: SPARK-11027
> URL: https://issues.apache.org/jira/browse/SPARK-11027
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> In AggregationQuerySuite, we have a test
> {code}
> checkAnswer(
>   sqlContext.sql(
> """
>   |SELECT sum(distinct value1), kEY - 100, count(distinct value1)
>   |FROM agg2
>   |GROUP BY Key - 100
> """.stripMargin),
>   Row(40, -99, 2) :: Row(0, -98, 2) :: Row(null, -97, 0) :: Row(30, null, 
> 3) :: Nil)
> {code}
> We will treat it as having two distinct columns because sum causes a cast on 
> value1. Maybe we can ignore the cast when we group distinct columns. So, it 
> will not be treated as having two distinct columns.
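A hedged sketch of the idea in the last paragraph (an illustrative helper, not the planner's actual code): strip top-level casts before deciding whether two DISTINCT aggregates reference the same column.

{code}
import org.apache.spark.sql.catalyst.expressions.{Cast, Expression}

def stripCast(e: Expression): Expression = e match {
  case Cast(child, _) => stripCast(child)
  case other          => other
}

// With this normalization, sum(distinct CAST(value1 AS bigint)) and
// count(distinct value1) group to the same distinct column.
{code}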






[jira] [Assigned] (SPARK-11027) Better group distinct columns in query compilation

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11027:


Assignee: (was: Apache Spark)

> Better group distinct columns in query compilation
> --
>
> Key: SPARK-11027
> URL: https://issues.apache.org/jira/browse/SPARK-11027
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> In AggregationQuerySuite, we have a test
> {code}
> checkAnswer(
>   sqlContext.sql(
> """
>   |SELECT sum(distinct value1), kEY - 100, count(distinct value1)
>   |FROM agg2
>   |GROUP BY Key - 100
> """.stripMargin),
>   Row(40, -99, 2) :: Row(0, -98, 2) :: Row(null, -97, 0) :: Row(30, null, 
> 3) :: Nil)
> {code}
> We will treat it as having two distinct columns because sum causes a cast on 
> value1. Maybe we can ignore the cast when we group distinct columns. So, it 
> will not be treated as having two distinct columns.






[jira] [Commented] (SPARK-11027) Better group distinct columns in query compilation

2015-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956672#comment-14956672
 ] 

Apache Spark commented on SPARK-11027:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/9115

> Better group distinct columns in query compilation
> --
>
> Key: SPARK-11027
> URL: https://issues.apache.org/jira/browse/SPARK-11027
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> In AggregationQuerySuite, we have a test
> {code}
> checkAnswer(
>   sqlContext.sql(
> """
>   |SELECT sum(distinct value1), kEY - 100, count(distinct value1)
>   |FROM agg2
>   |GROUP BY Key - 100
> """.stripMargin),
>   Row(40, -99, 2) :: Row(0, -98, 2) :: Row(null, -97, 0) :: Row(30, null, 
> 3) :: Nil)
> {code}
> We will treat it as having two distinct columns because sum causes a cast on 
> value1. Maybe we can ignore the cast when we group distinct columns. So, it 
> will not be treated as having two distinct columns.






[jira] [Resolved] (SPARK-11049) If a single executor fails to allocate memory, entire job fails

2015-10-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11049.
---
Resolution: Not A Problem

Pending more info

> If a single executor fails to allocate memory, entire job fails
> ---
>
> Key: SPARK-11049
> URL: https://issues.apache.org/jira/browse/SPARK-11049
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Brian
>
> To reproduce:
> * Create a spark cluster using start-master.sh and start-slave.sh (I believe 
> this is the "standalone cluster manager?").  
> * Leave a process running on some nodes that takes up a significant amount 
> of RAM.
> * Leave some nodes with plenty of RAM to run spark.
> * Run a job against this cluster with spark.executor.memory asking for all or 
> most of the memory available on each node.
> On the node that has insufficient memory, there will of course be an error 
> like:
> Error occurred during initialization of VM
> Could not reserve enough space for object heap
> Could not create the Java virtual machine.
> On the driver node, and in the spark master UI, I see that _all_ executors 
> exit or are killed, and the entire job fails.  It would be better if there 
> was an indication of which individual node is actually at fault.  It would 
> also be better if the cluster manager could handle failing-over to nodes that 
> are still operating properly and have sufficient RAM.






[jira] [Commented] (SPARK-11058) failed spark job reports on YARN as successful

2015-10-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956640#comment-14956640
 ] 

Sean Owen commented on SPARK-11058:
---

I suspect that either your job didn't actually fail at the driver level, or 
this is in fact the same problem handling the ".inprogress" file reported in 
other JIRAs and fixed since 1.3.x. Are you able to try 1.5.x?

> failed spark job reports on YARN as successful
> --
>
> Key: SPARK-11058
> URL: https://issues.apache.org/jira/browse/SPARK-11058
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.0
> Environment: CDH 5.4
>Reporter: Lan Jiang
>Priority: Minor
>
> I have a spark batch job running on CDH5.4 + Spark 1.3.0. Job is submitted in 
> “yarn-client” mode. The job itself failed due to YARN kills several executor 
> containers because the containers exceeded the memory limit posed by YARN. 
> However, when I went to the YARN resource manager site, it displayed the job 
> as successful. I found there was an issue reported in JIRA 
> https://issues.apache.org/jira/browse/SPARK-3627, but it says it was fixed in 
> Spark 1.2. On Spark history server, it shows the job as “Incomplete”. 






[jira] [Commented] (SPARK-11110) Scala 2.11 build fails due to compiler errors

2015-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957984#comment-14957984
 ] 

Apache Spark commented on SPARK-11110:
--

User 'jodersky' has created a pull request for this issue:
https://github.com/apache/spark/pull/9126

> Scala 2.11 build fails due to compiler errors
> -
>
> Key: SPARK-11110
> URL: https://issues.apache.org/jira/browse/SPARK-11110
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Patrick Wendell
>Assignee: Jakob Odersky
>Priority: Critical
>
> Right now the 2.11 build is failing due to compiler errors in SBT (though not 
> in Maven). I have updated our 2.11 compile test harness to catch this.
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1667/consoleFull
> {code}
> [error] 
> /home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308:
>  no valid targets for annotation on value conf - it is discarded unused. You 
> may specify targets with meta-annotations, e.g. @(transient @param)
> [error] private[netty] class NettyRpcEndpointRef(@transient conf: SparkConf)
> [error] 
> {code}
> This is one error, but there may be others past this point (the compile fails 
> fast).
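The compiler message itself suggests a meta-annotation; a hedged sketch of that direction (superclass and body omitted, and the merged fix may instead restructure the field):

{code}
import scala.annotation.meta.param
import org.apache.spark.SparkConf

// Give @transient an explicit target so scalac 2.11 no longer reports an
// annotation with no valid target.
private[netty] class NettyRpcEndpointRef(@(transient @param) conf: SparkConf)
{code}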






[jira] [Assigned] (SPARK-11110) Scala 2.11 build fails due to compiler errors

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11110:


Assignee: Apache Spark  (was: Jakob Odersky)

> Scala 2.11 build fails due to compiler errors
> -
>
> Key: SPARK-11110
> URL: https://issues.apache.org/jira/browse/SPARK-11110
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Patrick Wendell
>Assignee: Apache Spark
>Priority: Critical
>
> Right now the 2.11 build is failing due to compiler errors in SBT (though not 
> in Maven). I have updated our 2.11 compile test harness to catch this.
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1667/consoleFull
> {code}
> [error] 
> /home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308:
>  no valid targets for annotation on value conf - it is discarded unused. You 
> may specify targets with meta-annotations, e.g. @(transient @param)
> [error] private[netty] class NettyRpcEndpointRef(@transient conf: SparkConf)
> [error] 
> {code}
> This is one error, but there may be others past this point (the compile fails 
> fast).






[jira] [Assigned] (SPARK-11110) Scala 2.11 build fails due to compiler errors

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11110:


Assignee: Jakob Odersky  (was: Apache Spark)

> Scala 2.11 build fails due to compiler errors
> -
>
> Key: SPARK-11110
> URL: https://issues.apache.org/jira/browse/SPARK-11110
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Patrick Wendell
>Assignee: Jakob Odersky
>Priority: Critical
>
> Right now the 2.11 build is failing due to compiler errors in SBT (though not 
> in Maven). I have updated our 2.11 compile test harness to catch this.
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1667/consoleFull
> {code}
> [error] 
> /home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308:
>  no valid targets for annotation on value conf - it is discarded unused. You 
> may specify targets with meta-annotations, e.g. @(transient @param)
> [error] private[netty] class NettyRpcEndpointRef(@transient conf: SparkConf)
> [error] 
> {code}
> This is one error, but there may be others past this point (the compile fails 
> fast).






[jira] [Assigned] (SPARK-10984) Simplify *MemoryManager class structure

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10984:


Assignee: Josh Rosen  (was: Apache Spark)

> Simplify *MemoryManager class structure
> ---
>
> Key: SPARK-10984
> URL: https://issues.apache.org/jira/browse/SPARK-10984
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Josh Rosen
>
> This is a refactoring task.
> After SPARK-10956 gets merged, we will have the following:
> - MemoryManager
> - StaticMemoryManager
> - ExecutorMemoryManager
> - TaskMemoryManager
> - ShuffleMemoryManager
> This is pretty confusing. The goal is to merge ShuffleMemoryManager and 
> ExecutorMemoryManager and move them into the top-level MemoryManager abstract 
> class. Then TaskMemoryManager should be renamed something else and used by 
> MemoryManager, such that the new hierarchy becomes:
> - MemoryManager
> - StaticMemoryManager
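A hedged sketch of the target shape (method names and signatures are illustrative placeholders, not the eventual API):

{code}
// Top-level abstraction that absorbs the execution-memory duties of
// ShuffleMemoryManager/ExecutorMemoryManager.
abstract class MemoryManager {
  def acquireExecutionMemory(numBytes: Long): Long
  def releaseExecutionMemory(numBytes: Long): Unit
}

// Keeps the fixed execution/storage split of the pre-unified model.
class StaticMemoryManager(maxExecutionMemory: Long) extends MemoryManager {
  private var used = 0L
  override def acquireExecutionMemory(numBytes: Long): Long = synchronized {
    val granted = math.min(numBytes, maxExecutionMemory - used)
    used += granted
    granted
  }
  override def releaseExecutionMemory(numBytes: Long): Unit = synchronized {
    used = math.max(0L, used - numBytes)
  }
}
{code}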






[jira] [Commented] (SPARK-10984) Simplify *MemoryManager class structure

2015-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958044#comment-14958044
 ] 

Apache Spark commented on SPARK-10984:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/9127

> Simplify *MemoryManager class structure
> ---
>
> Key: SPARK-10984
> URL: https://issues.apache.org/jira/browse/SPARK-10984
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Josh Rosen
>
> This is a refactoring task.
> After SPARK-10956 gets merged, we will have the following:
> - MemoryManager
> - StaticMemoryManager
> - ExecutorMemoryManager
> - TaskMemoryManager
> - ShuffleMemoryManager
> This is pretty confusing. The goal is to merge ShuffleMemoryManager and 
> ExecutorMemoryManager and move them into the top-level MemoryManager abstract 
> class. Then TaskMemoryManager should be renamed something else and used by 
> MemoryManager, such that the new hierarchy becomes:
> - MemoryManager
> - StaticMemoryManager






[jira] [Assigned] (SPARK-10984) Simplify *MemoryManager class structure

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10984:


Assignee: Apache Spark  (was: Josh Rosen)

> Simplify *MemoryManager class structure
> ---
>
> Key: SPARK-10984
> URL: https://issues.apache.org/jira/browse/SPARK-10984
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Apache Spark
>
> This is a refactoring task.
> After SPARK-10956 gets merged, we will have the following:
> - MemoryManager
> - StaticMemoryManager
> - ExecutorMemoryManager
> - TaskMemoryManager
> - ShuffleMemoryManager
> This is pretty confusing. The goal is to merge ShuffleMemoryManager and 
> ExecutorMemoryManager and move them into the top-level MemoryManager abstract 
> class. Then TaskMemoryManager should be renamed something else and used by 
> MemoryManager, such that the new hierarchy becomes:
> - MemoryManager
> - StaticMemoryManager






[jira] [Resolved] (SPARK-11017) Support ImperativeAggregates in TungstenAggregate

2015-10-14 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-11017.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9038
[https://github.com/apache/spark/pull/9038]

> Support ImperativeAggregates in TungstenAggregate
> -
>
> Key: SPARK-11017
> URL: https://issues.apache.org/jira/browse/SPARK-11017
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.6.0
>
>
> The TungstenAggregate operator currently only supports DeclarativeAggregate 
> functions (i.e. expression-based aggregates); we should extend it to also 
> support ImperativeAggregate functions.






[jira] [Comment Edited] (SPARK-11115) IPv6 regression

2015-10-14 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958078#comment-14958078
 ] 

Patrick Wendell edited comment on SPARK-11115 at 10/15/15 12:38 AM:


The title of this says "Regression" - did it regress from a previous version? I 
am going to update the title, let me know if there is any issue.


was (Author: pwendell):
The title of this says "Regression" - did it regression from a previous 
version? I am going to update the title, let me know if there is any issue.

> IPv6 regression
> ---
>
> Key: SPARK-11115
> URL: https://issues.apache.org/jira/browse/SPARK-11115
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: CentOS 6.7, Java 1.8.0_25, dual stack IPv4 + IPv6
>Reporter: Thomas Dudziak
>Priority: Critical
>
> When running Spark with -Djava.net.preferIPv6Addresses=true, I get this error:
> 15/10/14 14:36:01 ERROR SparkContext: Error initializing SparkContext.
> java.lang.AssertionError: assertion failed: Expected hostname
>   at scala.Predef$.assert(Predef.scala:179)
>   at org.apache.spark.util.Utils$.checkHost(Utils.scala:805)
>   at 
> org.apache.spark.storage.BlockManagerId.<init>(BlockManagerId.scala:48)
>   at 
> org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:107)
>   at 
> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:190)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:528)
>   at 
> org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
> Looking at the code in question, it seems that the code will only work for 
> IPv4 as it assumes ':' can't be part of the hostname (which it clearly can 
> for IPv6 addresses).
> Instead, the code should probably use Guava's HostAndPort class, i.e.:
>   def checkHost(host: String, message: String = "") {
> assert(!HostAndPort.fromString(host).hasPort, message)
>   }
>   def checkHostPort(hostPort: String, message: String = "") {
> assert(HostAndPort.fromString(hostPort).hasPort, message)
>   }






[jira] [Created] (SPARK-11121) Incorrect TaskLocation type

2015-10-14 Thread zhichao-li (JIRA)
zhichao-li created SPARK-11121:
--

 Summary: Incorrect TaskLocation type
 Key: SPARK-11121
 URL: https://issues.apache.org/jira/browse/SPARK-11121
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: zhichao-li
Priority: Minor


"toString" is the only difference between HostTaskLocation and 
HDFSCacheTaskLocation for the moment, but it would be better to correct this. 






[jira] [Updated] (SPARK-11081) Shade Jersey dependency to work around the compatibility issue with Jersey2

2015-10-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11081:

Component/s: Build

> Shade Jersey dependency to work around the compatibility issue with Jersey2
> ---
>
> Key: SPARK-11081
> URL: https://issues.apache.org/jira/browse/SPARK-11081
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Reporter: Mingyu Kim
>
> As seen from this thread 
> (https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3CCALte62yD8H3=2KVMiFs7NZjn929oJ133JkPLrNEj=vrx-d2...@mail.gmail.com%3E),
>  Spark is incompatible with Jersey 2 especially when Spark is embedded in an 
> application running with Jersey.






[jira] [Updated] (SPARK-11092) Add source URLs to API documentation.

2015-10-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11092:

Assignee: Jakob Odersky

> Add source URLs to API documentation.
> -
>
> Key: SPARK-11092
> URL: https://issues.apache.org/jira/browse/SPARK-11092
> Project: Spark
>  Issue Type: Documentation
>  Components: Build, Documentation
>Reporter: Jakob Odersky
>Assignee: Jakob Odersky
>Priority: Trivial
>
> It would be nice to have source URLs in the Spark scaladoc, similar to the 
> standard library (e.g. 
> http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).
> The fix should be really simple, just adding a line to the sbt unidoc 
> settings.
> I'll use the github repo url 
> bq. https://github.com/apache/spark/tree/v${version}/${FILE_PATH}
> Feel free to tell me if I should use something else as base url.
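A hedged sketch of what that one line could look like, assuming sbt-unidoc's ScalaUnidoc/unidoc keys (scaladoc substitutes €{FILE_PATH} per file):

{code}
scalacOptions in (ScalaUnidoc, unidoc) ++= Seq(
  "-doc-source-url",
  s"https://github.com/apache/spark/tree/v${version.value}/€{FILE_PATH}.scala",
  "-sourcepath", (baseDirectory in ThisBuild).value.getAbsolutePath
)
{code}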






[jira] [Resolved] (SPARK-10829) Scan DataSource with predicate expression combine partition key and attributes doesn't work

2015-10-14 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-10829.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8916
[https://github.com/apache/spark/pull/8916]

> Scan DataSource with predicate expression combine partition key and 
> attributes doesn't work
> ---
>
> Key: SPARK-10829
> URL: https://issues.apache.org/jira/browse/SPARK-10829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Critical
> Fix For: 1.6.0
>
>
> To reproduce that with the code:
> {code}
> withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
>   withTempPath { dir =>
> val path = s"${dir.getCanonicalPath}/part=1"
> (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)
> // If the "part = 1" filter gets pushed down, this query will throw 
> an exception since
> // "part" is not a valid column in the actual Parquet file
> checkAnswer(
>   sqlContext.read.parquet(path).filter("a > 0 and (part = 0 or a > 
> 1)"),
>   (2 to 3).map(i => Row(i, i.toString, 1)))
>   }
> }
> {code}
> We expect the result as:
> {code}
> 2, 1
> 3, 1
> {code}
> But we got:
> {code}
> 1, 1
> 2, 1
> 3, 1
> {code}






[jira] [Created] (SPARK-11118) PredictionModel.transform is calling incorrect transformSchema method

2015-10-14 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-11118:
-

 Summary: PredictionModel.transform is calling incorrect 
transformSchema method
 Key: SPARK-11118
 URL: https://issues.apache.org/jira/browse/SPARK-11118
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.6.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley


PredictionModel.transform calls the transformSchema defined in PipelineStage.  
It should instead call the one defined in PredictionModel, which does not take 
the logging parameter.

The current implementation forces there to be a "label" column, even during 
prediction.






[jira] [Closed] (SPARK-11118) PredictionModel.transform is calling incorrect transformSchema method

2015-10-14 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-11118.
-
Resolution: Invalid

Oops, cancel that.  The issue I was running into must be from elsewhere...

> PredictionModel.transform is calling incorrect transformSchema method
> -
>
> Key: SPARK-11118
> URL: https://issues.apache.org/jira/browse/SPARK-11118
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> PredictionModel.transform calls the transformSchema defined in PipelineStage. 
>  It should instead call the one defined in PredictionModel, which does not 
> take the logging parameter.
> The current implementation forces there to be a "label" column, even during 
> prediction.






[jira] [Comment Edited] (SPARK-11103) Filter applied on Merged Parquet shema with new column fail with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-14 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956988#comment-14956988
 ] 

Hyukjin Kwon edited comment on SPARK-11103 at 10/15/15 12:13 AM:
-

I tested this case. The problem is that Parquet filters are pushed down regardless 
of the schema of each split (or rather each file).

Would the predicate pushdown need to be prevented when using the mergeSchema option?


was (Author: hyukjin.kwon):
I tested this case. The problem was, Parquet filters are pushed down regardless 
of each schema of the splits (or rather files).

Would the predicate pushdown be prevented when using mergeSchema option?

> Filter applied on Merged Parquet shema with new column fail with 
> (java.lang.IllegalArgumentException: Column [column_name] was not found in 
> schema!)
> 
>
> Key: SPARK-11103
> URL: https://issues.apache.org/jira/browse/SPARK-11103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Dominic Ricard
>
> When evolving a schema in parquet files, spark properly expose all columns 
> found in the different parquet files but when trying to query the data, it is 
> not possible to apply a filter on a column that is not present in all files.
> To reproduce:
> *SQL:*
> {noformat}
> create table `table1` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=1/' as select 1 as `col1`;
> create table `table2` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=2/' as select 1 as `col1`, 2 as 
> `col2`;
> create table `table3` USING org.apache.spark.sql.parquet OPTIONS (path 
> "hdfs://:/path/to/table");
> select col1 from `table3` where col2 = 2;
> {noformat}
> The last select will output the following Stack Trace:
> {noformat}
> An error occurred when executing the SQL command:
> select col1 from `table3` where col2 = 2
> [Simba][HiveJDBCDriver](500051) ERROR processing query/statement. Error Code: 
> 0, SQL state: TStatus(statusCode:ERROR_STATUS, 
> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:org.apache.spark.SparkException:
>  Job aborted due to stage failure: Task 0 in stage 7212.0 failed 4 times, 
> most recent failure: Lost task 0.3 in stage 7212.0 (TID 138449, 
> 208.92.52.88): java.lang.IllegalArgumentException: Column [col2] was not 
> found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:94)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:180)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:160)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:155)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> 
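Until this is resolved, a hedged interim workaround consistent with the comment above is to turn the pushdown off for the session (this disables Parquet filter pushdown globally; it is not a fix):

{code}
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
{code}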

[jira] [Commented] (SPARK-10925) Exception when joining DataFrames

2015-10-14 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958054#comment-14958054
 ] 

Xiao Li commented on SPARK-10925:
-

I have not tried Spark 1.4, but inner joining two tables with the same column 
names will not automatically merge them in most commercial RDBMSs. They are 
treated as separate columns, even if the column names are the same. However, 
per the SQL standard, a natural join combines the columns with the same names. 

For example, in your test case, you can try this:

val df = sqlContext.createDataFrame(rdd)
val df1 = df;
val df2 = df1;
val df3 = df1.join(df2, df1("name") === df2("name"))
val df4 = df3.join(df2, df3("name") === df2("name"))
df4.show()

The exception you should get looks like:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Reference 
'name' is ambiguous, could be: name#1, name#5.;
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:672)
at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:660)
at SimpleApp$.main(SimpleApp.scala:49)
at SimpleApp.main(SimpleApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:680)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
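A hedged illustration of the usual way around that ambiguity: alias each side of the self-join so the duplicate column names can be disambiguated explicitly (names follow the snippet above).

{code}
import org.apache.spark.sql.functions.col

val joined = df.as("l").join(df.as("r"), col("l.name") === col("r.name"))
joined.select(col("l.name")).show()
{code}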





> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply an UDF on column "name"
> # apply an UDF on column "surname"
> # apply an UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 

[jira] [Updated] (SPARK-11115) Host verification is not correct for IPv6

2015-10-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11115:

Summary: Host verification is not correct for IPv6  (was: IPv6 regression)

> Host verification is not correct for IPv6
> -
>
> Key: SPARK-11115
> URL: https://issues.apache.org/jira/browse/SPARK-11115
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: CentOS 6.7, Java 1.8.0_25, dual stack IPv4 + IPv6
>Reporter: Thomas Dudziak
>Priority: Critical
>
> When running Spark with -Djava.net.preferIPv6Addresses=true, I get this error:
> 15/10/14 14:36:01 ERROR SparkContext: Error initializing SparkContext.
> java.lang.AssertionError: assertion failed: Expected hostname
>   at scala.Predef$.assert(Predef.scala:179)
>   at org.apache.spark.util.Utils$.checkHost(Utils.scala:805)
>   at 
> org.apache.spark.storage.BlockManagerId.<init>(BlockManagerId.scala:48)
>   at 
> org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:107)
>   at 
> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:190)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:528)
>   at 
> org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
> Looking at the code in question, it seems that the code will only work for 
> IPv4 as it assumes ':' can't be part of the hostname (which it clearly can 
> for IPv6 addresses).
> Instead, the code should probably use Guava's HostAndPort class, i.e.:
>   def checkHost(host: String, message: String = "") {
> assert(!HostAndPort.fromString(host).hasPort, message)
>   }
>   def checkHostPort(hostPort: String, message: String = "") {
> assert(HostAndPort.fromString(hostPort).hasPort, message)
>   }






[jira] [Commented] (SPARK-11115) IPv6 regression

2015-10-14 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958078#comment-14958078
 ] 

Patrick Wendell commented on SPARK-11115:
-

The title of this says "Regression" - did it regression from a previous 
version? I am going to update the title, let me know if there is any issue.

> IPv6 regression
> ---
>
> Key: SPARK-11115
> URL: https://issues.apache.org/jira/browse/SPARK-11115
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: CentOS 6.7, Java 1.8.0_25, dual stack IPv4 + IPv6
>Reporter: Thomas Dudziak
>Priority: Critical
>
> When running Spark with -Djava.net.preferIPv6Addresses=true, I get this error:
> 15/10/14 14:36:01 ERROR SparkContext: Error initializing SparkContext.
> java.lang.AssertionError: assertion failed: Expected hostname
>   at scala.Predef$.assert(Predef.scala:179)
>   at org.apache.spark.util.Utils$.checkHost(Utils.scala:805)
>   at 
> org.apache.spark.storage.BlockManagerId.<init>(BlockManagerId.scala:48)
>   at 
> org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:107)
>   at 
> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:190)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:528)
>   at 
> org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
> Looking at the code in question, it seems that the code will only work for 
> IPv4 as it assumes ':' can't be part of the hostname (which it clearly can 
> for IPv6 addresses).
> Instead, the code should probably use Guava's HostAndPort class, i.e.:
>   def checkHost(host: String, message: String = "") {
> assert(!HostAndPort.fromString(host).hasPort, message)
>   }
>   def checkHostPort(hostPort: String, message: String = "") {
> assert(HostAndPort.fromString(hostPort).hasPort, message)
>   }






[jira] [Commented] (SPARK-11098) RPC message ordering is not guaranteed

2015-10-14 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958103#comment-14958103
 ] 

Marcelo Vanzin commented on SPARK-11098:


So, while working on another patch in this area, I ran into this issue, and I 
don't think it's a problem in the RPC layer, but rather a problem of the code 
calling the RPC layer.

Even if somehow you synchronize things in the RPC env implementation so that 
RPCs are sent in the order they arrive, there are multiple threads that can be 
calling {{RpcEndpoint.send()}} or {{RpcEndpoint.ask()}} at the same time, and 
at that point there's no guarantee of any order.

The problem I ran into specifically was the Worker ignoring messages from the 
Master because it thought the master was not active. That's because those 
messages were arriving before the master had replied to the Worker's 
registration message. That's not the fault of the RPC layer, that's the fault 
of that reply being sent to the Worker as a separate message, instead of an RPC 
reply to the {{RegisterWorker}} message. {{Worker}} in this case should be 
using {{ask}} and getting the reply from that ask; that ensures the reply will 
arrive before any other messages the Master may want to send to the worker.

If you want to see how to do that properly, see how 
{{CoarseGrainedExecutorBackend}} does its registration with the scheduler using 
{{ask}} instead of {{send}}.

Anyway, I have that fixed in my patch; I might take it out as a separate fix 
and attach it to this bug. But I'm not sure whether other areas of the code 
suffer from the same problem as well.
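As a rough illustration of that ordering argument, here is a hedged sketch with made-up trait and message names (this is not Spark's actual RPC or deploy code):

{code}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object RegistrationSketch {
  // Stand-in for an RPC reference to the Master; only the two calling styles matter.
  trait MasterRef {
    def ask[T](msg: Any): Future[T] // request/reply: the reply is tied to this call
    def send(msg: Any): Unit        // fire-and-forget: no ordering vs. other messages
  }

  case class RegisterWorker(workerId: String)
  case class RegisteredWorker(masterUrl: String)

  // The worker waits on the reply to its own request, so it cannot observe later
  // Master messages before it knows the registration succeeded.
  def registerBlocking(master: MasterRef, workerId: String): RegisteredWorker =
    Await.result(master.ask[RegisteredWorker](RegisterWorker(workerId)), 30.seconds)
}
{code}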

> RPC message ordering is not guaranteed
> --
>
> Key: SPARK-11098
> URL: https://issues.apache.org/jira/browse/SPARK-11098
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>
> NettyRpcEnv doesn't guarantee message delivery order since there are multiple 
> threads sending messages in clientConnectionExecutor thread pool. We should 
> fix that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11006) Rename NullColumnAccess as NullColumnAccessor

2015-10-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11006:

Component/s: SQL

> Rename NullColumnAccess as NullColumnAccessor
> -
>
> Key: SPARK-11006
> URL: https://issues.apache.org/jira/browse/SPARK-11006
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Trivial
> Fix For: 1.6.0
>
>
> In sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnAccessor.scala 
> , NullColumnAccess should be renamed as NullColumnAccessor so that the same 
> convention is adhered to for the accessors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11122) Fatal warnings in sbt are not displayed as such

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11122:


Assignee: Apache Spark

> Fatal warnings in sbt are not displayed as such
> ---
>
> Key: SPARK-11122
> URL: https://issues.apache.org/jira/browse/SPARK-11122
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Jakob Odersky
>Assignee: Apache Spark
>
> The sbt script treats warnings (except dependency warnings) as errors, 
> however there is no visual difference between errors and fatal warnings, thus 
> leading to very confusing debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11122) Fatal warnings in sbt are not displayed as such

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11122:


Assignee: (was: Apache Spark)

> Fatal warnings in sbt are not displayed as such
> ---
>
> Key: SPARK-11122
> URL: https://issues.apache.org/jira/browse/SPARK-11122
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Jakob Odersky
>
> The sbt script treats warnings (except dependency warnings) as errors, 
> however there is no visual difference between errors and fatal warnings, thus 
> leading to very confusing debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-14 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958022#comment-14958022
 ] 

Sandy Ryza commented on SPARK-9999:
---

bq. The problem with doing this using a registry (like kryo in RDDs today) is 
that then you aren't finding out the object type until you have an example 
object from realizing the computation.

My suggestion was that the user would still need to pass the class object, so 
this shouldn't be a problem, unless I'm misunderstanding.

Thanks for the pointer to the test suite.  So am I to understand correctly that 
with Scala implicits magic I can do the following without any additional 
boilerplate?

{code}
import 

case class MyClass1()
case class MyClass2()

val ds : Dataset[MyClass1] = ...
val ds2: Dataset[MyClass2] = ds.map(funcThatConvertsFromMyClass1ToMyClass2)
{code}

and in Java, imagining those case classes above were POJOs, we'd be able to 
support the following?

{code}
Dataset ds2 = ds1.map(funcThatConvertsFromMyClass1ToMyClass2, 
MyClass2.class);
{code}

If that's the case, then that resolves my concerns above.

Lastly, though, IIUC, it seems like for all the common cases, we could register 
an object with the SparkContext that converts from ClassTag to Encoder, and the 
RDD API would work.  Where does that break down?

> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-9999
> URL: https://issues.apache.org/jira/browse/SPARK-9999
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11111) Fast null-safe join

2015-10-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11111:

Component/s: SQL

> Fast null-safe join
> ---
>
> Key: SPARK-11111
> URL: https://issues.apache.org/jira/browse/SPARK-11111
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Today, null safe joins are executed with a Cartesian product.
> {code}
> scala> sqlContext.sql("select * from t a join t b on (a.i <=> b.i)").explain
> == Physical Plan ==
> TungstenProject [i#2,j#3,i#7,j#8]
>  Filter (i#2 <=> i#7)
>   CartesianProduct
>LocalTableScan [i#2,j#3], [[1,1]]
>LocalTableScan [i#7,j#8], [[1,1]]
> {code}
> One option is to add this rewrite to the optimizer:
> {code}
> select * 
> from t a 
> join t b 
>   on coalesce(a.i, ) = coalesce(b.i, ) AND (a.i <=> b.i)
> {code}
> Acceptance criteria: joins with only null safe equality should not result in 
> a Cartesian product.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11056) Improve documentation on how to build Spark efficiently

2015-10-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11056:

Component/s: Documentation

> Improve documentation on how to build Spark efficiently
> ---
>
> Key: SPARK-11056
> URL: https://issues.apache.org/jira/browse/SPARK-11056
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
> Fix For: 1.5.2, 1.6.0
>
>
> Slow build times are a common pain point for new Spark developers.  We should 
> improve the main documentation on building Spark to describe how to make 
> building Spark less painful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11121) Incorrect TaskLocation type

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11121:


Assignee: Apache Spark

> Incorrect TaskLocation type
> ---
>
> Key: SPARK-11121
> URL: https://issues.apache.org/jira/browse/SPARK-11121
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: zhichao-li
>Assignee: Apache Spark
>Priority: Minor
>
> "toString" is the only difference between HostTaskLocation and 
> HDFSCacheTaskLocation for the moment, but it would be better to correct this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11121) Incorrect TaskLocation type

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11121:


Assignee: (was: Apache Spark)

> Incorrect TaskLocation type
> ---
>
> Key: SPARK-11121
> URL: https://issues.apache.org/jira/browse/SPARK-11121
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: zhichao-li
>Priority: Minor
>
> "toString" is the only difference between HostTaskLocation and 
> HDFSCacheTaskLocation for the moment, but it would be better to correct this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11121) Incorrect TaskLocation type

2015-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958101#comment-14958101
 ] 

Apache Spark commented on SPARK-11121:
--

User 'zhichao-li' has created a pull request for this issue:
https://github.com/apache/spark/pull/9096

> Incorrect TaskLocation type
> ---
>
> Key: SPARK-11121
> URL: https://issues.apache.org/jira/browse/SPARK-11121
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: zhichao-li
>Priority: Minor
>
> "toString" is the only difference between HostTaskLocation and 
> HDFSCacheTaskLocation for the moment, but it would be better to correct this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10943) NullType Column cannot be written to Parquet

2015-10-14 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958109#comment-14958109
 ] 

Dilip Biswal commented on SPARK-10943:
--

Hi Jason,

From the Parquet format page, here are the data types that are supported in 
Parquet:

BOOLEAN: 1 bit boolean
INT32: 32 bit signed ints
INT64: 64 bit signed ints
INT96: 96 bit signed ints
FLOAT: IEEE 32-bit floating point values
DOUBLE: IEEE 64-bit floating point values
BYTE_ARRAY: arbitrarily long byte arrays.

In your test case, you are trying to write an un-typed null value, and there is 
no mapping from this type (NullType) to the built-in types supported by Parquet.
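As a possible workaround sketch on the caller's side (not a change to Spark itself, and assuming the {{sqlContext}} from the report above), the all-null column can be given an explicit type before writing so that Parquet has a mapping for it:

{code}
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

// Same data as the report, but "comments" is a typed, nullable string column
// instead of an un-typed NullType column.
val data02 = sqlContext.sql("select 1 as id, \"cat in the hat\" as text")
  .withColumn("comments", lit(null).cast(StringType))
data02.write.parquet("/tmp/celtra-test/dataset2")
{code}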

[~marmbrus], is this a valid scenario?

Regards,
-- Dilip


> NullType Column cannot be written to Parquet
> 
>
> Key: SPARK-10943
> URL: https://issues.apache.org/jira/browse/SPARK-10943
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Jason Pohl
>
> var data02 = sqlContext.sql("select 1 as id, \"cat in the hat\" as text, null 
> as comments")
> //FAIL - Try writing a NullType column (where all the values are NULL)
> data02.write.parquet("/tmp/celtra-test/dataset2")
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 179.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 179.0 (TID 39924, 10.0.196.208): 
> org.apache.spark.sql.AnalysisException: Unsupported data type 
> StructField(comments,NullType,true).dataType;
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:524)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:92)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:92)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convert(CatalystSchemaConverter.scala:305)
>   at 
> 

[jira] [Created] (SPARK-11122) Fatal warnings in sbt are not displayed as such

2015-10-14 Thread Jakob Odersky (JIRA)
Jakob Odersky created SPARK-11122:
-

 Summary: Fatal warnings in sbt are not displayed as such
 Key: SPARK-11122
 URL: https://issues.apache.org/jira/browse/SPARK-11122
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Jakob Odersky


The sbt script treats warnings (except dependency warnings) as errors, however 
there is no visual difference between errors and fatal warnings, thus leading 
to very confusing debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11122) Fatal warnings in sbt are not displayed as such

2015-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958147#comment-14958147
 ] 

Apache Spark commented on SPARK-11122:
--

User 'jodersky' has created a pull request for this issue:
https://github.com/apache/spark/pull/9128

> Fatal warnings in sbt are not displayed as such
> ---
>
> Key: SPARK-11122
> URL: https://issues.apache.org/jira/browse/SPARK-11122
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Jakob Odersky
>
> The sbt script treats warnings (except dependency warnings) as errors, 
> however there is no visual difference between errors and fatal warnings, thus 
> leading to very confusing debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11076) Decimal Support for Ceil/Floor

2015-10-14 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-11076.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9086
[https://github.com/apache/spark/pull/9086]

> Decimal Support for Ceil/Floor
> --
>
> Key: SPARK-11076
> URL: https://issues.apache.org/jira/browse/SPARK-11076
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
> Fix For: 1.6.0
>
>
> Currently, Ceil & Floor doesn't support decimal, but Hive does.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11123) Improve HistoryServer with multithreading to replay logs

2015-10-14 Thread Xie Tingwen (JIRA)
Xie Tingwen created SPARK-11123:
---

 Summary: Improve HistoryServer with multithreading to replay logs
 Key: SPARK-11123
 URL: https://issues.apache.org/jira/browse/SPARK-11123
 Project: Spark
  Issue Type: Improvement
Reporter: Xie Tingwen


Now, with Spark 1.4, when I restart the HistoryServer it takes over 30 hours to 
replay over 40,000 log files. What's more, once it has started, replaying a 
single log may take half an hour and block other logs from being replayed. How 
about rewriting it with multiple threads to accelerate log replay?
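A minimal sketch of the proposed direction (the names here are illustrative, not the actual HistoryServer code): replay each event log on a fixed-size thread pool so that one large log does not block the rest.

{code}
import java.util.concurrent.{Executors, TimeUnit}

object ReplaySketch {
  def replayAll(logPaths: Seq[String], replayOne: String => Unit, threads: Int = 8): Unit = {
    val pool = Executors.newFixedThreadPool(threads)
    // Each log is replayed independently; a slow log no longer serializes the queue.
    logPaths.foreach { path =>
      pool.submit(new Runnable {
        override def run(): Unit = replayOne(path)
      })
    }
    pool.shutdown()
    pool.awaitTermination(Long.MaxValue, TimeUnit.NANOSECONDS)
  }
}
{code}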



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11124) JsonParser/Generator should be closed for resource reycle

2015-10-14 Thread Navis (JIRA)
Navis created SPARK-11124:
-

 Summary: JsonParser/Generator should be closed for resource reycle
 Key: SPARK-11124
 URL: https://issues.apache.org/jira/browse/SPARK-11124
 Project: Spark
  Issue Type: Bug
Reporter: Navis
Priority: Trivial


Some json parsers are not closed. parser in JacksonParser#parseJson, for 
example.
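A minimal sketch of the kind of fix being suggested, as a generic close-on-exit helper (the real change would go into call sites such as JacksonParser#parseJson):

{code}
import com.fasterxml.jackson.core.{JsonFactory, JsonParser}

object JsonResourceSketch {
  def withJsonParser[T](factory: JsonFactory, json: String)(body: JsonParser => T): T = {
    val parser = factory.createParser(json)
    try {
      body(parser)
    } finally {
      parser.close() // always release the parser's underlying buffers/streams
    }
  }
}
{code}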



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11124) JsonParser/Generator should be closed for resource recycle

2015-10-14 Thread Navis (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Navis updated SPARK-11124:
--
Summary: JsonParser/Generator should be closed for resource recycle  (was: 
JsonParser/Generator should be closed for resource reycle)

> JsonParser/Generator should be closed for resource recycle
> --
>
> Key: SPARK-11124
> URL: https://issues.apache.org/jira/browse/SPARK-11124
> Project: Spark
>  Issue Type: Bug
>Reporter: Navis
>Priority: Trivial
>
> Some json parsers are not closed. parser in JacksonParser#parseJson, for 
> example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11124) JsonParser/Generator should be closed for resource recycle

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11124:


Assignee: Apache Spark

> JsonParser/Generator should be closed for resource recycle
> --
>
> Key: SPARK-11124
> URL: https://issues.apache.org/jira/browse/SPARK-11124
> Project: Spark
>  Issue Type: Bug
>Reporter: Navis
>Assignee: Apache Spark
>Priority: Trivial
>
> Some json parsers are not closed. parser in JacksonParser#parseJson, for 
> example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11124) JsonParser/Generator should be closed for resource recycle

2015-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958293#comment-14958293
 ] 

Apache Spark commented on SPARK-11124:
--

User 'navis' has created a pull request for this issue:
https://github.com/apache/spark/pull/9130

> JsonParser/Generator should be closed for resource recycle
> --
>
> Key: SPARK-11124
> URL: https://issues.apache.org/jira/browse/SPARK-11124
> Project: Spark
>  Issue Type: Bug
>Reporter: Navis
>Priority: Trivial
>
> Some json parsers are not closed. parser in JacksonParser#parseJson, for 
> example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11124) JsonParser/Generator should be closed for resource recycle

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11124:


Assignee: (was: Apache Spark)

> JsonParser/Generator should be closed for resource recycle
> --
>
> Key: SPARK-11124
> URL: https://issues.apache.org/jira/browse/SPARK-11124
> Project: Spark
>  Issue Type: Bug
>Reporter: Navis
>Priority: Trivial
>
> Some json parsers are not closed. parser in JacksonParser#parseJson, for 
> example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11083) insert overwrite table failed when beeline reconnect

2015-10-14 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957281#comment-14957281
 ] 

Davies Liu commented on SPARK-11083:


Maybe this one: https://github.com/apache/spark/pull/8909, it uses a separate 
session for each connection.

> insert overwrite table failed when beeline reconnect
> 
>
> Key: SPARK-11083
> URL: https://issues.apache.org/jira/browse/SPARK-11083
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: Spark: master branch
> Hadoop: 2.7.1
> JDK: 1.8.0_60
>Reporter: Weizhong
> Fix For: 1.6.0
>
>
> 1. Start Thriftserver
> 2. Use beeline to connect to the thriftserver, then execute an "insert overwrite 
> table_name ..." clause -- success
> 3. Exit beeline
> 4. Reconnect to thriftserver, and then execute "insert overwrite table_name 
> ..." clause. -- failed
> {noformat}
> 15/10/13 18:44:35 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.sql.hive.client.Shim_v1_2.loadDynamicPartitions(HiveShim.scala:520)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply$mcV$sp(ClientWrapper.scala:506)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:506)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:506)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.loadDynamicPartitions(ClientWrapper.scala:505)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:225)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:276)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:58)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:58)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:144)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:129)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:739)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:224)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://9.91.8.214:9000/user/hive/warehouse/tpcds_bin_partitioned_orc_2.db/catalog_returns/.hive-staging_hive_2015-10-13_18-44-17_606_2400736035447406540-2/-ext-1/cr_returned_date=2003-08-27/part-00048
>  to destination 
> 

[jira] [Resolved] (SPARK-11083) insert overwrite table failed when beeline reconnect

2015-10-14 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-11083.

   Resolution: Fixed
 Assignee: Davies Liu
Fix Version/s: 1.6.0

> insert overwrite table failed when beeline reconnect
> 
>
> Key: SPARK-11083
> URL: https://issues.apache.org/jira/browse/SPARK-11083
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: Spark: master branch
> Hadoop: 2.7.1
> JDK: 1.8.0_60
>Reporter: Weizhong
>Assignee: Davies Liu
> Fix For: 1.6.0
>
>
> 1. Start Thriftserver
> 2. Use beeline to connect to the thriftserver, then execute an "insert overwrite 
> table_name ..." clause -- success
> 3. Exit beeline
> 4. Reconnect to thriftserver, and then execute "insert overwrite table_name 
> ..." clause. -- failed
> {noformat}
> 15/10/13 18:44:35 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.sql.hive.client.Shim_v1_2.loadDynamicPartitions(HiveShim.scala:520)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply$mcV$sp(ClientWrapper.scala:506)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:506)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:506)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.loadDynamicPartitions(ClientWrapper.scala:505)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:225)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:276)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:58)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:58)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:144)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:129)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:739)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:224)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://9.91.8.214:9000/user/hive/warehouse/tpcds_bin_partitioned_orc_2.db/catalog_returns/.hive-staging_hive_2015-10-13_18-44-17_606_2400736035447406540-2/-ext-1/cr_returned_date=2003-08-27/part-00048
>  to destination 
> 

[jira] [Updated] (SPARK-7425) spark.ml Predictor should support other numeric types for label

2015-10-14 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7425:
-
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-11107

> spark.ml Predictor should support other numeric types for label
> ---
>
> Key: SPARK-7425
> URL: https://issues.apache.org/jira/browse/SPARK-7425
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>  Labels: starter
>
> Currently, the Predictor abstraction expects the input labelCol type to be 
> DoubleType, but we should support other numeric types.  This will involve 
> updating the PredictorParams.validateAndTransformSchema method.
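For illustration, a hedged sketch of the kind of change described, with illustrative names rather than the actual PredictorParams code: accept any NumericType for the label and cast it to Double before fitting.

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, NumericType}

object LabelCastSketch {
  def castLabelToDouble(dataset: DataFrame, labelCol: String): DataFrame =
    dataset.schema(labelCol).dataType match {
      case DoubleType => dataset
      case _: NumericType =>
        dataset.withColumn(labelCol, col(labelCol).cast(DoubleType))
      case other => throw new IllegalArgumentException(
        s"Label column $labelCol must be numeric but was $other")
    }
}
{code}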



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11107) spark.ml should support more input column types: umbrella

2015-10-14 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-11107:
-

 Summary: spark.ml should support more input column types: umbrella
 Key: SPARK-11107
 URL: https://issues.apache.org/jira/browse/SPARK-11107
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley


This is an umbrella for expanding the set of data types which spark.ml Pipeline 
stages can take.  This should not involve breaking APIs, but merely involve 
slight changes such as supporting all Numeric types instead of just Double.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11103) Filter applied on Merged Parquet schema with new column fails with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-14 Thread Dominic Ricard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957317#comment-14957317
 ] 

Dominic Ricard edited comment on SPARK-11103 at 10/14/15 5:24 PM:
--

Strangely enough, this works perfectly:
{noformat}
select 
  col1
from
  `table3`
where
  (CASE WHEN col2 = 2 THEN true ELSE false END) = true;
{noformat}

And returns only the row that contains col2 = 2


was (Author: dricard):
Strangely enough, this works perfectly:
{noformat}
select col1 from `table3` where CASE WHEN col2 = 2 THEN true ELSE false END = 
true;
{noformat}

And returns only the row that contains col2 = 2

> Filter applied on Merged Parquet schema with new column fails with 
> (java.lang.IllegalArgumentException: Column [column_name] was not found in 
> schema!)
> 
>
> Key: SPARK-11103
> URL: https://issues.apache.org/jira/browse/SPARK-11103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Dominic Ricard
>
> When evolving a schema in Parquet files, Spark properly exposes all columns 
> found in the different Parquet files, but when trying to query the data, it is 
> not possible to apply a filter on a column that is not present in all files.
> To reproduce:
> *SQL:*
> {noformat}
> create table `table1` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=1/' as select 1 as `col1`;
> create table `table2` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=2/' as select 1 as `col1`, 2 as 
> `col2`;
> create table `table3` USING org.apache.spark.sql.parquet OPTIONS (path 
> "hdfs://:/path/to/table");
> select col1 from `table3` where col2 = 2;
> {noformat}
> The last select will output the following Stack Trace:
> {noformat}
> An error occurred when executing the SQL command:
> select col1 from `table3` where col2 = 2
> [Simba][HiveJDBCDriver](500051) ERROR processing query/statement. Error Code: 
> 0, SQL state: TStatus(statusCode:ERROR_STATUS, 
> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:org.apache.spark.SparkException:
>  Job aborted due to stage failure: Task 0 in stage 7212.0 failed 4 times, 
> most recent failure: Lost task 0.3 in stage 7212.0 (TID 138449, 
> 208.92.52.88): java.lang.IllegalArgumentException: Column [col2] was not 
> found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:94)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:180)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:160)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:155)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>

[jira] [Commented] (SPARK-11103) Filter applied on Merged Parquet schema with new column fails with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-14 Thread Dominic Ricard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957317#comment-14957317
 ] 

Dominic Ricard commented on SPARK-11103:


Strangely enough, this works perfectly:
{noformat}
select col1 from `table3` where CASE WHEN col2 = 2 THEN true ELSE false END = 
true;
{noformat}

And returns only the row that contains col2 = 2

> Filter applied on Merged Parquet schema with new column fails with 
> (java.lang.IllegalArgumentException: Column [column_name] was not found in 
> schema!)
> 
>
> Key: SPARK-11103
> URL: https://issues.apache.org/jira/browse/SPARK-11103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Dominic Ricard
>
> When evolving a schema in Parquet files, Spark properly exposes all columns 
> found in the different Parquet files, but when trying to query the data, it is 
> not possible to apply a filter on a column that is not present in all files.
> To reproduce:
> *SQL:*
> {noformat}
> create table `table1` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=1/' as select 1 as `col1`;
> create table `table2` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=2/' as select 1 as `col1`, 2 as 
> `col2`;
> create table `table3` USING org.apache.spark.sql.parquet OPTIONS (path 
> "hdfs://:/path/to/table");
> select col1 from `table3` where col2 = 2;
> {noformat}
> The last select will output the following Stack Trace:
> {noformat}
> An error occurred when executing the SQL command:
> select col1 from `table3` where col2 = 2
> [Simba][HiveJDBCDriver](500051) ERROR processing query/statement. Error Code: 
> 0, SQL state: TStatus(statusCode:ERROR_STATUS, 
> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:org.apache.spark.SparkException:
>  Job aborted due to stage failure: Task 0 in stage 7212.0 failed 4 times, 
> most recent failure: Lost task 0.3 in stage 7212.0 (TID 138449, 
> 208.92.52.88): java.lang.IllegalArgumentException: Column [col2] was not 
> found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:94)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:180)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:160)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:155)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at 

[jira] [Updated] (SPARK-11107) spark.ml should support more input column types: umbrella

2015-10-14 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11107:
--
Issue Type: Umbrella  (was: Improvement)

> spark.ml should support more input column types: umbrella
> -
>
> Key: SPARK-11107
> URL: https://issues.apache.org/jira/browse/SPARK-11107
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This is an umbrella for expanding the set of data types which spark.ml 
> Pipeline stages can take.  This should not involve breaking APIs, but merely 
> involve slight changes such as supporting all Numeric types instead of just 
> Double.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10873) can't sort columns on history page

2015-10-14 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957360#comment-14957360
 ] 

Thomas Graves commented on SPARK-10873:
---

[~vanzin]  I assume your comment about the backend needing to change is about 
fixing sorting and such with the existing implementation, and not with DataTables?

DataTables generally sends all the data and then does the sorting, pagination, 
etc. on the client side, and in my experience on Hadoop it has been very 
performant. The biggest issue is transferring the data if there is a lot of it, 
but unless you go server-side that is going to be an issue with anything.

I agree with you that sorting that doesn't span pages is currently confusing, 
which is why I was thinking something like DataTables, which already does it 
for us, would be easier.

> can't sort columns on history page
> --
>
> Key: SPARK-10873
> URL: https://issues.apache.org/jira/browse/SPARK-10873
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> Starting with 1.5.1 the history server page isn't allowing sorting by column



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11103) Filter applied on Merged Parquet schema with new column fails with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-14 Thread Dominic Ricard (JIRA)
Dominic Ricard created SPARK-11103:
--

 Summary: Filter applied on Merged Parquet schema with new column 
fails with (java.lang.IllegalArgumentException: Column [column_name] was not 
found in schema!)
 Key: SPARK-11103
 URL: https://issues.apache.org/jira/browse/SPARK-11103
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: Dominic Ricard


When evolving a schema in Parquet files, Spark properly exposes all columns 
found in the different Parquet files, but when trying to query the data, it is 
not possible to apply a filter on a column that is not present in all files.

To reproduce:
*SQL:*
{noformat}
create table `table1` STORED AS PARQUET LOCATION 
'hdfs://:/path/to/table/id=1/' as select 1 as `col1`;
create table `table2` STORED AS PARQUET LOCATION 
'hdfs://:/path/to/table/id=2/' as select 1 as `col1`, 2 as `col2`;
create table `table3` USING org.apache.spark.sql.parquet OPTIONS (path 
"hdfs://:/path/to/table");
select col1 from `table3` where col2 = 2;
{noformat}

The last select will output the following Stack Trace:
{noformat}
An error occurred when executing the SQL command:
select col1 from `table3` where col2 = 2

[Simba][HiveJDBCDriver](500051) ERROR processing query/statement. Error Code: 
0, SQL state: TStatus(statusCode:ERROR_STATUS, 
infoMessages:[*org.apache.hive.service.cli.HiveSQLException:org.apache.spark.SparkException:
 Job aborted due to stage failure: Task 0 in stage 7212.0 failed 4 times, most 
recent failure: Lost task 0.3 in stage 7212.0 (TID 138449, 208.92.52.88): 
java.lang.IllegalArgumentException: Column [col2] was not found in schema!
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:94)
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
at 
org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:180)
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
at 
org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
at 
org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
at 
org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
at 
org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
at 
org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:160)
at 
org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
at 
org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:155)
at 
org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace::26:25, 

[jira] [Commented] (SPARK-10873) can't sort columns on history page

2015-10-14 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958161#comment-14958161
 ] 

Marcelo Vanzin commented on SPARK-10873:


BTW this could be a good opportunity to rewrite the history server's index page 
to take advantage of the JSON api that was added in 1.4, and potentially drive 
enhancements to it.

That would also allow us to remove the hardcoded HTML from Scala classes.

> can't sort columns on history page
> --
>
> Key: SPARK-10873
> URL: https://issues.apache.org/jira/browse/SPARK-10873
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> Starting with 1.5.1 the history server page isn't allowing sorting by column



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11104) A potential deadlock in StreamingContext.stop and stopOnShutdown

2015-10-14 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-11104:


 Summary: A potential deadlock in StreamingContext.stop and 
stopOnShutdown
 Key: SPARK-11104
 URL: https://issues.apache.org/jira/browse/SPARK-11104
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Shixiong Zhu


When the shutdown hook of StreamingContext and StreamingContext.stop are 
running at the same time (e.g., press CTRL-C when StreamingContext.stop is 
running), the following deadlock may happen:

{code}
Java stack information for the threads listed above:
===
"Thread-2":
at 
org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:699)
- waiting to lock <0x0005405a1680> (a 
org.apache.spark.streaming.StreamingContext)
at 
org.apache.spark.streaming.StreamingContext.org$apache$spark$streaming$StreamingContext$$stopOnShutdown(StreamingContext.scala:729)
at 
org.apache.spark.streaming.StreamingContext$$anonfun$start$1.apply$mcV$sp(StreamingContext.scala:625)
at 
org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:266)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:236)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1697)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:236)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236)
at scala.util.Try$.apply(Try.scala:161)
at 
org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:236)
- locked <0x0005405b6a00> (a 
org.apache.spark.util.SparkShutdownHookManager)
at 
org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
at 
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
"main":
at 
org.apache.spark.util.SparkShutdownHookManager.remove(ShutdownHookManager.scala:248)
- waiting to lock <0x0005405b6a00> (a 
org.apache.spark.util.SparkShutdownHookManager)
at 
org.apache.spark.util.ShutdownHookManager$.removeShutdownHook(ShutdownHookManager.scala:199)
at 
org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:712)
- locked <0x0005405a1680> (a 
org.apache.spark.streaming.StreamingContext)
at 
org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:684)
- locked <0x0005405a1680> (a 
org.apache.spark.streaming.StreamingContext)
at 
org.apache.spark.streaming.SessionByKeyBenchmark$.main(SessionByKeyBenchmark.scala:108)
at 
org.apache.spark.streaming.SessionByKeyBenchmark.main(SessionByKeyBenchmark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:680)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}
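
The two threads above take the same pair of monitors in opposite order: the shutdown hook 
("Thread-2") holds the SparkShutdownHookManager lock inside runAll and waits for the 
StreamingContext lock inside stop, while "main" holds the StreamingContext lock inside stop 
and waits for the SparkShutdownHookManager lock inside removeShutdownHook. A stripped-down 
sketch of that lock-ordering pattern, with plain objects standing in for the real classes 
(purely illustrative, not Spark code):

{code}
object DeadlockSketch {
  private val contextLock = new Object      // stands in for the StreamingContext monitor
  private val hookManagerLock = new Object  // stands in for the SparkShutdownHookManager monitor

  def main(args: Array[String]): Unit = {
    // "main" path: stop() holds the context lock, then tries to touch the hook manager.
    val mainThread = new Thread(new Runnable {
      def run(): Unit = contextLock.synchronized {
        Thread.sleep(100)
        hookManagerLock.synchronized { println("main: removed shutdown hook") }
      }
    })
    // shutdown-hook path: runAll() holds the hook-manager lock, then calls stop() on the context.
    val hookThread = new Thread(new Runnable {
      def run(): Unit = hookManagerLock.synchronized {
        Thread.sleep(100)
        contextLock.synchronized { println("hook: stopped context") }
      }
    })
    mainThread.start(); hookThread.start()
    mainThread.join(); hookThread.join()  // hangs once the two threads interleave as in the dump
  }
}
{code}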






[jira] [Commented] (SPARK-11104) A potential deadlock in StreamingContext.stop and stopOnShutdown

2015-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956974#comment-14956974
 ] 

Apache Spark commented on SPARK-11104:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/9116

> A potential deadlock in StreamingContext.stop and stopOnShutdown
> 
>
> Key: SPARK-11104
> URL: https://issues.apache.org/jira/browse/SPARK-11104
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Shixiong Zhu
>
> When the shutdown hook of StreamingContext and StreamingContext.stop are 
> running at the same time (e.g., press CTRL-C when StreamingContext.stop is 
> running), the following deadlock may happen:
> {code}
> Java stack information for the threads listed above:
> ===
> "Thread-2":
>   at 
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:699)
>   - waiting to lock <0x0005405a1680> (a 
> org.apache.spark.streaming.StreamingContext)
>   at 
> org.apache.spark.streaming.StreamingContext.org$apache$spark$streaming$StreamingContext$$stopOnShutdown(StreamingContext.scala:729)
>   at 
> org.apache.spark.streaming.StreamingContext$$anonfun$start$1.apply$mcV$sp(StreamingContext.scala:625)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:266)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1697)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:236)
>   - locked <0x0005405b6a00> (a 
> org.apache.spark.util.SparkShutdownHookManager)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
> "main":
>   at 
> org.apache.spark.util.SparkShutdownHookManager.remove(ShutdownHookManager.scala:248)
>   - waiting to lock <0x0005405b6a00> (a 
> org.apache.spark.util.SparkShutdownHookManager)
>   at 
> org.apache.spark.util.ShutdownHookManager$.removeShutdownHook(ShutdownHookManager.scala:199)
>   at 
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:712)
>   - locked <0x0005405a1680> (a 
> org.apache.spark.streaming.StreamingContext)
>   at 
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:684)
>   - locked <0x0005405a1680> (a 
> org.apache.spark.streaming.StreamingContext)
>   at 
> org.apache.spark.streaming.SessionByKeyBenchmark$.main(SessionByKeyBenchmark.scala:108)
>   at 
> org.apache.spark.streaming.SessionByKeyBenchmark.main(SessionByKeyBenchmark.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:680)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}






[jira] [Assigned] (SPARK-11104) A potential deadlock in StreamingContext.stop and stopOnShutdown

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11104:


Assignee: (was: Apache Spark)

> A potential deadlock in StreamingContext.stop and stopOnShutdown
> 
>
> Key: SPARK-11104
> URL: https://issues.apache.org/jira/browse/SPARK-11104
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Shixiong Zhu
>
> When the shutdown hook of StreamingContext and StreamingContext.stop are 
> running at the same time (e.g., press CTRL-C when StreamingContext.stop is 
> running), the following deadlock may happen:
> {code}
> Java stack information for the threads listed above:
> ===
> "Thread-2":
>   at 
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:699)
>   - waiting to lock <0x0005405a1680> (a 
> org.apache.spark.streaming.StreamingContext)
>   at 
> org.apache.spark.streaming.StreamingContext.org$apache$spark$streaming$StreamingContext$$stopOnShutdown(StreamingContext.scala:729)
>   at 
> org.apache.spark.streaming.StreamingContext$$anonfun$start$1.apply$mcV$sp(StreamingContext.scala:625)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:266)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1697)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:236)
>   - locked <0x0005405b6a00> (a 
> org.apache.spark.util.SparkShutdownHookManager)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
> "main":
>   at 
> org.apache.spark.util.SparkShutdownHookManager.remove(ShutdownHookManager.scala:248)
>   - waiting to lock <0x0005405b6a00> (a 
> org.apache.spark.util.SparkShutdownHookManager)
>   at 
> org.apache.spark.util.ShutdownHookManager$.removeShutdownHook(ShutdownHookManager.scala:199)
>   at 
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:712)
>   - locked <0x0005405a1680> (a 
> org.apache.spark.streaming.StreamingContext)
>   at 
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:684)
>   - locked <0x0005405a1680> (a 
> org.apache.spark.streaming.StreamingContext)
>   at 
> org.apache.spark.streaming.SessionByKeyBenchmark$.main(SessionByKeyBenchmark.scala:108)
>   at 
> org.apache.spark.streaming.SessionByKeyBenchmark.main(SessionByKeyBenchmark.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:680)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}






[jira] [Assigned] (SPARK-11104) A potential deadlock in StreamingContext.stop and stopOnShutdown

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11104:


Assignee: Apache Spark

> A potential deadlock in StreamingContext.stop and stopOnShutdown
> 
>
> Key: SPARK-11104
> URL: https://issues.apache.org/jira/browse/SPARK-11104
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> When the shutdown hook of StreamingContext and StreamingContext.stop are 
> running at the same time (e.g., press CTRL-C when StreamingContext.stop is 
> running), the following deadlock may happen:
> {code}
> Java stack information for the threads listed above:
> ===
> "Thread-2":
>   at 
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:699)
>   - waiting to lock <0x0005405a1680> (a 
> org.apache.spark.streaming.StreamingContext)
>   at 
> org.apache.spark.streaming.StreamingContext.org$apache$spark$streaming$StreamingContext$$stopOnShutdown(StreamingContext.scala:729)
>   at 
> org.apache.spark.streaming.StreamingContext$$anonfun$start$1.apply$mcV$sp(StreamingContext.scala:625)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:266)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1697)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:236)
>   - locked <0x0005405b6a00> (a 
> org.apache.spark.util.SparkShutdownHookManager)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
> "main":
>   at 
> org.apache.spark.util.SparkShutdownHookManager.remove(ShutdownHookManager.scala:248)
>   - waiting to lock <0x0005405b6a00> (a 
> org.apache.spark.util.SparkShutdownHookManager)
>   at 
> org.apache.spark.util.ShutdownHookManager$.removeShutdownHook(ShutdownHookManager.scala:199)
>   at 
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:712)
>   - locked <0x0005405a1680> (a 
> org.apache.spark.streaming.StreamingContext)
>   at 
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:684)
>   - locked <0x0005405a1680> (a 
> org.apache.spark.streaming.StreamingContext)
>   at 
> org.apache.spark.streaming.SessionByKeyBenchmark$.main(SessionByKeyBenchmark.scala:108)
>   at 
> org.apache.spark.streaming.SessionByKeyBenchmark.main(SessionByKeyBenchmark.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:680)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}






[jira] [Commented] (SPARK-11103) Filter applied on Merged Parquet schema with new column fails with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-14 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956988#comment-14956988
 ] 

Hyukjin Kwon commented on SPARK-11103:
--

I tested this case. The problem is that Parquet filters are pushed down 
regardless of the schema of each split (or rather, of each file).

Should predicate pushdown be prevented when the mergeSchema option is used?
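
In the meantime, the report above can likely be worked around by turning Parquet filter 
pushdown off, so the predicate is evaluated by Spark after the merged schema has been 
applied. A sketch of the workaround (the path is a placeholder, and this is an assumption 
on my part rather than a confirmed fix):

{code}
// Disable Parquet predicate pushdown before running the query.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")

// Read with schema merging, then filter on the column that is missing from some files.
val df = sqlContext.read
  .option("mergeSchema", "true")
  .parquet("/path/to/table")

df.filter("col2 = 2").select("col1").show()
{code}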

> Filter applied on Merged Parquet schema with new column fails with 
> (java.lang.IllegalArgumentException: Column [column_name] was not found in 
> schema!)
> 
>
> Key: SPARK-11103
> URL: https://issues.apache.org/jira/browse/SPARK-11103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Dominic Ricard
>
> When evolving a schema in Parquet files, Spark properly exposes all columns 
> found in the different Parquet files, but when trying to query the data, it is 
> not possible to apply a filter on a column that is not present in all files.
> To reproduce:
> *SQL:*
> {noformat}
> create table `table1` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=1/' as select 1 as `col1`;
> create table `table2` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=2/' as select 1 as `col1`, 2 as 
> `col2`;
> create table `table3` USING org.apache.spark.sql.parquet OPTIONS (path 
> "hdfs://:/path/to/table");
> select col1 from `table3` where col2 = 2;
> {noformat}
> The last select will output the following Stack Trace:
> {noformat}
> An error occurred when executing the SQL command:
> select col1 from `table3` where col2 = 2
> [Simba][HiveJDBCDriver](500051) ERROR processing query/statement. Error Code: 
> 0, SQL state: TStatus(statusCode:ERROR_STATUS, 
> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:org.apache.spark.SparkException:
>  Job aborted due to stage failure: Task 0 in stage 7212.0 failed 4 times, 
> most recent failure: Lost task 0.3 in stage 7212.0 (TID 138449, 
> 208.92.52.88): java.lang.IllegalArgumentException: Column [col2] was not 
> found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:94)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:180)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:160)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:155)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at 

[jira] [Resolved] (SPARK-10104) Consolidate different forms of table identifiers

2015-10-14 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10104.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8453
[https://github.com/apache/spark/pull/8453]

> Consolidate different forms of table identifiers
> 
>
> Key: SPARK-10104
> URL: https://issues.apache.org/jira/browse/SPARK-10104
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>
> Right now, we have QualifiedTableName, TableIdentifier, and Seq[String] to 
> represent table identifiers. We should have only one form, and TableIdentifier 
> looks like the best one because it provides methods to get the table name and 
> the database name and to return the unquoted and quoted strings. 
> There will be TODOs referencing "SPARK-10104"; those places need to be 
> updated.
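
For readers following along, a rough sketch of the consolidated form described above (the 
field and method names approximate org.apache.spark.sql.catalyst.TableIdentifier and are 
not copied from the actual class):

{code}
case class TableIdentifier(table: String, database: Option[String] = None) {
  def unquotedString: String = (database.toSeq :+ table).mkString(".")
  def quotedString: String = (database.toSeq :+ table).map(n => s"`$n`").mkString(".")
}

// TableIdentifier("employees", Some("db")).quotedString  returns "`db`.`employees`"
{code}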






[jira] [Resolved] (SPARK-11113) Remove DeveloperApi annotation from private classes

2015-10-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-11113.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

> Remove DeveloperApi annotation from private classes
> ---
>
> Key: SPARK-11113
> URL: https://issues.apache.org/jira/browse/SPARK-11113
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.6.0
>
>
> For a variety of reasons, we tagged a bunch of internal classes in the 
> execution package in SQL as DeveloperApi.






[jira] [Updated] (SPARK-11051) NullPointerException when action called on localCheckpointed RDD (that was checkpointed before)

2015-10-14 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-11051:
--
Assignee: Liang-Chi Hsieh  (was: Andrew Or)

> NullPointerException when action called on localCheckpointed RDD (that was 
> checkpointed before)
> ---
>
> Key: SPARK-11051
> URL: https://issues.apache.org/jira/browse/SPARK-11051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
> Environment: Spark version 1.6.0-SNAPSHOT built from the sources as 
> of today - Oct, 10th
>Reporter: Jacek Laskowski
>Assignee: Liang-Chi Hsieh
>Priority: Critical
>
> While toying with {{RDD.checkpoint}} and {{RDD.localCheckpoint}} methods, the 
> following NullPointerException was thrown:
> {code}
> scala> lines.count
> java.lang.NullPointerException
>   at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1587)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1927)
>   at org.apache.spark.rdd.RDD.count(RDD.scala:1115)
>   ... 48 elided
> {code}
> To reproduce the issue do the following:
> {code}
> $ ./bin/spark-shell
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 1.6.0-SNAPSHOT
>       /_/
> Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val lines = sc.textFile("README.md")
> lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at 
> <console>:24
> scala> sc.setCheckpointDir("checkpoints")
> scala> lines.checkpoint
> scala> lines.count
> res2: Long = 98
> scala> lines.localCheckpoint
> 15/10/10 22:59:20 WARN MapPartitionsRDD: RDD was already marked for reliable 
> checkpointing: overriding with local checkpoint.
> res4: lines.type = MapPartitionsRDD[1] at textFile at <console>:24
> scala> lines.count
> java.lang.NullPointerException
>   at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1587)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1927)
>   at org.apache.spark.rdd.RDD.count(RDD.scala:1115)
>   ... 48 elided
> {code}






[jira] [Assigned] (SPARK-11051) NullPointerException when action called on localCheckpointed RDD (that was checkpointed before)

2015-10-14 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reassigned SPARK-11051:
-

Assignee: Andrew Or

> NullPointerException when action called on localCheckpointed RDD (that was 
> checkpointed before)
> ---
>
> Key: SPARK-11051
> URL: https://issues.apache.org/jira/browse/SPARK-11051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
> Environment: Spark version 1.6.0-SNAPSHOT built from the sources as 
> of today - Oct, 10th
>Reporter: Jacek Laskowski
>Assignee: Andrew Or
>Priority: Critical
>
> While toying with {{RDD.checkpoint}} and {{RDD.localCheckpoint}} methods, the 
> following NullPointerException was thrown:
> {code}
> scala> lines.count
> java.lang.NullPointerException
>   at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1587)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1927)
>   at org.apache.spark.rdd.RDD.count(RDD.scala:1115)
>   ... 48 elided
> {code}
> To reproduce the issue do the following:
> {code}
> $ ./bin/spark-shell
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 1.6.0-SNAPSHOT
>       /_/
> Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val lines = sc.textFile("README.md")
> lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at 
> <console>:24
> scala> sc.setCheckpointDir("checkpoints")
> scala> lines.checkpoint
> scala> lines.count
> res2: Long = 98
> scala> lines.localCheckpoint
> 15/10/10 22:59:20 WARN MapPartitionsRDD: RDD was already marked for reliable 
> checkpointing: overriding with local checkpoint.
> res4: lines.type = MapPartitionsRDD[1] at textFile at <console>:24
> scala> lines.count
> java.lang.NullPointerException
>   at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1587)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1927)
>   at org.apache.spark.rdd.RDD.count(RDD.scala:1115)
>   ... 48 elided
> {code}






[jira] [Created] (SPARK-11119) cleanup unsafe array and map

2015-10-14 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-11119:
---

 Summary: cleanup unsafe array and map
 Key: SPARK-11119
 URL: https://issues.apache.org/jira/browse/SPARK-11119
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan









[jira] [Created] (SPARK-11120) maxNumExecutorFailures defaults to 3 under dynamic allocation

2015-10-14 Thread Ryan Williams (JIRA)
Ryan Williams created SPARK-11120:
-

 Summary: maxNumExecutorFailures defaults to 3 under dynamic 
allocation
 Key: SPARK-11120
 URL: https://issues.apache.org/jira/browse/SPARK-11120
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.1
Reporter: Ryan Williams
Priority: Minor


With dynamic allocation, the {{spark.executor.instances}} config is 0, meaning 
[this 
line|https://github.com/apache/spark/blob/4ace4f8a9c91beb21a0077e12b75637a4560a542/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L66-L68]
 ends up with {{maxNumExecutorFailures}} equal to {{3}}. For me this has 
resulted in large dynamic-allocation jobs with hundreds of executors dying 
because one bad node serially failed every executor allocated on it.

I think using {{spark.dynamicAllocation.maxExecutors}} would make the most 
sense in this case; I frequently run shells that vary between 1 and 1000 
executors, so using {{s.dA.minExecutors}} or {{s.dA.initialExecutors}} would 
still leave me with a value that is too low to be useful.
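
A rough paraphrase of the default computation and of the proposed change (written against a 
plain config map for illustration, not the actual ApplicationMaster code); note that 
{{spark.yarn.max.executor.failures}} can also be set explicitly as an immediate workaround:

{code}
// Sketch: how maxNumExecutorFailures could be derived when dynamic allocation is on.
def maxNumExecutorFailures(conf: Map[String, String]): Int = {
  val dynamic = conf.getOrElse("spark.dynamicAllocation.enabled", "false").toBoolean
  val effectiveExecutors =
    if (dynamic) conf.getOrElse("spark.dynamicAllocation.maxExecutors", "0").toInt  // proposed
    else conf.getOrElse("spark.executor.instances", "0").toInt                      // current source of the default
  // An explicit setting wins; otherwise fall back to max(2 * executors, 3).
  conf.get("spark.yarn.max.executor.failures").map(_.toInt)
    .getOrElse(math.max(2 * effectiveExecutors, 3))
}
{code}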







[jira] [Commented] (SPARK-11098) RPC message ordering is not guaranteed

2015-10-14 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958309#comment-14958309
 ] 

Marcelo Vanzin commented on SPARK-11098:


Never mind (I spoke too soon). After thinking a little more: while the above is 
a potential problem, it's unrelated to this bug (and might not be fixed by 
fixing this bug), so I'll file a separate issue for that fix.

> RPC message ordering is not guaranteed
> --
>
> Key: SPARK-11098
> URL: https://issues.apache.org/jira/browse/SPARK-11098
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>
> NettyRpcEnv doesn't guarantee message delivery order, since multiple threads 
> send messages from the clientConnectionExecutor thread pool. We should fix 
> that.
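
One common way to get per-peer ordering (an illustrative sketch only, not necessarily the 
approach an eventual fix will take) is to funnel all outbound sends for a given remote 
address through a single-threaded executor, so messages to the same endpoint are serialized 
while different endpoints still proceed in parallel:

{code}
import java.util.concurrent.{ConcurrentHashMap, ExecutorService, Executors}

class OrderedSender {
  private val senders = new ConcurrentHashMap[String, ExecutorService]()

  private def senderFor(address: String): ExecutorService =
    senders.computeIfAbsent(address, new java.util.function.Function[String, ExecutorService] {
      def apply(a: String): ExecutorService = Executors.newSingleThreadExecutor()
    })

  // doSend performs the actual network write; calls for the same address run in submission order.
  def send(address: String, message: Array[Byte])(doSend: Array[Byte] => Unit): Unit =
    senderFor(address).submit(new Runnable {
      def run(): Unit = doSend(message)
    })
}
{code}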





