[jira] [Assigned] (SPARK-13529) Move network/* modules into common/network-*

2016-02-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13529:


Assignee: Reynold Xin  (was: Apache Spark)

> Move network/* modules into common/network-*
> 
>
> Key: SPARK-13529
> URL: https://issues.apache.org/jira/browse/SPARK-13529
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This removes one top level folder.






[jira] [Assigned] (SPARK-13529) Move network/* modules into common/network-*

2016-02-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13529:


Assignee: Apache Spark  (was: Reynold Xin)

> Move network/* modules into common/network-*
> 
>
> Key: SPARK-13529
> URL: https://issues.apache.org/jira/browse/SPARK-13529
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> This removes one top level folder.






[jira] [Commented] (SPARK-13529) Move network/* modules into common/network-*

2016-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170401#comment-15170401
 ] 

Apache Spark commented on SPARK-13529:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/11409

> Move network/* modules into common/network-*
> 
>
> Key: SPARK-13529
> URL: https://issues.apache.org/jira/browse/SPARK-13529
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This removes one top level folder.






[jira] [Created] (SPARK-13529) Move network/* modules into common/network-*

2016-02-26 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-13529:
---

 Summary: Move network/* modules into common/network-*
 Key: SPARK-13529
 URL: https://issues.apache.org/jira/browse/SPARK-13529
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Reporter: Reynold Xin
Assignee: Reynold Xin


This removes one top level folder.







[jira] [Resolved] (SPARK-13518) Enable vectorized parquet reader by default

2016-02-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13518.
-
   Resolution: Fixed
 Assignee: Nong Li
Fix Version/s: 2.0.0

> Enable vectorized parquet reader by default
> ---
>
> Key: SPARK-13518
> URL: https://issues.apache.org/jira/browse/SPARK-13518
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nong Li
>Assignee: Nong Li
> Fix For: 2.0.0
>
>
> This feature was disabled by default, but the implementation should now be 
> complete and it can be enabled.
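For reference, a minimal sketch of flipping the switch by hand. The config key below and the Parquet path are assumptions for illustration, not taken from this ticket:
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("vectorized-check").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Assumed key; set to "false" to fall back to the row-based reader for comparison.
sqlContext.setConf("spark.sql.parquet.enableVectorizedReader", "true")
sqlContext.read.parquet("/tmp/parquet_table").show()
{code}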






[jira] [Resolved] (SPARK-13521) Remove reference to Tachyon in cluster & release script

2016-02-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13521.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Remove reference to Tachyon in cluster & release script
> ---
>
> Key: SPARK-13521
> URL: https://issues.apache.org/jira/browse/SPARK-13521
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> We provide a very limited set of cluster management scripts in Spark for 
> Tachyon, although Tachyon itself provides a much better version of them. Given 
> that Spark users can now simply use Tachyon as a normal file system without 
> extensive configuration, we can remove these management capabilities to 
> simplify the Spark bash scripts.






[jira] [Commented] (SPARK-13528) Make the short names of compression codecs consistent in spark

2016-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170364#comment-15170364
 ] 

Apache Spark commented on SPARK-13528:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/11408

> Make the short names of compression codecs consistent in spark
> --
>
> Key: SPARK-13528
> URL: https://issues.apache.org/jira/browse/SPARK-13528
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>
> Add a common utility code to map short names to fully-qualified codec names.






[jira] [Assigned] (SPARK-13528) Make the short names of compression codecs consistent in spark

2016-02-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13528:


Assignee: (was: Apache Spark)

> Make the short names of compression codecs consistent in spark
> --
>
> Key: SPARK-13528
> URL: https://issues.apache.org/jira/browse/SPARK-13528
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>
> Add a common utility code to map short names to fully-qualified codec names.






[jira] [Assigned] (SPARK-13528) Make the short names of compression codecs consistent in spark

2016-02-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13528:


Assignee: Apache Spark

> Make the short names of compression codecs consistent in spark
> --
>
> Key: SPARK-13528
> URL: https://issues.apache.org/jira/browse/SPARK-13528
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>
> Add a common utility code to map short names to fully-qualified codec names.






[jira] [Created] (SPARK-13528) Make the short names of compression codecs consistent in spark

2016-02-26 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-13528:


 Summary: Make the short names of compression codecs consistent in 
spark
 Key: SPARK-13528
 URL: https://issues.apache.org/jira/browse/SPARK-13528
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 1.6.0
Reporter: Takeshi Yamamuro


Add a common utility code to map short names to fully-qualified codec names.
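A minimal sketch of the kind of shared utility this proposes; the object name and map contents are illustrative, not Spark's actual implementation:
{code}
object CompressionCodecNames {
  // Short names users pass in, mapped to fully-qualified Hadoop codec class names.
  private val shortNameToClass = Map(
    "bzip2" -> "org.apache.hadoop.io.compress.BZip2Codec",
    "gzip" -> "org.apache.hadoop.io.compress.GzipCodec",
    "lz4" -> "org.apache.hadoop.io.compress.Lz4Codec",
    "snappy" -> "org.apache.hadoop.io.compress.SnappyCodec")

  // Resolve a short name ("snappy") to its class name; pass fully-qualified names through.
  def resolve(name: String): String =
    shortNameToClass.getOrElse(name.toLowerCase(java.util.Locale.ROOT), name)
}

// e.g. CompressionCodecNames.resolve("gzip") == "org.apache.hadoop.io.compress.GzipCodec"
{code}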






[jira] [Updated] (SPARK-13441) NullPointerException when either HADOOP_CONF_DIR or YARN_CONF_DIR is not readable

2016-02-26 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-13441:
---
Fix Version/s: (was: 1.6.2)
   1.6.1

> NullPointerException when either HADOOP_CONF_DIR or YARN_CONF_DIR is not 
> readable
> -
>
> Key: SPARK-13441
> URL: https://issues.apache.org/jira/browse/SPARK-13441
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.4.1, 1.5.1, 1.6.0
>Reporter: Terence Yim
>Assignee: Terence Yim
>Priority: Minor
> Fix For: 1.6.1, 2.0.0
>
>
> An NPE is thrown from the YARN Client.scala because {{File.listFiles()}} can 
> return null on a directory that it doesn't have permission to list. This is 
> the code fragment in question:
> {noformat}
> // In org/apache/spark/deploy/yarn/Client.scala
> Seq("HADOOP_CONF_DIR", "YARN_CONF_DIR").foreach { envKey =>
>   sys.env.get(envKey).foreach { path =>
> val dir = new File(path)
> if (dir.isDirectory()) {
>   // dir.listFiles() can return null
>   dir.listFiles().foreach { file =>
> if (file.isFile && !hadoopConfFiles.contains(file.getName())) {
>   hadoopConfFiles(file.getName()) = file
> }
>   }
> }
>   }
> }
> {noformat}
> To reproduce, simply do:
> {noformat}
> sudo mkdir /tmp/conf
> sudo chmod 700 /tmp/conf
> export HADOOP_CONF_DIR=/etc/hadoop/conf
> export YARN_CONF_DIR=/tmp/conf
> spark-submit --master yarn-client SimpleApp.py
> {noformat}
> It fails on any Spark app. Though not important, the SimpleApp.py I used 
> looks like this:
> {noformat}
> from pyspark import SparkContext
> sc = SparkContext(None, "Simple App")
> data = [1, 2, 3, 4, 5]
> distData = sc.parallelize(data)
> total = distData.reduce(lambda a, b: a + b)
> print("Total: %i" % total)
> {noformat}
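A minimal sketch of a null-safe variant of the fragment above; the hadoopConfFiles map is declared here only to make the snippet self-contained (in Client.scala it already exists):
{code}
import java.io.File
import scala.collection.mutable

val hadoopConfFiles = mutable.HashMap.empty[String, File]
Seq("HADOOP_CONF_DIR", "YARN_CONF_DIR").foreach { envKey =>
  sys.env.get(envKey).foreach { path =>
    val dir = new File(path)
    if (dir.isDirectory()) {
      // listFiles() returns null for a directory that cannot be listed,
      // so wrap it in Option instead of calling foreach on it directly.
      Option(dir.listFiles()).getOrElse(Array.empty[File]).foreach { file =>
        if (file.isFile && !hadoopConfFiles.contains(file.getName())) {
          hadoopConfFiles(file.getName()) = file
        }
      }
    }
  }
}
{code}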






[jira] [Updated] (SPARK-12874) ML StringIndexer does not protect itself from column name duplication

2016-02-26 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-12874:
---
Fix Version/s: (was: 1.6.2)
   1.6.1

> ML StringIndexer does not protect itself from column name duplication
> -
>
> Key: SPARK-12874
> URL: https://issues.apache.org/jira/browse/SPARK-12874
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Wojciech Jurczyk
>Assignee: Yu Ishikawa
> Fix For: 1.6.1, 2.0.0
>
>
> StringIndexerModel, when performing transform(), does not check the schema of 
> the input DataFrame. Because of that, it is possible to create a DataFrame 
> containing columns with duplicated names.
> This issue is similar to SPARK-12711. StringIndexer could make use of 
> transformSchema to ensure that the input DataFrame schema is correct with 
> respect to the parameters' values.
> Please confirm. Then, I'll prepare a PR to resolve the bug.
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L147
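A hypothetical reproduction sketch (column names and values are illustrative, and an existing sqlContext is assumed): setting the output column to a name that already exists in the input can yield a DataFrame with duplicated column names.
{code}
import org.apache.spark.ml.feature.StringIndexer

val df = sqlContext.createDataFrame(Seq(
  (0, "a"), (1, "b"), (2, "a")
)).toDF("label", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("label")  // clashes with an existing column

// Without a transformSchema check, this can produce two "label" columns.
indexer.fit(df).transform(df).printSchema()
{code}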






[jira] [Updated] (SPARK-13355) Replace GraphImpl.fromExistingRDDs by Graph

2016-02-26 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-13355:
---
Fix Version/s: (was: 1.6.2)
   1.6.1

> Replace GraphImpl.fromExistingRDDs by Graph
> ---
>
> Key: SPARK-13355
> URL: https://issues.apache.org/jira/browse/SPARK-13355
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.0, 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.4.2, 1.5.3, 1.6.1, 2.0.0
>
>
> `GraphImpl.fromExistingRDDs` expects a preprocessed vertex RDD as input. We 
> call it in LDA without validating this requirement, so it might introduce 
> errors. Replacing it by `Graph.apply` would be safer and more proper because 
> it is a public API. 
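A minimal sketch of the proposed replacement, with illustrative RDD contents and an existing SparkContext (sc) assumed:
{code}
import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1.0)))

// Instead of the internal GraphImpl.fromExistingRDDs(vertices, edges), build the
// graph through the public constructor, which does not assume preprocessed input:
val graph = Graph(vertices, edges)
{code}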






[jira] [Updated] (SPARK-12746) ArrayType(_, true) should also accept ArrayType(_, false)

2016-02-26 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-12746:
---
Fix Version/s: (was: 1.6.2)
   1.6.1

> ArrayType(_, true) should also accept ArrayType(_, false)
> -
>
> Key: SPARK-12746
> URL: https://issues.apache.org/jira/browse/SPARK-12746
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 1.6.0
>Reporter: Earthson Lu
>Assignee: Earthson Lu
> Fix For: 1.6.1, 2.0.0
>
>
> CountVectorizer has a schema check that requires ArrayType(StringType, true). 
> ArrayType(StringType, false) is just a special case of ArrayType(StringType, 
> true), but it does not pass this type check.
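A hedged sketch of a check that accepts both containsNull settings, which is what this asks for; the helper name is illustrative:
{code}
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

def isStringArray(dt: DataType): Boolean = dt match {
  // Match on the element type only, so ArrayType(StringType, false) is
  // accepted as a special case of ArrayType(StringType, true).
  case ArrayType(StringType, _) => true
  case _ => false
}

println(isStringArray(ArrayType(StringType, containsNull = false)))  // true
{code}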






[jira] [Updated] (SPARK-13439) Document that spark.mesos.uris is comma-separated

2016-02-26 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-13439:
---
Fix Version/s: (was: 1.6.2)
   1.6.1

> Document that spark.mesos.uris is comma-separated
> -
>
> Key: SPARK-13439
> URL: https://issues.apache.org/jira/browse/SPARK-13439
> Project: Spark
>  Issue Type: Documentation
>  Components: Mesos
>Reporter: Michael Gummelt
>Assignee: Michael Gummelt
>Priority: Trivial
> Fix For: 1.6.1, 2.0.0
>
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L346
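An illustrative example of the comma-separated form being documented (the URIs are placeholders):
{code}
val conf = new org.apache.spark.SparkConf()
  .set("spark.mesos.uris", "http://example.com/app.tar.gz,http://example.com/extra.conf")
{code}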






[jira] [Updated] (SPARK-13390) Java Spark createDataFrame with List parameter bug

2016-02-26 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-13390:
---
Fix Version/s: (was: 1.6.2)
   1.6.1

> Java Spark createDataFrame with List parameter bug
> --
>
> Key: SPARK-13390
> URL: https://issues.apache.org/jira/browse/SPARK-13390
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: Java spark, Linux
>Reporter: mike niemaz
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 1.6.1
>
>
> I noticed the following bug while testing the DataFrame SQL join capabilities.
> Instructions to reproduce it:
> - Read a text file from the local file system using the 
> JavaSparkContext#textFile method
> - Create a list of related custom objects based on the previously created 
> JavaRDD, using the map function
> - Create a DataFrame using the SQLContext createDataFrame(java.util.List, 
> Class) method
> - Count the DataFrame elements using the DataFrame#count method
> It crashes with the following stack trace error:
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[count#7L])
> +- TungstenExchange SinglePartition, None
>+- TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#10L])
>   +- LocalTableScan [[empty row],[empty row],[empty row],[empty 
> row],[empty row],[empty row],[empty row],[empty row],[empty row],[empty row]]
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:80)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:166)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1538)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1538)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2125)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1537)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1544)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$count$1.apply(DataFrame.scala:1554)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$count$1.apply(DataFrame.scala:1553)
>   at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2138)
>   at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1553)
>   at injection.EMATests.joinTest1(EMATests.java:259)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBe

[jira] [Updated] (SPARK-13253) Error aliasing array columns.

2016-02-26 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-13253:
---
Target Version/s: 2.0.0, 1.6.2  (was: 1.6.1, 2.0.0)

> Error aliasing array columns.
> -
>
> Key: SPARK-13253
> URL: https://issues.apache.org/jira/browse/SPARK-13253
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Rakesh Chalasani
>
> Getting an "UnsupportedOperationException" when trying to alias an
> array column. 
> The issue appears to be in "toString" on Column: the "CreateArray" expression's 
> dataType checks the nullability of its children, while aliasing creates a 
> PrettyAttribute that does not implement nullability.
> Code to reproduce the error:
> {code}
> import org.apache.spark.sql.SQLContext 
> val sqlContext = new SQLContext(sparkContext) 
> import sqlContext.implicits._ 
> import org.apache.spark.sql.functions 
> case class Test(a:Int, b:Int) 
> val data = sparkContext.parallelize(Array.range(0, 10).map(x => Test(x, 
> x+1))) 
> val df = data.toDF() 
> val arrayCol = functions.array(df("a"), df("b")).as("arrayCol")
> arrayCol.toString()
> {code}
> Error message:
> {code}
> java.lang.UnsupportedOperationException
>   at 
> org.apache.spark.sql.catalyst.expressions.PrettyAttribute.nullable(namedExpressions.scala:289)
>   at 
> org.apache.spark.sql.catalyst.expressions.CreateArray$$anonfun$dataType$3.apply(complexTypeCreator.scala:40)
>   at 
> org.apache.spark.sql.catalyst.expressions.CreateArray$$anonfun$dataType$3.apply(complexTypeCreator.scala:40)
>   at 
> scala.collection.IndexedSeqOptimized$$anonfun$exists$1.apply(IndexedSeqOptimized.scala:40)
>   at 
> scala.collection.IndexedSeqOptimized$$anonfun$exists$1.apply(IndexedSeqOptimized.scala:40)
>   at 
> scala.collection.IndexedSeqOptimized$class.segmentLength(IndexedSeqOptimized.scala:189)
>   at 
> scala.collection.mutable.ArrayBuffer.segmentLength(ArrayBuffer.scala:47)
>   at scala.collection.GenSeqLike$class.prefixLength(GenSeqLike.scala:92)
>   at scala.collection.AbstractSeq.prefixLength(Seq.scala:40)
>   at 
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:40)
>   at scala.collection.mutable.ArrayBuffer.exists(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.sql.catalyst.expressions.CreateArray.dataType(complexTypeCreator.scala:40)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.dataType(namedExpressions.scala:136)
>   at 
> org.apache.spark.sql.catalyst.expressions.NamedExpression$class.typeSuffix(namedExpressions.scala:84)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.typeSuffix(namedExpressions.scala:120)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.toString(namedExpressions.scala:155)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.prettyString(Expression.scala:207)
>   at org.apache.spark.sql.Column.toString(Column.scala:138)
>   at java.lang.String.valueOf(String.java:2994)
>   at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:331)
>   at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
>   at .(:20)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> {code}






[jira] [Updated] (SPARK-12988) Can't drop columns that contain dots

2016-02-26 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-12988:
---
Target Version/s: 2.0.0, 1.6.2  (was: 1.6.1, 2.0.0)

> Can't drop columns that contain dots
> 
>
> Key: SPARK-12988
> URL: https://issues.apache.org/jira/browse/SPARK-12988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> Neither of these works:
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("a.c").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("`a.c`").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> Given that you can't use drop to drop subfields, it seems to me that we 
> should treat the column name literally (i.e. as though it is wrapped in 
> backticks).
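A hedged workaround sketch until drop() handles dotted names, assuming an existing sqlContext: select every column except the one to remove, quoting each name in backticks.
{code}
val df = sqlContext.createDataFrame(Seq((1, 1))).toDF("a_b", "a.c")
val dropped = df.select(df.columns.filter(_ != "a.c").map(c => df.col(s"`$c`")): _*)
dropped.printSchema()  // only a_b remains
{code}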






[jira] [Updated] (SPARK-13207) _SUCCESS should not break partition discovery

2016-02-26 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-13207:
---
Target Version/s: 2.0.0, 1.6.2  (was: 1.6.1, 2.0.0)

> _SUCCESS should not break partition discovery
> -
>
> Key: SPARK-13207
> URL: https://issues.apache.org/jira/browse/SPARK-13207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> Partition discovery will fail in the following case:
> {code}
> test("_SUCCESS should not break partitioning discovery") {
> withTempPath { dir =>
>   val tablePath = new File(dir, "table")
>   val df = (1 to 3).map(i => (i, i, i, i)).toDF("a", "b", "c", "d")
>   df.write
> .format("parquet")
> .partitionBy("b", "c", "d")
> .save(tablePath.getCanonicalPath)
>   Files.touch(new File(s"${tablePath.getCanonicalPath}/b=1", "_SUCCESS"))
>   Files.touch(new File(s"${tablePath.getCanonicalPath}/b=1/c=1", 
> "_SUCCESS"))
>   Files.touch(new File(s"${tablePath.getCanonicalPath}/b=1/c=1/d=1", 
> "_SUCCESS"))
>   
> checkAnswer(sqlContext.read.format("parquet").load(tablePath.getCanonicalPath),
>  df)
> }
>   }
> {code}
> Because {{_SUCCESS}} is in the inner partitioning dirs, partition discovery 
> will fail.






[jira] [Updated] (SPARK-11266) Peak memory tests swallow failures

2016-02-26 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11266:
---
Target Version/s:   (was: 1.6.1)

> Peak memory tests swallow failures
> --
>
> Key: SPARK-11266
> URL: https://issues.apache.org/jira/browse/SPARK-11266
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Priority: Critical
>
> You can have something like the following without the tests failing:
> {code}
> 22:29:03.493 ERROR org.apache.spark.scheduler.LiveListenerBus: Listener 
> SaveInfoListener threw an exception
> org.scalatest.exceptions.TestFailedException: peak execution memory 
> accumulator not set in 'aggregation with codegen'
>   at 
> org.apache.spark.AccumulatorSuite$$anonfun$verifyPeakExecutionMemorySet$1$$anonfun$27.apply(AccumulatorSuite.scala:340)
>   at 
> org.apache.spark.AccumulatorSuite$$anonfun$verifyPeakExecutionMemorySet$1$$anonfun$27.apply(AccumulatorSuite.scala:340)
>   at scala.Option.getOrElse(Option.scala:120)
> {code}
> E.g. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1936/consoleFull






[jira] [Updated] (SPARK-13029) Logistic regression returns inaccurate results when there is a column with identical value, and fit_intercept=false

2016-02-26 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-13029:
---
Target Version/s: 1.5.3, 2.0.0, 1.6.2  (was: 1.5.3, 1.6.1, 2.0.0)

> Logistic regression returns inaccurate results when there is a column with 
> identical value, and fit_intercept=false
> ---
>
> Key: SPARK-13029
> URL: https://issues.apache.org/jira/browse/SPARK-13029
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Shuo Xiang
>Assignee: Shuo Xiang
>
> This is a bug that appears while fitting a Logistic Regression model with 
> `.setStandardization(false)` and `.setFitIntercept(false)`. If the data matrix 
> has one column with identical values, the resulting model is not correct. 
> Specifically, that column will always get a weight of 0, due to a special 
> check inside the code. However, the correct solution, which is unique for L2 
> logistic regression, usually has a non-zero weight.
> I used the heart_scale data 
> (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) and 
> manually augmented the data matrix with a column of ones (available in the 
> PR). The resulting data is run with reg=1.0, max_iter=1000, tol=1e-9 on the 
> following tools:
>  - libsvm
>  - scikit-learn
>  - sparkml
> (Notice libsvm and scikit-learn use a slightly different formulation, so 
> their regularizer is equivalently set to 1/270).
> The first two will have an objective value 0.7275 and give a solution vector:
> [0.03007516959304916, 0.09054186091216457, 0.09540306114820495, 
> 0.02436266296315414, 0.01739437315700921, -0.0006404006623321454, 
> 0.06367837291956932, -0.0589096636263823, 0.1382458934368336, 
> 0.06653302996539669, 0.07988499067852513, 0.1197789052423401, 
> 0.1801661775839843, -0.01248615347419409].
> Spark will produce an objective value 0.7278 and give a solution vector:
> [0.029917351003921247,0.08993936770232434,0.09458507615360119,0.024920710363734895,0.018259589234194296,5.929247527202199E-4,0.06362198973221662,-0.059307008587031494,0.13886738997128056,0.0678246717525043,0.08062880450385658,0.12084979858539521,0.180460850026883,0.0]
> Notice the last element of the weight vector is 0.
> An even simpler example is:
> {code:title=benchmark.py|borderStyle=solid}
> import numpy as np
> from sklearn.datasets import load_svmlight_file
> from sklearn.linear_model import LogisticRegression
> x_train = np.array([[1, 1], [0, 1]])
> y_train = np.array([1, 0])
> model = LogisticRegression(tol=1e-9, C=0.5, max_iter=1000, 
> fit_intercept=False).fit(x_train, y_train)
> print model.coef_
> [[ 0.22478867 -0.02241016]]
> {code}
> The same data trained by the current solver also gives a different result, 
> see the unit test in the PR.






[jira] [Updated] (SPARK-10680) Flaky test: network.RequestTimeoutIntegrationSuite.timeoutInactiveRequests

2016-02-26 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10680:
---
Target Version/s:   (was: 1.6.1)

> Flaky test: network.RequestTimeoutIntegrationSuite.timeoutInactiveRequests
> --
>
> Key: SPARK-10680
> URL: https://issues.apache.org/jira/browse/SPARK-10680
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Josh Rosen
>Priority: Critical
>  Labels: flaky-test
>
> Saw several failures recently.
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.3,label=spark-test/3560/testReport/junit/org.apache.spark.network/RequestTimeoutIntegrationSuite/timeoutInactiveRequests/
> {code}
> org.apache.spark.network.RequestTimeoutIntegrationSuite.timeoutInactiveRequests
> Failing for the past 1 build (Since Failed#3560 )
> Took 6 sec.
> Stacktrace
> java.lang.NullPointerException: null
>   at 
> org.apache.spark.network.RequestTimeoutIntegrationSuite.timeoutInactiveRequests(RequestTimeoutIntegrationSuite.java:115)
> {code}






[jira] [Comment Edited] (SPARK-11691) Allow to specify compression codec in HadoopFsRelation when saving

2016-02-26 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170333#comment-15170333
 ] 

Takeshi Yamamuro edited comment on SPARK-11691 at 2/27/16 3:38 AM:
---

I think it's okay to close this ticket. As [~hyukjin.kwon] said, this issue is 
almost resolved by his PR.
Also, making this ticket an umbrella one for compression stuff is kind of 
confusing to other developers, I think.


was (Author: maropu):
I think it's okay to close this ticket. As [~hyukjin.kwon] said, this issue is 
totally almost resolved by his pr.
Also, making this ticket a umbrella one for compression stuffs is kind of 
confusing to other developers, I think.

> Allow to specify compression codec in HadoopFsRelation when saving 
> ---
>
> Key: SPARK-11691
> URL: https://issues.apache.org/jira/browse/SPARK-11691
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jeff Zhang
>
> Currently, there's no way to specify a compression codec when saving a data 
> frame to HDFS. It would be nice to allow specifying a compression codec in 
> DataFrameWriter, just as we do in the RDD API:
> {code}
> def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit = 
> withScope {
> {code}
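A hypothetical sketch of what such an API could look like on DataFrameWriter; the "compression" option name and short-name support are assumptions here, not a confirmed API, and df stands for an existing DataFrame:
{code}
df.write
  .option("compression", "gzip")  // short codec name resolved to a codec class
  .format("text")
  .save("/tmp/compressed_output")
{code}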






[jira] [Commented] (SPARK-11691) Allow to specify compression codec in HadoopFsRelation when saving

2016-02-26 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170333#comment-15170333
 ] 

Takeshi Yamamuro commented on SPARK-11691:
--

I think it's okay to close this ticket. As [~hyukjin.kwon] said, this issue is 
totally almost resolved by his pr.
Also, making this ticket a umbrella one for compression stuffs is kind of 
confusing to other developers, I think.

> Allow to specify compression codec in HadoopFsRelation when saving 
> ---
>
> Key: SPARK-11691
> URL: https://issues.apache.org/jira/browse/SPARK-11691
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jeff Zhang
>
> Currently, there's no way to specify a compression codec when saving a data 
> frame to HDFS. It would be nice to allow specifying a compression codec in 
> DataFrameWriter, just as we do in the RDD API:
> {code}
> def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit = 
> withScope {
> {code}






[jira] [Assigned] (SPARK-13527) Prune Filters based on Constraints

2016-02-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13527:


Assignee: Apache Spark

> Prune Filters based on Constraints
> --
>
> Key: SPARK-13527
> URL: https://issues.apache.org/jira/browse/SPARK-13527
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Remove all the deterministic conditions in a [[Filter]] that are contained in 
> the Child.






[jira] [Commented] (SPARK-13527) Prune Filters based on Constraints

2016-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170320#comment-15170320
 ] 

Apache Spark commented on SPARK-13527:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/11406

> Prune Filters based on Constraints
> --
>
> Key: SPARK-13527
> URL: https://issues.apache.org/jira/browse/SPARK-13527
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Remove all the deterministic conditions in a [[Filter]] that are contained in 
> the Child.






[jira] [Assigned] (SPARK-13527) Prune Filters based on Constraints

2016-02-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13527:


Assignee: (was: Apache Spark)

> Prune Filters based on Constraints
> --
>
> Key: SPARK-13527
> URL: https://issues.apache.org/jira/browse/SPARK-13527
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Remove all the deterministic conditions in a [[Filter]] that are contained in 
> the Child.






[jira] [Created] (SPARK-13527) Prune Filters based on Constraints

2016-02-26 Thread Xiao Li (JIRA)
Xiao Li created SPARK-13527:
---

 Summary: Prune Filters based on Constraints
 Key: SPARK-13527
 URL: https://issues.apache.org/jira/browse/SPARK-13527
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer, SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


Remove all the deterministic conditions in a [[Filter]] that are contained in 
the Child.
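An illustrative example of the redundancy this rule targets, assuming an existing sqlContext: the child already guarantees the condition, so the outer filter can be pruned.
{code}
val base = sqlContext.range(0, 100).selectExpr("id AS a")
val child = base.filter("a > 10")
val redundant = child.filter("a > 10")  // deterministic condition already implied by the child

// After the optimization, the plan should contain a single Filter(a > 10).
redundant.explain(true)
{code}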






[jira] [Resolved] (SPARK-13474) Update packaging scripts to stage artifacts to home.apache.org

2016-02-26 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-13474.

   Resolution: Fixed
Fix Version/s: 1.6.1
   2.0.0

Issue resolved by pull request 11350
[https://github.com/apache/spark/pull/11350]

> Update packaging scripts to stage artifacts to home.apache.org
> --
>
> Key: SPARK-13474
> URL: https://issues.apache.org/jira/browse/SPARK-13474
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0, 1.6.1
>
>
> Due to the people.apache.org -> home.apache.org migration, we need to update 
> our packaging scripts to publish artifacts to the new server. Because the new 
> server only supports sftp instead of ssh, we need to update the scripts to 
> use lftp instead of ssh + rsync.






[jira] [Assigned] (SPARK-13526) Refactor: Move SQLContext/HiveContext per-session state to separate class

2016-02-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13526:


Assignee: Andrew Or  (was: Apache Spark)

> Refactor: Move SQLContext/HiveContext per-session state to separate class
> -
>
> Key: SPARK-13526
> URL: https://issues.apache.org/jira/browse/SPARK-13526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> This is just a clean up task. Today there are all these fields in SQLContext 
> that are not organized in any particular way. However, since each SQLContext 
> is a session, many of these fields are actually isolated per-session. To 
> minimize the size of these context files and provide a logical grouping that 
> makes more sense, I propose that we move these fields into their own class, 
> called SessionState.






[jira] [Commented] (SPARK-13526) Refactor: Move SQLContext/HiveContext per-session state to separate class

2016-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170269#comment-15170269
 ] 

Apache Spark commented on SPARK-13526:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/11405

> Refactor: Move SQLContext/HiveContext per-session state to separate class
> -
>
> Key: SPARK-13526
> URL: https://issues.apache.org/jira/browse/SPARK-13526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> This is just a clean up task. Today there are all these fields in SQLContext 
> that are not organized in any particular way. However, since each SQLContext 
> is a session, many of these fields are actually isolated per-session. To 
> minimize the size of these context files and provide a logical grouping that 
> makes more sense, I propose that we move these fields into their own class, 
> called SessionState.






[jira] [Assigned] (SPARK-13526) Refactor: Move SQLContext/HiveContext per-session state to separate class

2016-02-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13526:


Assignee: Apache Spark  (was: Andrew Or)

> Refactor: Move SQLContext/HiveContext per-session state to separate class
> -
>
> Key: SPARK-13526
> URL: https://issues.apache.org/jira/browse/SPARK-13526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Apache Spark
>
> This is just a clean up task. Today there are all these fields in SQLContext 
> that are not organized in any particular way. However, since each SQLContext 
> is a session, many of these fields are actually isolated per-session. To 
> minimize the size of these context files and provide a logical grouping that 
> makes more sense, I propose that we move these fields into their own class, 
> called SessionState.






[jira] [Created] (SPARK-13526) Refactor: Move SQLContext/HiveContext per-session state to separate class

2016-02-26 Thread Andrew Or (JIRA)
Andrew Or created SPARK-13526:
-

 Summary: Refactor: Move SQLContext/HiveContext per-session state 
to separate class
 Key: SPARK-13526
 URL: https://issues.apache.org/jira/browse/SPARK-13526
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Andrew Or
Assignee: Andrew Or


This is just a clean up task. Today there are all these fields in SQLContext 
that are not organized in any particular way. However, since each SQLContext is 
a session, many of these fields are actually isolated per-session. To minimize 
the size of these context files and provide a logical grouping that makes more 
sense, I propose that we move these fields into their own class, called 
SessionState.
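A rough sketch of the proposed grouping (not Spark's actual class; the field names and types are placeholders for the kind of per-session state described above):
{code}
import scala.collection.mutable

class SessionState {
  val conf: mutable.Map[String, String] = mutable.Map.empty             // session-local SQL settings
  val temporaryTables: mutable.Map[String, AnyRef] = mutable.Map.empty  // registered temp tables
  val registeredUdfs: mutable.Set[String] = mutable.Set.empty           // session-scoped UDF names
}

// SQLContext would then hold a single `sessionState` field instead of each of these directly.
{code}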






[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-26 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170231#comment-15170231
 ] 

Xiao Li commented on SPARK-13333:
-

[~josephkb] The result is right. unionAll does not consider the column names; it matches columns by position.
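A short sketch of that point (an existing sqlContext is assumed): unionAll lines columns up by position, so select the columns by name first if the intent is to union by name.
{code}
val e1 = sqlContext.createDataFrame(List(("a", "b"), ("b", "c"), ("c", "d"))).toDF("src", "dst")
val e2 = e1.select(e1("src").as("dst"), e1("dst").as("src"))

e1.unionAll(e2).show()                       // positional union: looks like two copies of e1
e1.unionAll(e2.select("src", "dst")).show()  // reorder e2 by name first: the reversed edges appear
{code}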

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-13333
> URL: https://issues.apache.org/jira/browse/SPARK-13333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}






[jira] [Commented] (SPARK-10659) DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not nullable) flag in schema

2016-02-26 Thread Paul Greyson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170199#comment-15170199
 ] 

Paul Greyson commented on SPARK-10659:
--

I believe this makes predicate pushdown in parquet useless due to 
https://issues.apache.org/jira/browse/SPARK-1847 if the parquet file was 
produced with Spark. Is that right?

> DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not 
> nullable) flag in schema
> --
>
> Key: SPARK-10659
> URL: https://issues.apache.org/jira/browse/SPARK-10659
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0
>Reporter: Vladimir Picka
>
> DataFrames currently automatically promote all Parquet schema fields to 
> optional when they are written to an empty directory. The problem remains in 
> v1.5.0.
> The culprit is this code:
> {code}
> val relation = if (doInsertion) {
>   // This is a hack. We always set 
> nullable/containsNull/valueContainsNull to true
>   // for the schema of a parquet data.
>   val df =
> sqlContext.createDataFrame(
>   data.queryExecution.toRdd,
>   data.schema.asNullable)
>   val createdRelation =
> createRelation(sqlContext, parameters, 
> df.schema).asInstanceOf[ParquetRelation2]
>   createdRelation.insert(df, overwrite = mode == SaveMode.Overwrite)
>   createdRelation
> }
> {code}
> which was implemented as part of this PR:
> https://github.com/apache/spark/commit/1b490e91fd6b5d06d9caeb50e597639ccfc0bc3b
> This is very unexpected behaviour for some use cases in which files are read 
> from one place and written to another, such as small-file packing: it ends up 
> with incompatible files, because "required" can't normally be promoted to 
> "optional". It is the essence of a schema that it enforces the "required" 
> invariant on the data, so it should be assumed that this is intended.
> I believe a better approach is to keep the schema as-is by default and to 
> provide, e.g., a builder method or option that allows forcing fields to 
> optional.
> Right now we have to override a private API so that our files are rewritten 
> as-is, with all its perils.
> Vladimir






[jira] [Issue Comment Deleted] (SPARK-10659) DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not nullable) flag in schema

2016-02-26 Thread Paul Greyson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Greyson updated SPARK-10659:
-
Comment: was deleted

(was: Seems like a duplicate of 
https://issues.apache.org/jira/browse/SPARK-11360 which is fixed in 1.6?)

> DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not 
> nullable) flag in schema
> --
>
> Key: SPARK-10659
> URL: https://issues.apache.org/jira/browse/SPARK-10659
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0
>Reporter: Vladimir Picka
>
> DataFrames currently automatically promote all Parquet schema fields to 
> optional when they are written to an empty directory. The problem remains in 
> v1.5.0.
> The culprit is this code:
> {code}
> val relation = if (doInsertion) {
>   // This is a hack. We always set 
> nullable/containsNull/valueContainsNull to true
>   // for the schema of a parquet data.
>   val df =
> sqlContext.createDataFrame(
>   data.queryExecution.toRdd,
>   data.schema.asNullable)
>   val createdRelation =
> createRelation(sqlContext, parameters, 
> df.schema).asInstanceOf[ParquetRelation2]
>   createdRelation.insert(df, overwrite = mode == SaveMode.Overwrite)
>   createdRelation
> }
> {code}
> which was implemented as part of this PR:
> https://github.com/apache/spark/commit/1b490e91fd6b5d06d9caeb50e597639ccfc0bc3b
> This is very unexpected behaviour for some use cases in which files are read 
> from one place and written to another, such as small-file packing: it ends up 
> with incompatible files, because "required" can't normally be promoted to 
> "optional". It is the essence of a schema that it enforces the "required" 
> invariant on the data, so it should be assumed that this is intended.
> I believe a better approach is to keep the schema as-is by default and to 
> provide, e.g., a builder method or option that allows forcing fields to 
> optional.
> Right now we have to override a private API so that our files are rewritten 
> as-is, with all its perils.
> Vladimir






[jira] [Commented] (SPARK-12633) Make Parameter Descriptions Consistent for PySpark MLlib Regression

2016-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170194#comment-15170194
 ] 

Apache Spark commented on SPARK-12633:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/11404

> Make Parameter Descriptions Consistent for PySpark MLlib Regression
> ---
>
> Key: SPARK-12633
> URL: https://issues.apache.org/jira/browse/SPARK-12633
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Assignee: Vijay Kiran
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up 
> regression.py






[jira] [Commented] (SPARK-10659) DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not nullable) flag in schema

2016-02-26 Thread Paul Greyson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170188#comment-15170188
 ] 

Paul Greyson commented on SPARK-10659:
--

Seems like a duplicate of https://issues.apache.org/jira/browse/SPARK-11360 
which is fixed in 1.6?

> DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not 
> nullable) flag in schema
> --
>
> Key: SPARK-10659
> URL: https://issues.apache.org/jira/browse/SPARK-10659
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0
>Reporter: Vladimir Picka
>
> DataFrames currently automatically promote all Parquet schema fields to 
> optional when they are written to an empty directory. The problem remains in 
> v1.5.0.
> The culprit is this code:
> {code}
> val relation = if (doInsertion) {
>   // This is a hack. We always set 
> nullable/containsNull/valueContainsNull to true
>   // for the schema of a parquet data.
>   val df =
> sqlContext.createDataFrame(
>   data.queryExecution.toRdd,
>   data.schema.asNullable)
>   val createdRelation =
> createRelation(sqlContext, parameters, 
> df.schema).asInstanceOf[ParquetRelation2]
>   createdRelation.insert(df, overwrite = mode == SaveMode.Overwrite)
>   createdRelation
> }
> {code}
> which was implemented as part of this PR:
> https://github.com/apache/spark/commit/1b490e91fd6b5d06d9caeb50e597639ccfc0bc3b
> This is very unexpected behaviour for some use cases in which files are read 
> from one place and written to another, such as small-file packing: it ends up 
> with incompatible files, because "required" can't normally be promoted to 
> "optional". It is the essence of a schema that it enforces the "required" 
> invariant on the data, so it should be assumed that this is intended.
> I believe a better approach is to keep the schema as-is by default and to 
> provide, e.g., a builder method or option that allows forcing fields to 
> optional.
> Right now we have to override a private API so that our files are rewritten 
> as-is, with all its perils.
> Vladimir






[jira] [Commented] (SPARK-13247) REST API ignores spark.jars property

2016-02-26 Thread Łukasz Gieroń (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170166#comment-15170166
 ] 

Łukasz Gieroń commented on SPARK-13247:
---

Hi [~AdamWesterman]

In the spec[1], attached to the ticket[2] for redesigning this API, it says:
"It is also not a goal to expose the new gateway as a general mechanism for 
users of Spark to
submit their applications. The new gateway will be used strictly internally 
between Spark
submit and the standalone Master."
("new gateway" being the REST API)

My understanding is that, in the light of this spec, what you're trying to do 
here is not supported by the API - so it's not a bug.

[1] 
https://issues.apache.org/jira/secure/attachment/12696651/stable-spark-submit-in-standalone-mode-2-4-15.pdf
[2] https://issues.apache.org/jira/browse/SPARK-5388

> REST API ignores spark.jars property
> 
>
> Key: SPARK-13247
> URL: https://issues.apache.org/jira/browse/SPARK-13247
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Adam Westerman
>Priority: Minor
>
> When submitting a job via the REST API, if you attempt to include any extra 
> jars with the submission via the "spark.jars" property in the JSON request 
> body, those jars are not picked up by the driver when the job is submitted.  
> Note: it's entirely possible that other configurations outside the required 
> properties are also ignored, but spark.jars is one I'm certain isn't 
> working.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-26 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170146#comment-15170146
 ] 

Joseph K. Bradley commented on SPARK-1:
---

[~rxin] Is this the same issue as the following?  This is in the current master:
{code}
val e1 = sqlContext.createDataFrame(List(
  ("a", "b"),
  ("b", "c"),
  ("c", "d")
)).toDF("src", "dst")
val e2 = e1.select(e1("src").as("dst"), e1("dst").as("src"))
val e3 = e1.unionAll(e2)
e3.show()

+---+---+
|src|dst|
+---+---+
|  a|  b|
|  b|  c|
|  c|  d|
|  a|  b|
|  b|  c|
|  c|  d|
+---+---+
{code}


> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails: before the unionAll, randn produces the same results on the 
> original DataFrame and the copy, but after the unionAll it no longer does.  
> Removing the filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13519) Driver should tell Executor to stop itself when cleaning executor's state

2016-02-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-13519.
---
  Resolution: Fixed
Assignee: Shixiong Zhu
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> Driver should tell Executor to stop itself when cleaning executor's state
> -
>
> Key: SPARK-13519
> URL: https://issues.apache.org/jira/browse/SPARK-13519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> When the driver removes an executor's state, the connection between the 
> driver and the executor may still be alive, so the executor cannot exit on 
> its own (e.g., the Master sends RemoveExecutor when a worker is lost even 
> though the executor is still alive). The driver should therefore tell the 
> executor to stop itself; otherwise we leak an executor.
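
A self-contained sketch of the intended behaviour; the class and message names below are illustrative stand-ins, not Spark's internal scheduler API:

{code}
// Illustrative sketch: when the driver cleans up an executor's state it also
// tells the executor to stop, so the executor does not linger while its
// connection to the driver stays open. All names here are hypothetical.
import scala.collection.mutable

case object StopExecutor
trait ExecutorEndpoint { def send(msg: Any): Unit }

class DriverExecutorRegistry {
  private val executors = mutable.Map[String, ExecutorEndpoint]()

  def register(id: String, endpoint: ExecutorEndpoint): Unit =
    executors(id) = endpoint

  def removeExecutor(id: String, reason: String): Unit =
    executors.remove(id).foreach { endpoint =>
      // The connection may still be alive (e.g. the worker was lost but the
      // executor survived), so explicitly ask the executor to shut down.
      endpoint.send(StopExecutor)
      println(s"Removed executor $id: $reason")
    }
}
{code}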



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function

2016-02-26 Thread Shubhanshu Mishra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shubhanshu Mishra updated SPARK-13525:
--
Description: 
I am following the code steps from this example:
https://spark.apache.org/docs/1.6.0/sparkr.html

There are multiple issues: 
1. The head, summary, and filter methods are not overridden by Spark, so I 
need to call them using the `SparkR::` namespace.
2. When I try to execute the following, I get errors:

{code}
$> $R_HOME/bin/R

R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.


Welcome at Fri Feb 26 16:19:35 2016 

Attaching package: ‘SparkR’

The following objects are masked from ‘package:base’:

colnames, colnames<-, drop, intersect, rank, rbind, sample, subset,
summary, transform

Launching java with spark-submit command 
/content/smishra8/SOFTWARE/spark/bin/spark-submit   --driver-memory "50g" 
sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b 
> df <- createDataFrame(sqlContext, iris)
Warning messages:
1: In FUN(X[[i]], ...) :
  Use Sepal_Length instead of Sepal.Length  as column name
2: In FUN(X[[i]], ...) :
  Use Sepal_Width instead of Sepal.Width  as column name
3: In FUN(X[[i]], ...) :
  Use Petal_Length instead of Petal.Length  as column name
4: In FUN(X[[i]], ...) :
  Use Petal_Width instead of Petal.Width  as column name
> training <- filter(df, df$Species != "setosa")
Error in filter(df, df$Species != "setosa") : 
  no method for coercing this S4 class to a vector
> training <- SparkR::filter(df, df$Species != "setosa")
> model <- SparkR::glm(Species ~ Sepal_Length + Sepal_Width, data = training, 
> family = "binomial")
16/02/26 16:26:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at 
java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
at java.net.ServerSocket.implAccept(ServerSocket.java:530)
at java.net.ServerSocket.accept(ServerSocket.java:498)
at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431)
at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45)
at org.apache.spark.scheduler.Task.run(Task.scala:81)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util

[jira] [Updated] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function

2016-02-26 Thread Shubhanshu Mishra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shubhanshu Mishra updated SPARK-13525:
--
Description: 
I am following the code steps from this example:
https://spark.apache.org/docs/1.6.0/sparkr.html

There are multiple issues: 
1. The head, summary, and filter methods are not overridden by Spark, so I 
need to call them using the `SparkR::` namespace.
2. When I try to execute the following, I get errors:

{code}
$> $R_HOME/bin/R

R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.


Welcome at Fri Feb 26 16:19:35 2016 

Attaching package: ‘SparkR’

The following objects are masked from ‘package:base’:

colnames, colnames<-, drop, intersect, rank, rbind, sample, subset,
summary, transform

Launching java with spark-submit command 
/content/smishra8/SOFTWARE/spark/bin/spark-submit   --driver-memory "50g" 
sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b 
> df <- createDataFrame(sqlContext, iris)
Warning messages:
1: In FUN(X[[i]], ...) :
  Use Sepal_Length instead of Sepal.Length  as column name
2: In FUN(X[[i]], ...) :
  Use Sepal_Width instead of Sepal.Width  as column name
3: In FUN(X[[i]], ...) :
  Use Petal_Length instead of Petal.Length  as column name
4: In FUN(X[[i]], ...) :
  Use Petal_Width instead of Petal.Width  as column name
> training <- filter(df, df$Species != "setosa")
Error in filter(df, df$Species != "setosa") : 
  no method for coercing this S4 class to a vector
> training <- SparkR::filter(df, df$Species != "setosa")
> model <- SparkR::glm(Species ~ Sepal_Length + Sepal_Width, data = training, 
> family = "binomial")
16/02/26 16:26:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at 
java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
at java.net.ServerSocket.implAccept(ServerSocket.java:530)
at java.net.ServerSocket.accept(ServerSocket.java:498)
at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431)
at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45)
at org.apache.spark.scheduler.Task.run(Task.scala:81)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util

[jira] [Created] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function

2016-02-26 Thread Shubhanshu Mishra (JIRA)
Shubhanshu Mishra created SPARK-13525:
-

 Summary: SparkR: java.net.SocketTimeoutException: Accept timed out 
when running any dataframe function
 Key: SPARK-13525
 URL: https://issues.apache.org/jira/browse/SPARK-13525
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Reporter: Shubhanshu Mishra


I am following the code steps from this example:
https://spark.apache.org/docs/1.6.0/sparkr.html

There are multiple issues: 
1. The head, summary, and filter methods are not overridden by Spark, so I 
need to call them using the `SparkR::` namespace.
2. When I try to execute the following, I get errors:

{code:R}
$R_HOME/bin/R

R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.


Welcome at Fri Feb 26 16:19:35 2016 

Attaching package: ‘SparkR’

The following objects are masked from ‘package:base’:

colnames, colnames<-, drop, intersect, rank, rbind, sample, subset,
summary, transform

Launching java with spark-submit command 
/content/smishra8/SOFTWARE/spark/bin/spark-submit   --driver-memory "50g" 
sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b 
> df <- createDataFrame(sqlContext, iris)
Warning messages:
1: In FUN(X[[i]], ...) :
  Use Sepal_Length instead of Sepal.Length  as column name
2: In FUN(X[[i]], ...) :
  Use Sepal_Width instead of Sepal.Width  as column name
3: In FUN(X[[i]], ...) :
  Use Petal_Length instead of Petal.Length  as column name
4: In FUN(X[[i]], ...) :
  Use Petal_Width instead of Petal.Width  as column name
> training <- filter(df, df$Species != "setosa")
Error in filter(df, df$Species != "setosa") : 
  no method for coercing this S4 class to a vector
> training <- SparkR::filter(df, df$Species != "setosa")
> model <- SparkR::glm(Species ~ Sepal_Length + Sepal_Width, data = training, 
> family = "binomial")
16/02/26 16:26:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at 
java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
at java.net.ServerSocket.implAccept(ServerSocket.java:530)
at java.net.ServerSocket.accept(ServerSocket.java:498)
at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431)
at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45)
at org.apache.spark.scheduler.Task.run(Task.

[jira] [Updated] (SPARK-13505) Python API for MaxAbsScaler

2016-02-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13505:
--
Assignee: Li Ping Zhang

> Python API for MaxAbsScaler
> ---
>
> Key: SPARK-13505
> URL: https://issues.apache.org/jira/browse/SPARK-13505
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Xiangrui Meng
>Assignee: Li Ping Zhang
>  Labels: starter
> Fix For: 2.0.0
>
>
> After SPARK-13028, we should add Python API for MaxAbsScaler.
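
For reference, a short sketch of the existing Scala API that the Python wrapper mirrors, assuming a Spark 2.0-era session named spark; the toy data is made up:

{code}
// Sketch of the Scala MaxAbsScaler API whose Python counterpart this issue adds.
// Assumes a Spark 2.0-era session named `spark`; the toy data is illustrative.
import org.apache.spark.ml.feature.MaxAbsScaler
import org.apache.spark.ml.linalg.Vectors

val data = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 0.1, -8.0)),
  (1, Vectors.dense(2.0, 1.0, -4.0)),
  (2, Vectors.dense(4.0, 10.0, 8.0))
)).toDF("id", "features")

val scaler = new MaxAbsScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")

// Each feature is rescaled into [-1, 1] by dividing by its maximum absolute value.
scaler.fit(data).transform(data).show(false)
{code}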



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13505) Python API for MaxAbsScaler

2016-02-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-13505.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11393
[https://github.com/apache/spark/pull/11393]

> Python API for MaxAbsScaler
> ---
>
> Key: SPARK-13505
> URL: https://issues.apache.org/jira/browse/SPARK-13505
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Xiangrui Meng
>  Labels: starter
> Fix For: 2.0.0
>
>
> After SPARK-13028, we should add Python API for MaxAbsScaler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13454) Cannot drop table whose name starts with underscore

2016-02-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-13454.
--
   Resolution: Fixed
Fix Version/s: 1.6.1

Issue resolved by pull request 11349
[https://github.com/apache/spark/pull/11349]

> Cannot drop table whose name starts with underscore
> ---
>
> Key: SPARK-13454
> URL: https://issues.apache.org/jira/browse/SPARK-13454
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Cheng Lian
> Fix For: 1.6.1
>
>
> Spark shell snippet for reproduction:
> {code}
> sqlContext.sql("CREATE TABLE `_a`(i INT)") // This one works.
> sqlContext.sql("DROP TABLE `_a`") // This one failed. Basically, we cannot 
> drop a table starting with _ in Spark 1.6.0. Master is fine.
> {code}
> Exception thrown:
> {noformat}
> NoViableAltException(13@[192:1: tableName : (db= identifier DOT tab= 
> identifier -> ^( TOK_TABNAME $db $tab) |tab= identifier -> ^( TOK_TABNAME 
> $tab) );])
> at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
> at org.antlr.runtime.DFA.predict(DFA.java:144)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.tableName(HiveParser_FromClauseParser.java:4747)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser.tableName(HiveParser.java:45918)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser.dropTableStatement(HiveParser.java:7133)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2655)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1650)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1109)
> at 
> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
> at 
> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:396)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
> at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:484)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:473)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:279)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:226)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:225)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:268)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:473)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:463)
> at 
> org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:605)
> at 
> org.apache.spark.sql.hive.execution.DropTable.run(commands.scala:73)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:145)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at 
> $line21.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
> at 
> $line21.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31)
> at $line21.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
> at $line21.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
> at $line21.$read$$iwC$$iwC$$iwC$$

[jira] [Assigned] (SPARK-13454) Cannot drop table whose name starts with underscore

2016-02-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reassigned SPARK-13454:


Assignee: Yin Huai

> Cannot drop table whose name starts with underscore
> ---
>
> Key: SPARK-13454
> URL: https://issues.apache.org/jira/browse/SPARK-13454
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Cheng Lian
>Assignee: Yin Huai
> Fix For: 1.6.1
>
>
> Spark shell snippet for reproduction:
> {code}
> sqlContext.sql("CREATE TABLE `_a`(i INT)") // This one works.
> sqlContext.sql("DROP TABLE `_a`") // This one failed. Basically, we cannot 
> drop a table starting with _ in Spark 1.6.0. Master is fine.
> {code}
> Exception thrown:
> {noformat}
> NoViableAltException(13@[192:1: tableName : (db= identifier DOT tab= 
> identifier -> ^( TOK_TABNAME $db $tab) |tab= identifier -> ^( TOK_TABNAME 
> $tab) );])
> at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
> at org.antlr.runtime.DFA.predict(DFA.java:144)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.tableName(HiveParser_FromClauseParser.java:4747)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser.tableName(HiveParser.java:45918)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser.dropTableStatement(HiveParser.java:7133)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2655)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1650)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1109)
> at 
> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
> at 
> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:396)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
> at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:484)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:473)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:279)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:226)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:225)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:268)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:473)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:463)
> at 
> org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:605)
> at 
> org.apache.spark.sql.hive.execution.DropTable.run(commands.scala:73)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:145)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at 
> $line21.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
> at 
> $line21.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31)
> at $line21.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
> at $line21.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
> at $line21.$read$$iwC$$iwC$$iwC$$iwC.(:37)
> at $line21.$read$$iwC$$iwC$$iwC.(:39)
> at $l

[jira] [Resolved] (SPARK-13500) Add an example for LDA in PySpark

2016-02-26 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-13500.
--
Resolution: Duplicate

This example and others are being added as part of the duplicate issue.

> Add an example for LDA in PySpark
> -
>
> Key: SPARK-13500
> URL: https://issues.apache.org/jira/browse/SPARK-13500
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples, PySpark
>Affects Versions: 2.0.0
>Reporter: Bryan Cutler
>Priority: Minor
>
> PySpark is missing an example of MLlib LDA usage; it would be nice to have 
> one.
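
For context, a compact sketch of the RDD-based MLlib LDA usage such an example would demonstrate, assuming an existing SparkContext named sc; the tiny corpus is made up:

{code}
// Compact sketch of MLlib LDA usage (RDD-based API). Assumes an existing
// SparkContext named `sc`; the tiny term-count corpus below is illustrative.
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Each document is (docId, termCountVector).
val corpus = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 2.0, 6.0, 0.0)),
  (1L, Vectors.dense(1.0, 3.0, 0.0, 4.0)),
  (2L, Vectors.dense(0.0, 1.0, 2.0, 2.0))
))

val ldaModel = new LDA().setK(2).setMaxIterations(10).run(corpus)
println(s"Learned topics (as distributions over ${ldaModel.vocabSize} terms):")
println(ldaModel.topicsMatrix)
{code}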



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13518) Enable vectorized parquet reader by default

2016-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169975#comment-15169975
 ] 

Apache Spark commented on SPARK-13518:
--

User 'nongli' has created a pull request for this issue:
https://github.com/apache/spark/pull/11397

> Enable vectorized parquet reader by default
> ---
>
> Key: SPARK-13518
> URL: https://issues.apache.org/jira/browse/SPARK-13518
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nong Li
>
> This feature was disabled by default, but the implementation should now be 
> complete and it can be enabled.
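
A hedged sketch of toggling the flag explicitly; the config key shown is the Spark 2.0-era name and should be treated as an assumption on other versions, and the path is illustrative:

{code}
// Hedged sketch: explicitly toggling the vectorized Parquet reader while it is
// still behind a flag. The config key is the 2.0-era name (an assumption on
// other versions); the path is illustrative.
sqlContext.setConf("spark.sql.parquet.enableVectorizedReader", "true")
val df = sqlContext.read.parquet("/path/to/data.parquet")
df.show()
{code}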



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13518) Enable vectorized parquet reader by default

2016-02-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13518:


Assignee: Apache Spark

> Enable vectorized parquet reader by default
> ---
>
> Key: SPARK-13518
> URL: https://issues.apache.org/jira/browse/SPARK-13518
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nong Li
>Assignee: Apache Spark
>
> This feature was disabled by default, but the implementation should now be 
> complete and it can be enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13518) Enable vectorized parquet reader by default

2016-02-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13518:


Assignee: (was: Apache Spark)

> Enable vectorized parquet reader by default
> ---
>
> Key: SPARK-13518
> URL: https://issues.apache.org/jira/browse/SPARK-13518
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nong Li
>
> This feature was disabled by default, but the implementation should now be 
> complete and it can be enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13524) Remove BroadcastManager

2016-02-26 Thread Andrew Or (JIRA)
Andrew Or created SPARK-13524:
-

 Summary: Remove BroadcastManager
 Key: SPARK-13524
 URL: https://issues.apache.org/jira/browse/SPARK-13524
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Andrew Or


It has only two methods that we actually care about: broadcast and unbroadcast. 
That's it. It also doesn't make sense to create a BroadcastManager on executors: 
there's a flag in the constructor called `isDriver`, but if you trace it 
downstream it's not used anywhere!

TL;DR: BroadcastManager is not needed and there's a lot of opportunity for 
cleanup here. This doesn't have to happen before 2.0.0 since it's all internal API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13523) Reuse the exchanges in a query

2016-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169893#comment-15169893
 ] 

Apache Spark commented on SPARK-13523:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/11403

> Reuse the exchanges in a query
> --
>
> Key: SPARK-13523
> URL: https://issues.apache.org/jira/browse/SPARK-13523
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>
> In an exchange, the RDD is materialized (shuffled or collected), which makes 
> it a good point at which to eliminate common parts of a query.
> In some TPCDS queries (for example, Q64), the same exchange (ShuffleExchange 
> or BroadcastExchange) may be used multiple times; we should reuse these 
> exchanges to avoid duplicated work and to reduce the memory used for broadcasts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13523) Reuse the exchanges in a query

2016-02-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13523:


Assignee: (was: Apache Spark)

> Reuse the exchanges in a query
> --
>
> Key: SPARK-13523
> URL: https://issues.apache.org/jira/browse/SPARK-13523
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>
> In an exchange, the RDD is materialized (shuffled or collected), which makes 
> it a good point at which to eliminate common parts of a query.
> In some TPCDS queries (for example, Q64), the same exchange (ShuffleExchange 
> or BroadcastExchange) may be used multiple times; we should reuse these 
> exchanges to avoid duplicated work and to reduce the memory used for broadcasts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13523) Reuse the exchanges in a query

2016-02-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13523:


Assignee: Apache Spark

> Reuse the exchanges in a query
> --
>
> Key: SPARK-13523
> URL: https://issues.apache.org/jira/browse/SPARK-13523
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> In an exchange, the RDD is materialized (shuffled or collected), which makes 
> it a good point at which to eliminate common parts of a query.
> In some TPCDS queries (for example, Q64), the same exchange (ShuffleExchange 
> or BroadcastExchange) may be used multiple times; we should reuse these 
> exchanges to avoid duplicated work and to reduce the memory used for broadcasts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13523) Reuse the exchanges in a query

2016-02-26 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13523:
--

 Summary: Reuse the exchanges in a query
 Key: SPARK-13523
 URL: https://issues.apache.org/jira/browse/SPARK-13523
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Davies Liu


In an exchange, the RDD is materialized (shuffled or collected), which makes it 
a good point at which to eliminate common parts of a query.

In some TPCDS queries (for example, Q64), the same exchange (ShuffleExchange or 
BroadcastExchange) may be used multiple times; we should reuse these exchanges 
to avoid duplicated work and to reduce the memory used for broadcasts.
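
To make the idea concrete, a small sketch of a query whose plan contains the same exchange twice, assuming an existing sqlContext; the data is synthetic:

{code}
// Small sketch of a query whose physical plan contains the same exchange twice.
// Assumes an existing sqlContext; the synthetic data is illustrative.
val sales = sqlContext.range(0, 1000000).selectExpr("id % 100 AS key", "id AS amount")
val perKey = sales.groupBy("key").sum("amount")

// Both join inputs are the same aggregated subplan, so the shuffle for `perKey`
// is currently executed once per side; reusing the exchange would run it once.
val joined = perKey.as("a").join(perKey.as("b"), "key")
joined.explain()
{code}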



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13444) QuantileDiscretizer chooses bad splits on large DataFrames

2016-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169835#comment-15169835
 ] 

Apache Spark commented on SPARK-13444:
--

User 'oliverpierson' has created a pull request for this issue:
https://github.com/apache/spark/pull/11402

> QuantileDiscretizer chooses bad splits on large DataFrames
> --
>
> Key: SPARK-13444
> URL: https://issues.apache.org/jira/browse/SPARK-13444
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Oliver Pierson
>Assignee: Oliver Pierson
> Fix For: 2.0.0
>
>
> In certain circumstances, QuantileDiscretizer fails to calculate the correct 
> splits and will instead split the data into two bins regardless of the value 
> specified in numBuckets.
> For example, suppose dataset.count is 200 million and we do:
> val discretizer = new QuantileDiscretizer().setNumBuckets(10)
>   ... set output and input columns ...
> val dataWithBins = discretizer.fit(dataset).transform(dataset)
> In this case, dataWithBins will have only two distinct bins versus the 
> expected 10.
> The problem is in lines 113 and 114 of QuantileDiscretizer.scala and can be 
> fixed by changing line 113 like so:
> before: val requiredSamples = math.max(numBins * numBins, 1)
> after: val requiredSamples = math.max(numBins * numBins, 1.0)
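
A tiny worked illustration of the likely arithmetic behind that fix; the values are made up:

{code}
// Worked illustration of the likely arithmetic behind the fix above (values
// made up): with an Int `requiredSamples`, the sampling fraction underflows to
// zero for large datasets, so the sampler sees almost no data.
val numBins = 10
val count = 200000000L                                        // 200 million rows

val intSamples = math.max(numBins * numBins, 1)               // Int: 100
val badFraction = intSamples / count                          // integer division: 0
val goodFraction = math.max(numBins * numBins, 1.0) / count   // Double: 5.0e-7

println(s"bad fraction = $badFraction, good fraction = $goodFraction")
{code}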



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13522) Executor should kill itself when it's unable to heartbeat to the driver more than N times

2016-02-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13522:


Assignee: (was: Apache Spark)

> Executor should kill itself when it's unable to heartbeat to the driver more 
> than N times
> -
>
> Key: SPARK-13522
> URL: https://issues.apache.org/jira/browse/SPARK-13522
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>
> Sometimes a network disconnection event won't be triggered, due to other 
> potential race conditions that we may not have thought of; the executor then 
> keeps sending heartbeats to the driver and never exits.
> We should make the Executor kill itself when it is unable to heartbeat to the 
> driver more than N times.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13522) Executor should kill itself when it's unable to heartbeat to the driver more than N times

2016-02-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13522:


Assignee: Apache Spark

> Executor should kill itself when it's unable to heartbeat to the driver more 
> than N times
> -
>
> Key: SPARK-13522
> URL: https://issues.apache.org/jira/browse/SPARK-13522
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> Sometimes a network disconnection event won't be triggered, due to other 
> potential race conditions that we may not have thought of; the executor then 
> keeps sending heartbeats to the driver and never exits.
> We should make the Executor kill itself when it is unable to heartbeat to the 
> driver more than N times.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13522) Executor should kill itself when it's unable to heartbeat to the driver more than N times

2016-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169795#comment-15169795
 ] 

Apache Spark commented on SPARK-13522:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/11401

> Executor should kill itself when it's unable to heartbeat to the driver more 
> than N times
> -
>
> Key: SPARK-13522
> URL: https://issues.apache.org/jira/browse/SPARK-13522
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>
> Sometimes a network disconnection event won't be triggered, due to other 
> potential race conditions that we may not have thought of; the executor then 
> keeps sending heartbeats to the driver and never exits.
> We should make the Executor kill itself when it is unable to heartbeat to the 
> driver more than N times.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13522) Executor should kill itself when it's unable to heartbeat to the driver more than N times

2016-02-26 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-13522:


 Summary: Executor should kill itself when it's unable to heartbeat 
to the driver more than N times
 Key: SPARK-13522
 URL: https://issues.apache.org/jira/browse/SPARK-13522
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Shixiong Zhu


Sometimes a network disconnection event won't be triggered, due to other 
potential race conditions that we may not have thought of; the executor then 
keeps sending heartbeats to the driver and never exits.

We should make the Executor kill itself when it is unable to heartbeat to the 
driver more than N times.
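
A self-contained sketch of the proposed policy; the names and the limit N below are illustrative, not Spark's actual implementation:

{code}
// Self-contained sketch of the proposed policy. The class/method names and the
// default limit are illustrative, not Spark's actual implementation.
import scala.util.{Failure, Success, Try}

class HeartbeatLoop(sendHeartbeat: () => Unit, maxFailures: Int = 60) {
  private var consecutiveFailures = 0

  def beat(): Unit = Try(sendHeartbeat()) match {
    case Success(_) =>
      consecutiveFailures = 0
    case Failure(e) =>
      consecutiveFailures += 1
      Console.err.println(s"Heartbeat failed ($consecutiveFailures/$maxFailures): ${e.getMessage}")
      if (consecutiveFailures >= maxFailures) {
        // Unable to reach the driver N times in a row: kill the executor rather
        // than letting it heartbeat forever into a dead connection.
        sys.exit(1)
      }
  }
}
{code}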



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13465) Add a task failure listener to TaskContext

2016-02-26 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13465.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11340
[https://github.com/apache/spark/pull/11340]

> Add a task failure listener to TaskContext
> --
>
> Key: SPARK-13465
> URL: https://issues.apache.org/jira/browse/SPARK-13465
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> TaskContext supports a task completion callback, which is called regardless 
> of whether the task failed. However, there is no way for that listener to 
> know whether there was an error. This ticket proposes adding a new listener 
> that is called when a task fails.
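
A short sketch contrasting the two listeners, assuming an existing SparkContext named sc; the failure-listener signature shown is the 2.0-era form and should be treated as an assumption:

{code}
// Sketch of the completion listener vs. the new failure listener. Assumes an
// existing SparkContext named `sc`; the failure-listener signature is the
// 2.0-era form and should be treated as an assumption.
import org.apache.spark.TaskContext

sc.parallelize(1 to 100, 4).foreachPartition { iter =>
  val ctx = TaskContext.get()
  // Runs whether the task succeeds or fails, but cannot observe the error.
  ctx.addTaskCompletionListener { _ => println(s"task ${ctx.partitionId()} finished") }
  // New in this ticket: runs only on failure and receives the exception.
  ctx.addTaskFailureListener { (_, error) => println(s"task failed: $error") }
  iter.foreach(_ => ())
}
{code}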



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13499) Optimize vectorized parquet reader for dictionary encoded data and RLE decoding

2016-02-26 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13499.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11375
[https://github.com/apache/spark/pull/11375]

> Optimize vectorized parquet reader for dictionary encoded data and RLE 
> decoding
> ---
>
> Key: SPARK-13499
> URL: https://issues.apache.org/jira/browse/SPARK-13499
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nong Li
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13521) Remove reference to Tachyon in cluster & release script

2016-02-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13521:


Assignee: Apache Spark  (was: Reynold Xin)

> Remove reference to Tachyon in cluster & release script
> ---
>
> Key: SPARK-13521
> URL: https://issues.apache.org/jira/browse/SPARK-13521
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> We provide a very limited set of cluster management scripts in Spark for 
> Tachyon, although Tachyon itself provides a much better version of them. 
> Given that Spark users can now simply use Tachyon as a normal file system 
> without extensive configuration, we can remove these management capabilities 
> to simplify Spark's bash scripts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13521) Remove reference to Tachyon in cluster & release script

2016-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169700#comment-15169700
 ] 

Apache Spark commented on SPARK-13521:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/11400

> Remove reference to Tachyon in cluster & release script
> ---
>
> Key: SPARK-13521
> URL: https://issues.apache.org/jira/browse/SPARK-13521
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> We provide a very limited set of cluster management scripts in Spark for 
> Tachyon, although Tachyon itself provides a much better version of them. 
> Given that Spark users can now simply use Tachyon as a normal file system 
> without extensive configuration, we can remove these management capabilities 
> to simplify Spark's bash scripts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13521) Remove reference to Tachyon in cluster & release script

2016-02-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13521:


Assignee: Reynold Xin  (was: Apache Spark)

> Remove reference to Tachyon in cluster & release script
> ---
>
> Key: SPARK-13521
> URL: https://issues.apache.org/jira/browse/SPARK-13521
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> We provide a very limited set of cluster management scripts in Spark for 
> Tachyon, although Tachyon itself provides a much better version of them. 
> Given that Spark users can now simply use Tachyon as a normal file system 
> without extensive configuration, we can remove these management capabilities 
> to simplify Spark's bash scripts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13521) Remove reference to Tachyon in cluster & release script

2016-02-26 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-13521:
---

 Summary: Remove reference to Tachyon in cluster & release script
 Key: SPARK-13521
 URL: https://issues.apache.org/jira/browse/SPARK-13521
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


We provide a very limited set of cluster management scripts in Spark for 
Tachyon, although Tachyon itself provides a much better version of them. Given 
that Spark users can now simply use Tachyon as a normal file system without 
extensive configuration, we can remove these management capabilities to 
simplify Spark's bash scripts.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13517) Expose regression summary classes in Pyspark

2016-02-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13517:
--
Labels:   (was: classification ml mllib pyspark regression summary)
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)
   Summary: Expose regression summary classes in Pyspark  (was: Summary 
classes of scala not exposed in Pyspark)

[~shubhanshumis...@gmail.com] I fixed the title and other JIRA fields, but 
please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before 
opening JIRAs 

> Expose regression summary classes in Pyspark
> 
>
> Key: SPARK-13517
> URL: https://issues.apache.org/jira/browse/SPARK-13517
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Shubhanshu Mishra
>Priority: Minor
>
> Many of the Scala classes in MLlib for extracting model summary statistics 
> are not available in the PySpark API. 
> - LinearRegressionSummary
> - LinearRegressionTrainingSummary
> - BinaryLogisticRegressionSummary
> - BinaryLogisticRegressionTrainingSummary
> - LogisticRegressionSummary
> - LogisticRegressionTrainingSummary
> E.g. 
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.regression.LinearRegressionTrainingSummary
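
For reference, a sketch of the Scala-side summary API that this issue asks to expose, assuming a Spark 2.0-era session named spark; the toy data is made up:

{code}
// Sketch of the Scala-side training summary that PySpark currently cannot
// reach. Assumes a Spark 2.0-era session named `spark`; the toy data is
// illustrative.
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression

val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (0.0, Vectors.dense(2.0, 1.0)),
  (2.0, Vectors.dense(4.0, 3.1))
)).toDF("label", "features")

val model = new LinearRegression().setMaxIter(10).fit(training)

// These summary statistics are exposed in Scala but not yet in PySpark.
val summary = model.summary
println(s"RMSE = ${summary.rootMeanSquaredError}, r2 = ${summary.r2}")
{code}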



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13520) Design doc for configuration in Spark 2.0+

2016-02-26 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-13520:
---

 Summary: Design doc for configuration in Spark 2.0+
 Key: SPARK-13520
 URL: https://issues.apache.org/jira/browse/SPARK-13520
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Attachments: User-facingConfigurationinSpark2.0.pdf

This is just a ticket to post the design doc for user-facing configuration 
management in Spark 2.0.







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13520) Design doc for configuration in Spark 2.0+

2016-02-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-13520:

Attachment: User-facingConfigurationinSpark2.0.pdf

design doc

> Design doc for configuration in Spark 2.0+
> --
>
> Key: SPARK-13520
> URL: https://issues.apache.org/jira/browse/SPARK-13520
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Attachments: User-facingConfigurationinSpark2.0.pdf
>
>
> This is just a ticket to post the design doc for user-facing configuration 
> management in Spark 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13520) Design doc for configuration in Spark 2.0+

2016-02-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-13520.
---
Resolution: Fixed

Closing the ticket since I posted the design doc.



> Design doc for configuration in Spark 2.0+
> --
>
> Key: SPARK-13520
> URL: https://issues.apache.org/jira/browse/SPARK-13520
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Attachments: User-facingConfigurationinSpark2.0.pdf
>
>
> This is just a ticket to post the design doc for user-facing configuration 
> management in Spark 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13520) Design doc for configuration in Spark 2.0+

2016-02-26 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169664#comment-15169664
 ] 

Reynold Xin edited comment on SPARK-13520 at 2/26/16 7:54 PM:
--

Closing the ticket since I posted the design doc. This is a good place to 
discuss this design if there are any comments.



was (Author: rxin):
Closing the ticket since I posted the design doc.



> Design doc for configuration in Spark 2.0+
> --
>
> Key: SPARK-13520
> URL: https://issues.apache.org/jira/browse/SPARK-13520
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Attachments: User-facingConfigurationinSpark2.0.pdf
>
>
> This is just a ticket to post the design doc for user-facing configuration 
> management in Spark 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13485) Dataset-oriented API foundation in Spark 2.0

2016-02-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-13485:

Summary: Dataset-oriented API foundation in Spark 2.0  (was: Dataset API 
foundation in Spark 2.0)

> Dataset-oriented API foundation in Spark 2.0
> 
>
> Key: SPARK-13485
> URL: https://issues.apache.org/jira/browse/SPARK-13485
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> As part of Spark 2.0, we want to create a stable API foundation for Dataset 
> to become the main user-facing API in Spark. This ticket tracks various tasks 
> related to that.
> The main high level changes are:
> 1. Merge Dataset/DataFrame
> 2. Create a more natural entry point for Dataset (SQLContext is not ideal 
> because of the name "SQL")
> 3. First class support for sessions
> 4. First class support for some system catalog



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13519) Driver should tell Executor to stop itself when cleaning executor's state

2016-02-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13519:


Assignee: Apache Spark

> Driver should tell Executor to stop itself when cleaning executor's state
> -
>
> Key: SPARK-13519
> URL: https://issues.apache.org/jira/browse/SPARK-13519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> When the driver removes an executor's state, the connection between the 
> driver and the executor may still be alive, so the executor cannot exit 
> automatically (e.g., the Master sends RemoveExecutor when a worker is lost 
> but the executor is still alive). In that case the driver should tell the 
> executor to stop itself; otherwise, we will leak an executor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13519) Driver should tell Executor to stop itself when cleaning executor's state

2016-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169655#comment-15169655
 ] 

Apache Spark commented on SPARK-13519:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/11399

> Driver should tell Executor to stop itself when cleaning executor's state
> -
>
> Key: SPARK-13519
> URL: https://issues.apache.org/jira/browse/SPARK-13519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>
> When the driver removes an executor's state, the connection between the 
> driver and the executor may still be alive, so the executor cannot exit 
> automatically (e.g., the Master sends RemoveExecutor when a worker is lost 
> but the executor is still alive). In that case the driver should tell the 
> executor to stop itself; otherwise, we will leak an executor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13519) Driver should tell Executor to stop itself when cleaning executor's state

2016-02-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13519:


Assignee: (was: Apache Spark)

> Driver should tell Executor to stop itself when cleaning executor's state
> -
>
> Key: SPARK-13519
> URL: https://issues.apache.org/jira/browse/SPARK-13519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>
> When the driver removes an executor's state, the connection between the 
> driver and the executor may still be alive, so the executor cannot exit 
> automatically (e.g., the Master sends RemoveExecutor when a worker is lost 
> but the executor is still alive). In that case the driver should tell the 
> executor to stop itself; otherwise, we will leak an executor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13519) Driver should tell Executor to stop itself when cleaning executor's state

2016-02-26 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-13519:


 Summary: Driver should tell Executor to stop itself when cleaning 
executor's state
 Key: SPARK-13519
 URL: https://issues.apache.org/jira/browse/SPARK-13519
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Shixiong Zhu


When the driver removes an executor's state, the connection between the driver 
and the executor may still be alive, so the executor cannot exit automatically 
(e.g., the Master sends RemoveExecutor when a worker is lost but the executor 
is still alive). In that case the driver should tell the executor to stop 
itself; otherwise, we will leak an executor.
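A minimal sketch of the idea, using hypothetical names (StopExecutor, 
ExecutorEndpoint, DriverBackendSketch) rather than Spark's actual internal 
classes:

{code}
import scala.collection.mutable

// Hypothetical message and endpoint abstraction, not Spark's real RPC types.
case object StopExecutor

trait ExecutorEndpoint {
  def send(msg: Any): Unit // fire-and-forget over the still-open connection
}

final case class ExecutorData(executorId: String, endpoint: ExecutorEndpoint)

class DriverBackendSketch {
  private val executors = mutable.Map.empty[String, ExecutorData]

  def registerExecutor(data: ExecutorData): Unit =
    executors += data.executorId -> data

  // Clean up the executor's state AND tell the executor to exit, so it is not
  // leaked when the connection happens to still be alive (e.g. a lost worker).
  def removeExecutor(executorId: String, reason: String): Unit =
    executors.remove(executorId).foreach { data =>
      data.endpoint.send(StopExecutor) // the extra step this ticket asks for
      println(s"Removed executor $executorId: $reason")
    }
}
{code}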



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13500) Add an example for LDA in PySpark

2016-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169595#comment-15169595
 ] 

Apache Spark commented on SPARK-13500:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/11398

> Add an example for LDA in PySpark
> -
>
> Key: SPARK-13500
> URL: https://issues.apache.org/jira/browse/SPARK-13500
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples, PySpark
>Affects Versions: 2.0.0
>Reporter: Bryan Cutler
>Priority: Minor
>
> PySpark is missing an example of MLlib LDA usage; it would be nice to have 
> one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13500) Add an example for LDA in PySpark

2016-02-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13500:


Assignee: (was: Apache Spark)

> Add an example for LDA in PySpark
> -
>
> Key: SPARK-13500
> URL: https://issues.apache.org/jira/browse/SPARK-13500
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples, PySpark
>Affects Versions: 2.0.0
>Reporter: Bryan Cutler
>Priority: Minor
>
> PySpark is missing an example of MLlib LDA usage; it would be nice to have 
> one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13500) Add an example for LDA in PySpark

2016-02-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13500:


Assignee: Apache Spark

> Add an example for LDA in PySpark
> -
>
> Key: SPARK-13500
> URL: https://issues.apache.org/jira/browse/SPARK-13500
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples, PySpark
>Affects Versions: 2.0.0
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Minor
>
> PySpark is missing an example of MLlib LDA usage; it would be nice to have 
> one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13492) Configure a custom webui_url for the Spark Mesos Framework

2016-02-26 Thread Sergiusz Urbaniak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169589#comment-15169589
 ] 

Sergiusz Urbaniak commented on SPARK-13492:
---

/cc [~dragos]

> Configure a custom webui_url for the Spark Mesos Framework
> --
>
> Key: SPARK-13492
> URL: https://issues.apache.org/jira/browse/SPARK-13492
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Sergiusz Urbaniak
>Priority: Minor
>
> Previously, the Mesos framework webui URL was derived only from the Spark UI 
> address, leaving no way to configure it. This issue proposes to make it 
> configurable. If unset, it falls back to the previous behavior.
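For illustration, the kind of setting this would enable; the property name used 
here is an assumption, not something confirmed by this ticket:

{code}
import org.apache.spark.SparkConf

// Hypothetical usage: point the Mesos framework's webui_url at an externally
// reachable address instead of the one derived from the Spark UI.
// "spark.mesos.driver.webui.url" is an assumed property name.
val conf = new SparkConf()
  .setAppName("mesos-webui-url-example")
  .set("spark.mesos.driver.webui.url", "http://proxy.example.com:4040")
{code}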



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13492) Configure a custom webui_url for the Spark Mesos Framework

2016-02-26 Thread Sergiusz Urbaniak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169581#comment-15169581
 ] 

Sergiusz Urbaniak commented on SPARK-13492:
---

/cc [~tnachen]

> Configure a custom webui_url for the Spark Mesos Framework
> --
>
> Key: SPARK-13492
> URL: https://issues.apache.org/jira/browse/SPARK-13492
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Sergiusz Urbaniak
>Priority: Minor
>
> Previously, the Mesos framework webui URL was derived only from the Spark UI 
> address, leaving no way to configure it. This issue proposes to make it 
> configurable. If unset, it falls back to the previous behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13518) Enable vectorized parquet reader by default

2016-02-26 Thread Nong Li (JIRA)
Nong Li created SPARK-13518:
---

 Summary: Enable vectorized parquet reader by default
 Key: SPARK-13518
 URL: https://issues.apache.org/jira/browse/SPARK-13518
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Nong Li


This feature was disabled by default, but the implementation should now be 
complete and it can be enabled by default.
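A minimal sketch of flipping the flag explicitly, assuming a SparkSession 
{{spark}}; the configuration key name here is an assumption based on the Spark 
SQL config naming scheme:

{code}
// Explicitly enable (or, after this change, disable) the vectorized reader.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
spark.read.parquet("hdfs:///tmp/some-parquet").show()
{code}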



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13517) Summary classes of scala not exposed in Pyspark

2016-02-26 Thread Shubhanshu Mishra (JIRA)
Shubhanshu Mishra created SPARK-13517:
-

 Summary: Summary classes of scala not exposed in Pyspark
 Key: SPARK-13517
 URL: https://issues.apache.org/jira/browse/SPARK-13517
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Reporter: Shubhanshu Mishra


Many of the Scala classes in MLlib for extracting summary statistics of fitted 
models are not available in the PySpark API; a sketch of the Scala-side usage 
follows the list below.

- LinearRegressionSummary
- LinearRegressionTrainingSummary
- BinaryLogisticRegressionSummary
- BinaryLogisticRegressionTrainingSummary
- LogisticRegressionSummary
- LogisticRegressionTrainingSummary



E.g. 
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.regression.LinearRegressionTrainingSummary
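A minimal sketch of the Scala-side usage that is not yet reachable from 
PySpark, assuming the spark.ml LinearRegression API in Spark 2.x; the toy data 
is illustrative only:

{code}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("summary-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val training = Seq(
  (1.0, Vectors.dense(1.0, 0.0)),
  (2.0, Vectors.dense(2.0, 1.0)),
  (3.0, Vectors.dense(3.0, 2.5)),
  (4.0, Vectors.dense(4.0, 1.5))
).toDF("label", "features")

val model = new LinearRegression().setMaxIter(5).fit(training)

// model.summary is a LinearRegressionTrainingSummary in Scala; PySpark has no
// equivalent wrapper yet.
val summary = model.summary
println(summary.rootMeanSquaredError)
println(summary.r2)
{code}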



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13502) Missing ml.NaiveBayes in MLlib guide

2016-02-26 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169449#comment-15169449
 ] 

Xusen Yin commented on SPARK-13502:
---

Thanks, I close it.

> Missing ml.NaiveBayes in MLlib guide
> 
>
> Key: SPARK-13502
> URL: https://issues.apache.org/jira/browse/SPARK-13502
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Xusen Yin
>Priority: Trivial
>
> There is no ml.NaiveBayes in docs/ml-classification-regression.md. Just like 
> other classification methods, we should write a section for it with a 
> runnable example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13502) Missing ml.NaiveBayes in MLlib guide

2016-02-26 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin closed SPARK-13502.
-
Resolution: Duplicate

> Missing ml.NaiveBayes in MLlib guide
> 
>
> Key: SPARK-13502
> URL: https://issues.apache.org/jira/browse/SPARK-13502
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Xusen Yin
>Priority: Trivial
>
> There is no ml.NaiveBayes in docs/ml-classification-regression.md. Just like 
> other classification methods, we should write a section for it with a 
> runnable example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-7768) Make user-defined type (UDT) API public

2016-02-26 Thread Jakob Odersky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakob Odersky reopened SPARK-7768:
--

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12313) getPartitionsByFilter doesnt handle predicates on all / multiple Partition Columns

2016-02-26 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12313.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11328
[https://github.com/apache/spark/pull/11328]

> getPartitionsByFilter doesnt handle predicates on all / multiple Partition 
> Columns
> --
>
> Key: SPARK-12313
> URL: https://issues.apache.org/jira/browse/SPARK-12313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Gobinathan SP
>Priority: Critical
> Fix For: 2.0.0
>
>
> When spark.sql.hive.metastorePartitionPruning is enabled, 
> getPartitionsByFilter is used.
> For a table partitioned by p1 and p2, when hc.sql("select col 
> from tabl1 where p1='p1V' and p2= 'p2V' ") is triggered,
> the HiveShim identifies the predicates and ConvertFilters returns p1='p1V' 
> and col2= 'p2V'. The same is passed to the getPartitionsByFilter method as 
> the filter string.
> In these cases the partitions are not returned from Hive's 
> getPartitionsByFilter method. As a result, the query always returns zero 
> rows.
> However, a filter on a single column always works; it probably doesn't come 
> through this route.
> I'm using Oracle for the metastore, v0.13.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public

2016-02-26 Thread Randall Whitman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169380#comment-15169380
 ] 

Randall Whitman commented on SPARK-7768:


Am I missing something?

As far as I can see, the @Experimental annotation is still present on class 
UserDefinedFunction - 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala

Also, I have not seen any mention of addressing the design issue of using 
@SQLUserDefinedType with third-party libraries, which is discussed in this JIRA 
from 2015/05/21 through 2015/06/12.

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8630) Prevent from checkpointing QueueInputDStream

2016-02-26 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169364#comment-15169364
 ] 

Shixiong Zhu commented on SPARK-8630:
-

[~crakjie] Could you use 1.5.1 or 1.6.0? This was fixed in SPARK-10071

> Prevent from checkpointing QueueInputDStream
> 
>
> Key: SPARK-8630
> URL: https://issues.apache.org/jira/browse/SPARK-8630
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 1.4.1, 1.5.0
>
>
> It's better to prevent checkpointing of QueueInputDStream than to fail the 
> application when recovering `QueueInputDStream`, so that people can find the 
> issue as soon as possible. See SPARK-8553 for example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12007) Network library's RPC layer requires a lot of copying

2016-02-26 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169362#comment-15169362
 ] 

Marcelo Vanzin commented on SPARK-12007:


[~xukun] same as with closed github PRs, unless you're 100% certain that it's 
the same bug, do not ask questions in closed bugs.

We have mailing lists for that: http://spark.apache.org/community.html

Or if you want, open a *new* bug with enough information for reproducing the 
issue:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-ContributingBugReports


> Network library's RPC layer requires a lot of copying
> -
>
> Key: SPARK-12007
> URL: https://issues.apache.org/jira/browse/SPARK-12007
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.6.0
>
>
> The network library's RPC layer has an external API based on byte arrays, 
> instead of ByteBuffer; that requires a lot of copying since the internals of 
> the library use ByteBuffers (or rather Netty's ByteBuf), and lots of external 
> clients also use ByteBuffer.
> The extra copies could be avoided if the API used ByteBuffer instead.
> To show an extreme case, look at an RPC send via NettyRpcEnv:
> - message is encoded using JavaSerializer, resulting in a ByteBuffer
> - the ByteBuffer is copied into a byte array of the right size, since its 
> internal array may be larger than the actual data it holds
> - the network library's encoder copies the byte array into a ByteBuf
> - finally the data is written to the socket
> The intermediate 2 copies could be avoided if the API allowed the original 
> ByteBuffer to be sent instead.
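To make the intermediate copy concrete, a small self-contained sketch (plain 
JDK serialization standing in for JavaSerializer): a byte[]-based RPC API 
forces the serialized ByteBuffer to be copied into a right-sized array, and the 
network encoder then copies that array again into a ByteBuf.

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.nio.ByteBuffer

// Serialize a message; the result naturally lives in a ByteBuffer.
val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
oos.writeObject("some rpc message")
oos.close()
val serialized: ByteBuffer = ByteBuffer.wrap(bos.toByteArray)

// Copy #1: a byte[]-based API requires a right-sized copy of the buffer...
val copyForByteArrayApi = new Array[Byte](serialized.remaining())
serialized.duplicate().get(copyForByteArrayApi)

// ...and copy #2 happens when the transport encoder copies the byte[] into a
// Netty ByteBuf. A ByteBuffer-based API could hand `serialized` straight to
// the transport layer and skip both copies.
{code}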



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13445) Selecting "data" with window function does not work unless aliased (using PARTITION BY)

2016-02-26 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169352#comment-15169352
 ] 

Xiao Li commented on SPARK-13445:
-

Since 2.0, we have native support for window functions: 
https://issues.apache.org/jira/browse/SPARK-8641 

Our new implementation requires users to explicitly specify the ORDER BY 
clause. In 1.6, we still used Hive UDAFs for window functions, and Hive's 
row_number() UDAF does not require it. Thus, we did not see the error message.

Let me know if we need to add extra logic to enforce it. Thanks!
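For example, the failing query from the report below would have the shape this 
comment asks for only with an explicit ORDER BY inside the window spec; using 
`ts` as the ordering column is an assumption about the table's schema:

{code}
sql("""
  SELECT
    data,
    row_number() OVER (PARTITION BY data.type ORDER BY ts) AS foo
  FROM event_record_sample
""")
{code}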

> Selecting "data" with window function does not work unless aliased (using 
> PARTITION BY)
> ---
>
> Key: SPARK-13445
> URL: https://issues.apache.org/jira/browse/SPARK-13445
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Reynold Xin
>Priority: Critical
>
> The code does not throw an exception if "data" is aliased.  Maybe this is a 
> reserved word or aliases are just required when using PARTITION BY?
> {code}
> sql("""
>   SELECT 
> data as the_data,
> row_number() over (partition BY data.type) AS foo
>   FROM event_record_sample
> """)
> {code}
> However, this code throws an error:
> {code}
> sql("""
>   SELECT 
> data,
> row_number() over (partition BY data.type) AS foo
>   FROM event_record_sample
> """)
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: resolved attribute(s) type#15246 
> missing from 
> data#15107,par_cat#15112,schemaMajorVersion#15110,source#15108,recordId#15103,features#15106,eventType#15105,ts#15104L,schemaMinorVersion#15111,issues#15109
>  in operator !Project [data#15107,type#15246];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:183)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:104)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:104)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:104)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:104)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:104)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:104)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:104)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:104)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:104)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11085) Add support for HTTP proxy

2016-02-26 Thread Anbu Cheeralan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169346#comment-15169346
 ] 

Anbu Cheeralan commented on SPARK-11085:


I use Hortonworks. I was able to resolve this by doing the following:
1. Create a javaopts file in the SPARK_HOME/conf folder.
2. Add all the Java options there, like below:
-Dhttp.proxyHost=proxy.host 
-Dhttp.proxyPort=8080

> Add support for HTTP proxy 
> ---
>
> Key: SPARK-11085
> URL: https://issues.apache.org/jira/browse/SPARK-11085
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Reporter: Dustin Cote
>Priority: Minor
>
> Add a way to update ivysettings.xml for the spark-shell and spark-submit to 
> support proxy settings for clusters that need to access a remote repository 
> through an http proxy.  Typically this would be done like:
> JAVA_OPTS="$JAVA_OPTS -Dhttp.proxyHost=proxy.host -Dhttp.proxyPort=8080 
> -Dhttps.proxyHost=proxy.host.secure -Dhttps.proxyPort=8080"
> Directly in the ivysettings.xml would look like:
>  
> <proxy host="proxy.host" proxyport="8080" nonproxyhosts="nonproxy.host"/> 
>  
> Even better would be a way to customize the ivysettings.xml with command 
> options.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8630) Prevent from checkpointing QueueInputDStream

2016-02-26 Thread etienne (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169286#comment-15169286
 ] 

etienne commented on SPARK-8630:


Trying to update from version 1.4.0, I encounter the same problem as 
[~asimjalis], because this PR has not been reverted since (maybe in a later 
version?).
What is the alternative way to test reduceByKey and other steps that use 
checkpointing?

> Prevent from checkpointing QueueInputDStream
> 
>
> Key: SPARK-8630
> URL: https://issues.apache.org/jira/browse/SPARK-8630
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 1.4.1, 1.5.0
>
>
> It's better to prevent checkpointing of QueueInputDStream than to fail the 
> application when recovering `QueueInputDStream`, so that people can find the 
> issue as soon as possible. See SPARK-8553 for example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11381) Replace example code in mllib-linear-methods.md using include_example

2016-02-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11381.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11320
[https://github.com/apache/spark/pull/11320]

> Replace example code in mllib-linear-methods.md using include_example
> -
>
> Key: SPARK-11381
> URL: https://issues.apache.org/jira/browse/SPARK-11381
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Reporter: Xusen Yin
>Assignee: Dongjoon Hyun
>  Labels: starter
> Fix For: 2.0.0
>
>
> This is similar to SPARK-11289 but for the example code in 
> mllib-linear-methods.md.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12634) Make Parameter Descriptions Consistent for PySpark MLlib Tree

2016-02-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-12634.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11353
[https://github.com/apache/spark/pull/11353]

> Make Parameter Descriptions Consistent for PySpark MLlib Tree
> -
>
> Key: SPARK-12634
> URL: https://issues.apache.org/jira/browse/SPARK-12634
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Assignee: Vijay Kiran
>Priority: Trivial
>  Labels: doc, starter
> Fix For: 2.0.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up tree.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13457) Remove DataFrame RDD operations

2016-02-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-13457.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11388
[https://github.com/apache/spark/pull/11388]

> Remove DataFrame RDD operations
> ---
>
> Key: SPARK-13457
> URL: https://issues.apache.org/jira/browse/SPARK-13457
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
>
> We'd like to remove DataFrame RDD operations like {{map}}, {{filter}}, and 
> {{foreach}} because:
> # After making DataFrame a subclass of {{Dataset\[Row\]}}, these methods 
> conflict with methods in Dataset.
> # By returning RDDs, they are semantically improper.
> It's trivial to remove them since they simply delegate to methods of 
> {{DataFrame.rdd}}.
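A minimal sketch of what this means for user code, assuming an existing 
DataFrame {{df}} and a SparkSession {{spark}}:

{code}
// Before this change (Spark 1.x), DataFrame.map returned an RDD:
//   val names = df.map(row => row.getString(0))   // RDD[String]

// After this change, drop to the RDD explicitly when you want an RDD...
val namesRdd = df.rdd.map(row => row.getString(0))

// ...or stay in the Dataset API, which now owns map/filter/foreach and keeps
// the result a Dataset[String].
import spark.implicits._
val namesDs = df.map(row => row.getString(0))
{code}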



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11691) Allow to specify compression codec in HadoopFsRelation when saving

2016-02-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169216#comment-15169216
 ] 

Sean Owen commented on SPARK-11691:
---

OK, typically we link the PR here (which is done automatically when you title 
the PR with the JIRA number) and note it was resolved by a certain PR. That is, 
if it's Fixed, the JIRA needs to point to what fixed it. 

What I'm not clear on is how it relates to the other 2 existing PRs. Are they 
also needed?


> Allow to specify compression codec in HadoopFsRelation when saving 
> ---
>
> Key: SPARK-11691
> URL: https://issues.apache.org/jira/browse/SPARK-11691
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jeff Zhang
>
> Currently, there's no way to specify a compression codec when saving a data 
> frame to HDFS. It would be nice to allow specifying a compression codec in 
> DataFrameWriter, just as we do in the RDD API (see the sketch after this quote):
> {code}
> def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit = 
> withScope {
> {code}
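A minimal sketch of how this is exposed for the text-based sources discussed 
later in this thread, assuming a DataFrame {{df}} and the "compression" writer 
option of the JSON/CSV/text data sources:

{code}
// Sketch only: the "compression" option applies to the text-based sources;
// Parquet and ORC configure their codecs separately.
df.write
  .option("compression", "gzip")
  .json("hdfs:///tmp/output-json")
{code}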



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11691) Allow to specify compression codec in HadoopFsRelation when saving

2016-02-26 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169160#comment-15169160
 ] 

Hyukjin Kwon edited comment on SPARK-11691 at 2/26/16 3:10 PM:
---

This issue deals with somewhat more generalized compression options compared to 
the issue I mentioned here. However, the general Hadoop compression 
configurations can only be applied to the JSON, CSV and TEXT datasources, for 
which I already submitted some PRs that were merged. So, although the issues 
themselves are slightly different, I think the PRs I submitted cover this. 

Also, I think we can't just assume all the {{HadoopFsRelation}}s support 
compression. For ORC and Parquet, they might have to be dealt with differently 
due to different configuration keys and supported codecs.

Should we then move all the issues about compression codecs for each data 
source to this issue as sub-tasks?


was (Author: hyukjin.kwon):
This issue deals with somewhat more generalized compression options compared to 
the issue I mentioned here. However, the general Hadoop compression 
configurations can only be applied to the JSON, CSV and TEXT datasources, for 
which I already submitted some PRs that were merged. So, although the issues 
themselves are slightly different, I think the PRs I submitted cover this. 

Also, I think we can't just assume all the {{HadoopFsRelation}}s support 
compression. For ORC and Parquet, they might have to be dealt with differently 
due to different configuration keys and supported codecs.

> Allow to specify compression codec in HadoopFsRelation when saving 
> ---
>
> Key: SPARK-11691
> URL: https://issues.apache.org/jira/browse/SPARK-11691
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jeff Zhang
>
> Currently, there's no way to specify a compression codec when saving a data 
> frame to HDFS. It would be nice to allow specifying a compression codec in 
> DataFrameWriter, just as we do in the RDD API:
> {code}
> def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit = 
> withScope {
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


