[jira] [Commented] (SPARK-3190) Creation of large graph(> 2.15 B nodes) seems to be broken:possible overflow somewhere

2016-05-01 Thread Yuance Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266145#comment-15266145
 ] 

Yuance Li commented on SPARK-3190:
--

PRs 2106 and 7923 cannot fix the problem completely: the bug can still be 
reproduced when the number of vertices in a single partition exceeds 
Integer.MAX_VALUE. The root cause is that the variable "size" is declared as 
type Int in class VertexPartitionBase.
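For illustration, here is the arithmetic behind the overflow as a minimal Scala 
sketch (plain Scala, not Spark code): a 64-bit vertex count forced through a 
32-bit Int wraps modulo 2^32, and the wrapped values match the numbers reported 
in this issue.

{code}
object IntOverflowDemo extends App {
  // 6,101,995,593 vertices counted in an Int wrap around to a smaller positive value
  val trueCount1 = 6101995593L
  println(trueCount1.toInt)   // 1807028297, the incorrect numVertices reported below

  // a count just above Integer.MAX_VALUE wraps to a negative value
  val trueCount2 = 2157586441L
  println(trueCount2.toInt)   // -2137380855, the negative numVertices reported below
}
{code}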

> Creation of large graph(> 2.15 B nodes) seems to be broken:possible overflow 
> somewhere 
> ---
>
> Key: SPARK-3190
> URL: https://issues.apache.org/jira/browse/SPARK-3190
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.0.3
> Environment: Standalone mode running on EC2 . Using latest code from 
> master branch upto commit #db56f2df1b8027171da1b8d2571d1f2ef1e103b6 .
>Reporter: npanj
>Assignee: Ankur Dave
>Priority: Critical
> Fix For: 1.0.3, 1.1.0, 1.2.0, 1.3.2, 1.4.2, 1.5.0
>
>
> While creating a graph with 6B nodes and 12B edges, I noticed that the 
> 'numVertices' API returns an incorrect result, while 'numEdges' reports the 
> correct number. A few times (with different datasets > 2.5B nodes) I have also 
> noticed that numVertices is returned as a negative number, so I suspect there 
> is an overflow somewhere (maybe we are using Int for some field?).
> Here are some details of the experiments I have done so far: 
> 1. Input: numNodes=6101995593 ; noEdges=12163784626
>Graph returns: numVertices=1807028297 ;  numEdges=12163784626
> 2. Input : numNodes=2157586441 ; noEdges=2747322705
>Graph Returns: numVertices=-2137380855 ;  numEdges=2747322705
> 3. Input: numNodes=1725060105 ; noEdges=204176821
>Graph: numVertices=1725060105 ;  numEdges=2041768213
> You can find the code to generate this bug here: 
> https://gist.github.com/npanj/92e949d86d08715bf4bf
> Note: Nodes are labeled 1...6B.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15053) Fix Java Lint errors on Hive-Thriftserver module

2016-05-01 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-15053:
--
Component/s: Build

> Fix Java Lint errors on Hive-Thriftserver module
> 
>
> Key: SPARK-15053
> URL: https://issues.apache.org/jira/browse/SPARK-15053
> Project: Spark
>  Issue Type: Task
>  Components: Build, SQL
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> This issue fixes or hides 181 Java linter errors introduced by SPARK-14987 
> which copied hive service code from Hive. We had better clean up these errors 
> before releasing Spark 2.0. 
> * Fix UnusedImports (15 lines), RedundantModifier (14 lines), SeparatorWrap 
> (9 lines), MethodParamPad (6 lines), FileTabCharacter (5 lines), 
> ArrayTypeStyle (3 lines), ModifierOrder (3 lines), RedundantImport (1 line), 
> CommentsIndentation (1 line), UpperEll (1 line), FallThrough (1 line), 
> OneStatementPerLine (1 line), NewlineAtEndOfFile (1 line).
> * Ignore `LineLength` errors under `hive/service/*` (118 lines).
> * Ignore `MethodName` error in `PasswdAuthenticationProvider.java` (1 line).
> * Ignore `NoFinalizer` error in `ThreadWithGarbageCleanup.java` (1 line).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-05-01 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266116#comment-15266116
 ] 

Xin Wu commented on SPARK-15044:


I tried {code}alter table test drop partition (p=1){code}, and then the select 
returns 0 rows without an exception.
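For completeness, the workaround in a HiveContext session, reusing the table 
name from the issue description (a sketch, not a tested fix):

{code}
// drop the partition whose directory was removed by hand, then re-run the query
sqlContext.sql("alter table test drop partition (p=1)")
sqlContext.sql("select n from test where p='1'").show()   // 0 rows, no exception
{code}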

> spark-sql will throw "input path does not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually
> -
>
> Key: SPARK-15044
> URL: https://issues.apache.org/jira/browse/SPARK-15044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: huangyu
>
> spark-sql will throw an "input path does not exist" exception if it handles a 
> partition that exists in the Hive table but whose path has been removed 
> manually. The situation is as follows:
> 1) Create a table "test": "create table test (n string) partitioned by (p 
> string)"
> 2) Load some data into partition (p='1')
> 3) Remove the path related to partition (p='1') of table test manually: "hadoop 
> fs -rmr /warehouse//test/p=1"
> 4) Run spark-sql: spark-sql -e "select n from test where p='1';"
> Then it throws an exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> ./test/p=1
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
> at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> {code}
> The bug is present in Spark 1.6.1; with Spark 1.4.0 it works fine.
> I think spark-sql should ignore the missing path, as Hive does and as earlier 
> versions of Spark did, rather than throw an exception.
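For reference, the reproduction steps above condensed into a spark-shell 
(HiveContext) session; the shell command is kept as a comment and the paths are 
taken verbatim from the report:

{code}
// 1) create the partitioned table
sqlContext.sql("create table test (n string) partitioned by (p string)")
// 2) load some data into partition p='1' (any INSERT or LOAD DATA statement)
// 3) remove the partition directory behind Hive's back, from a shell:
//      hadoop fs -rmr /warehouse//test/p=1
// 4) the metastore still lists the partition, so on 1.6.1 this throws
//    org.apache.hadoop.mapred.InvalidInputException
sqlContext.sql("select n from test where p='1'").show()
{code}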



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14302) Python examples code merge and clean up

2016-05-01 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin resolved SPARK-14302.
---
Resolution: Won't Fix

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning up that code, be sure not to disturb the existing 
> example on/off blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14302) Python examples code merge and clean up

2016-05-01 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266093#comment-15266093
 ] 

Xusen Yin commented on SPARK-14302:
---

I'll close it; if anything else comes up, I'll let you know. Thanks!

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning up that code, be sure not to disturb the existing 
> example on/off blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15054) Deprecate old accumulator API

2016-05-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15054:


Assignee: Reynold Xin  (was: Apache Spark)

> Deprecate old accumulator API
> -
>
> Key: SPARK-15054
> URL: https://issues.apache.org/jira/browse/SPARK-15054
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15054) Deprecate old accumulator API

2016-05-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266090#comment-15266090
 ] 

Apache Spark commented on SPARK-15054:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12832

> Deprecate old accumulator API
> -
>
> Key: SPARK-15054
> URL: https://issues.apache.org/jira/browse/SPARK-15054
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15053) Fix Java Lint errors on Hive-Thriftserver module

2016-05-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15053:


Assignee: Apache Spark

> Fix Java Lint errors on Hive-Thriftserver module
> 
>
> Key: SPARK-15053
> URL: https://issues.apache.org/jira/browse/SPARK-15053
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Trivial
>
> This issue fixes or hides 181 Java linter errors introduced by SPARK-14987 
> which copied hive service code from Hive. We had better clean up these errors 
> before releasing Spark 2.0. 
> * Fix UnusedImports (15 lines), RedundantModifier (14 lines), SeparatorWrap 
> (9 lines), MethodParamPad (6 lines), FileTabCharacter (5 lines), 
> ArrayTypeStyle (3 lines), ModifierOrder (3 lines), RedundantImport (1 line), 
> CommentsIndentation (1 line), UpperEll (1 line), FallThrough (1 line), 
> OneStatementPerLine (1 line), NewlineAtEndOfFile (1 line).
> * Ignore `LineLength` errors under `hive/service/*` (118 lines).
> * Ignore `MethodName` error in `PasswdAuthenticationProvider.java` (1 line).
> * Ignore `NoFinalizer` error in `ThreadWithGarbageCleanup.java` (1 line).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15054) Deprecate old accumulator API

2016-05-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15054:


Assignee: Apache Spark  (was: Reynold Xin)

> Deprecate old accumulator API
> -
>
> Key: SPARK-15054
> URL: https://issues.apache.org/jira/browse/SPARK-15054
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15053) Fix Java Lint errors on Hive-Thriftserver module

2016-05-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15053:


Assignee: (was: Apache Spark)

> Fix Java Lint errors on Hive-Thriftserver module
> 
>
> Key: SPARK-15053
> URL: https://issues.apache.org/jira/browse/SPARK-15053
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> This issue fixes or hides 181 Java linter errors introduced by SPARK-14987 
> which copied hive service code from Hive. We had better clean up these errors 
> before releasing Spark 2.0. 
> * Fix UnusedImports (15 lines), RedundantModifier (14 lines), SeparatorWrap 
> (9 lines), MethodParamPad (6 lines), FileTabCharacter (5 lines), 
> ArrayTypeStyle (3 lines), ModifierOrder (3 lines), RedundantImport (1 line), 
> CommentsIndentation (1 line), UpperEll (1 line), FallThrough (1 line), 
> OneStatementPerLine (1 line), NewlineAtEndOfFile (1 line).
> * Ignore `LineLength` errors under `hive/service/*` (118 lines).
> * Ignore `MethodName` error in `PasswdAuthenticationProvider.java` (1 line).
> * Ignore `NoFinalizer` error in `ThreadWithGarbageCleanup.java` (1 line).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15053) Fix Java Lint errors on Hive-Thriftserver module

2016-05-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266089#comment-15266089
 ] 

Apache Spark commented on SPARK-15053:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/12831

> Fix Java Lint errors on Hive-Thriftserver module
> 
>
> Key: SPARK-15053
> URL: https://issues.apache.org/jira/browse/SPARK-15053
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> This issue fixes or hides 181 Java linter errors introduced by SPARK-14987 
> which copied hive service code from Hive. We had better clean up these errors 
> before releasing Spark 2.0. 
> * Fix UnusedImports (15 lines), RedundantModifier (14 lines), SeparatorWrap 
> (9 lines), MethodParamPad (6 lines), FileTabCharacter (5 lines), 
> ArrayTypeStyle (3 lines), ModifierOrder (3 lines), RedundantImport (1 line), 
> CommentsIndentation (1 line), UpperEll (1 line), FallThrough (1 line), 
> OneStatementPerLine (1 line), NewlineAtEndOfFile (1 line).
> * Ignore `LineLength` errors under `hive/service/*` (118 lines).
> * Ignore `MethodName` error in `PasswdAuthenticationProvider.java` (1 line).
> * Ignore `NoFinalizer` error in `ThreadWithGarbageCleanup.java` (1 line).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15054) Deprecate old accumulator API

2016-05-01 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-15054:
---

 Summary: Deprecate old accumulator API
 Key: SPARK-15054
 URL: https://issues.apache.org/jira/browse/SPARK-15054
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15053) Fix Java Lint errors on Hive-Thriftserver module

2016-05-01 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-15053:
-

 Summary: Fix Java Lint errors on Hive-Thriftserver module
 Key: SPARK-15053
 URL: https://issues.apache.org/jira/browse/SPARK-15053
 Project: Spark
  Issue Type: Task
  Components: SQL
Reporter: Dongjoon Hyun
Priority: Trivial


This issue fixes or hides 181 Java linter errors introduced by SPARK-14987 
which copied hive service code from Hive. We had better clean up these errors 
before releasing Spark 2.0. 

* Fix UnusedImports (15 lines), RedundantModifier (14 lines), SeparatorWrap (9 
lines), MethodParamPad (6 lines), FileTabCharacter (5 lines), ArrayTypeStyle (3 
lines), ModifierOrder (3 lines), RedundantImport (1 line), CommentsIndentation 
(1 line), UpperEll (1 line), FallThrough (1 line), OneStatementPerLine (1 
line), NewlineAtEndOfFile (1 line).
* Ignore `LineLength` errors under `hive/service/*` (118 lines).
* Ignore `MethodName` error in `PasswdAuthenticationProvider.java` (1 line).
* Ignore `NoFinalizer` error in `ThreadWithGarbageCleanup.java` (1 line).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15052) Use builder pattern to create SparkSession

2016-05-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15052:


Assignee: Apache Spark  (was: Reynold Xin)

> Use builder pattern to create SparkSession
> --
>
> Key: SPARK-15052
> URL: https://issues.apache.org/jira/browse/SPARK-15052
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15052) Use builder pattern to create SparkSession

2016-05-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266088#comment-15266088
 ] 

Apache Spark commented on SPARK-15052:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12830

> Use builder pattern to create SparkSession
> --
>
> Key: SPARK-15052
> URL: https://issues.apache.org/jira/browse/SPARK-15052
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15052) Use builder pattern to create SparkSession

2016-05-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15052:


Assignee: Reynold Xin  (was: Apache Spark)

> Use builder pattern to create SparkSession
> --
>
> Key: SPARK-15052
> URL: https://issues.apache.org/jira/browse/SPARK-15052
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15052) Use builder pattern to create SparkSession

2016-05-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15052:

Summary: Use builder pattern to create SparkSession  (was: Add ways to 
create SparkSession without requiring explicitly creating SparkContext first)

> Use builder pattern to create SparkSession
> --
>
> Key: SPARK-15052
> URL: https://issues.apache.org/jira/browse/SPARK-15052
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
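For context, a minimal sketch of the builder-style entry point this issue 
proposes, based on the new summary and the linked pull request; the exact method 
names are an assumption here, not something confirmed in this thread:

{code}
import org.apache.spark.sql.SparkSession

// hypothetical usage: no explicit SparkContext is created by the caller
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("builder-example")
  .getOrCreate()

val df = spark.range(10)   // the underlying SparkContext is created by the builder
{code}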




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15052) Add ways to create SparkSession without requiring explicitly creating SparkContext first

2016-05-01 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-15052:
---

 Summary: Add ways to create SparkSession without requiring 
explicitly creating SparkContext first
 Key: SPARK-15052
 URL: https://issues.apache.org/jira/browse/SPARK-15052
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14931) Mismatched default Param values between pipelines in Spark and PySpark

2016-05-01 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-14931:

Component/s: PySpark
 ML

> Mismatched default Param values between pipelines in Spark and PySpark
> --
>
> Key: SPARK-14931
> URL: https://issues.apache.org/jira/browse/SPARK-14931
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Xusen Yin
>Assignee: Xusen Yin
>  Labels: ML, PySpark
> Fix For: 2.0.0
>
>
> Mismatched default values between pipelines in Spark and PySpark lead to 
> different pipelines in PySpark after saving and loading.
> Find generic ways to check JavaParams then fix them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15049) Rename NewAccumulator to AccumulatorV2

2016-05-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15049.
-
Resolution: Fixed

> Rename NewAccumulator to AccumulatorV2
> --
>
> Key: SPARK-15049
> URL: https://issues.apache.org/jira/browse/SPARK-15049
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15045) Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable

2016-05-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15045:


Assignee: (was: Apache Spark)

> Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable
> -
>
> Key: SPARK-15045
> URL: https://issues.apache.org/jira/browse/SPARK-15045
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> Unless my eyes trick me, {{TaskMemoryManager}} first clears {{pageTable}} 
> inside a synchronized block and then does it again right after the block. I 
> think the cleanup outside the block is dead code.
> See 
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L382-L397
>  with the relevant snippet pasted below:
> {code}
>   public long cleanUpAllAllocatedMemory() {
> synchronized (this) {
>   Arrays.fill(pageTable, null);
>   ...
> }
> for (MemoryBlock page : pageTable) {
>   if (page != null) {
> memoryManager.tungstenMemoryAllocator().free(page);
>   }
> }
> Arrays.fill(pageTable, null);
>...
> {code}
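A small Scala sketch of why the code after the synchronized block is a no-op: 
once the array has been filled with nulls, the following loop frees nothing and 
the trailing fill changes nothing ({{pageTable}} below is a stand-in array, not 
the real field):

{code}
val pageTable = new Array[AnyRef](8192)    // stand-in for TaskMemoryManager.pageTable
java.util.Arrays.fill(pageTable, null)     // what the synchronized block already did

// equivalent of the code after the block: every entry is already null
val pagesLeftToFree = pageTable.count(_ != null)   // 0, so the free loop does nothing
java.util.Arrays.fill(pageTable, null)             // redundant second fill
{code}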



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14974) spark sql job create too many files in HDFS when doing insert overwrite hive table

2016-05-01 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266077#comment-15266077
 ] 

Xiao Li commented on SPARK-14974:
-

200w is 200 万; 万 is a Chinese unit meaning 10,000, so 200w = 2,000,000. : )

> spark sql job create too many files in HDFS when doing insert overwrite hive 
> table
> --
>
> Key: SPARK-14974
> URL: https://issues.apache.org/jira/browse/SPARK-14974
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: zenglinxi
>Priority: Minor
>
> Recently, we often encounter problems when using Spark SQL to insert data into 
> a partitioned table (e.g. insert overwrite table $output_table partition(dt) 
> select xxx from tmp_table).
> After the Spark job starts running on YARN, the application can create too many 
> files (e.g. 2,000,000 or even 10,000,000), which puts HDFS under enormous 
> pressure.
> We found that the number of files created by the Spark job depends on the 
> number of partitions of the Hive table being inserted into and the number of 
> Spark SQL partitions:
> files_num = hive_table_partitions_num * spark_sql_partitions_num.
> We often set spark_sql_partitions_num (spark.sql.shuffle.partitions) >= 1000, 
> and hive_table_partitions_num is usually small under normal circumstances, but 
> it can turn out to be more than 2000 when we accidentally use a wrong field as 
> the partition field, which makes files_num >= 1000 * 2000 = 2,000,000.
> There is a configuration parameter in Hive, 
> hive.exec.max.dynamic.partitions.pernode, that limits the maximum number of 
> dynamic partitions allowed to be created in each mapper/reducer, but this 
> parameter does not take effect when we use HiveContext.
> Reducing spark_sql_partitions_num (spark.sql.shuffle.partitions) makes 
> files_num smaller, but it affects concurrency.
> Can we add configuration parameters to limit the maximum number of files 
> allowed to be created by each task, or limit spark_sql_partitions_num without 
> affecting concurrency?
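A small sketch of the arithmetic above plus the only knob mentioned in the 
report (spark.sql.shuffle.partitions); this only illustrates the trade-off, it 
is not a fix:

{code}
// files_num = hive_table_partitions_num * spark_sql_partitions_num
val sparkSqlPartitions  = 1000
val hiveTablePartitions = 2000   // e.g. after picking the wrong partition field
val filesNum = hiveTablePartitions * sparkSqlPartitions   // 2,000,000 output files

// lowering shuffle partitions lowers files_num, but also lowers concurrency
sqlContext.setConf("spark.sql.shuffle.partitions", "200")
{code}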



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15045) Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable

2016-05-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266078#comment-15266078
 ] 

Apache Spark commented on SPARK-15045:
--

User 'abhi951990' has created a pull request for this issue:
https://github.com/apache/spark/pull/12829

> Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable
> -
>
> Key: SPARK-15045
> URL: https://issues.apache.org/jira/browse/SPARK-15045
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> Unless my eyes trick me, {{TaskMemoryManager}} first clears {{pageTable}} 
> inside a synchronized block and then does it again right after the block. I 
> think the cleanup outside the block is dead code.
> See 
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L382-L397
>  with the relevant snippet pasted below:
> {code}
>   public long cleanUpAllAllocatedMemory() {
> synchronized (this) {
>   Arrays.fill(pageTable, null);
>   ...
> }
> for (MemoryBlock page : pageTable) {
>   if (page != null) {
> memoryManager.tungstenMemoryAllocator().free(page);
>   }
> }
> Arrays.fill(pageTable, null);
>...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15045) Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable

2016-05-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15045:


Assignee: Apache Spark

> Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable
> -
>
> Key: SPARK-15045
> URL: https://issues.apache.org/jira/browse/SPARK-15045
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Assignee: Apache Spark
>Priority: Trivial
>
> Unless my eyes trick me, {{TaskMemoryManager}} first clears {{pageTable}} 
> inside a synchronized block and then does it again right after the block. I 
> think the cleanup outside the block is dead code.
> See 
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L382-L397
>  with the relevant snippet pasted below:
> {code}
>   public long cleanUpAllAllocatedMemory() {
> synchronized (this) {
>   Arrays.fill(pageTable, null);
>   ...
> }
> for (MemoryBlock page : pageTable) {
>   if (page != null) {
> memoryManager.tungstenMemoryAllocator().free(page);
>   }
> }
> Arrays.fill(pageTable, null);
>...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15051) Aggregator with DataFrame does not allow Alias

2016-05-01 Thread koert kuipers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

koert kuipers updated SPARK-15051:
--
Description: 
this works:
{noformat}
object SimpleSum extends Aggregator[Row, Int, Int] {
  def zero: Int = 0
  def reduce(b: Int, a: Row) = b + a.getInt(1)
  def merge(b1: Int, b2: Int) = b1 + b2
  def finish(b: Int) = b
  def bufferEncoder: Encoder[Int] = Encoders.scalaInt
  def outputEncoder: Encoder[Int] = Encoders.scalaInt
}
val df = List(("a", 1), ("a", 2), ("a", 3)).toDF("k", "v")
df.groupBy("k").agg(SimpleSum.toColumn).show
{noformat}

but it breaks when I try to give the new column a name:
{noformat}
df.groupBy("k").agg(SimpleSum.toColumn as "b").show
{noformat}

the error is:
{noformat}
   org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate 
[k#192], [k#192,(SimpleSum(unknown),mode=Complete,isDistinct=false) AS b#200];
   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
   at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:54)
   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:270)
   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:51)
   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:51)
   at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:54)
   at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:61)
{noformat}

The reason it breaks is that Column.as(alias: String) returns a Column, not a 
TypedColumn, and as a result the method TypedColumn.withInputType does not get 
called.

P.S. The whole TypedColumn.withInputType mechanism actually seems rather fragile 
to me. I wish Aggregators simply kept the input encoder as well, so that the 
whole bit about dynamically trying to insert the Encoder could be removed.

  was:
this works:
{noformat}
object SimpleSum extends Aggregator[Row, Int, Int] {
  def zero: Int = 0
  def reduce(b: Int, a: Row) = b + a.getInt(1)
  def merge(b1: Int, b2: Int) = b1 + b2
  def finish(b: Int) = b
  def bufferEncoder: Encoder[Int] = Encoders.scalaInt
  def outputEncoder: Encoder[Int] = Encoders.scalaInt
}
val df = List(("a", 1), ("a", 2), ("a", 3)).toDF("k", "v")
df.groupBy("k").agg(SimpleSum.toColumn).show
{noformat}

but it breaks when i try to give the new column a name:
{noformat}
df.groupBy("k").agg(SimpleSum.toColumn as "b").show
{noformat}

the error is:
{noformat}
   org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate 
[k#192], [k#192,(SimpleSum(unknown),mode=Complete,isDistinct=false) AS b#200];
   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
   at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:54)
   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:270)
   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:51)
   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:51)
   at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:54)
   at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:61)
{noformat}

the reason it breaks is because Column.as(alias: String) returns a Column not a 
TypedColumn, and as a result the method TypedColumn.withInputType does not get 
called.

P.S. The whole TypedColumn.withInputType seems actually rather fragile to me. I 
wish Aggregators simply also kept the input encoder and that whole bit can be 
removed about dynamically trying to insert it.


> Aggregator with DataFrame does not allow Alias
> --
>
> Key: SPARK-15051
> URL: https://issues.apache.org/jira/browse/SPARK-15051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: Spark 2.0.0-SNAPSHOT
>Reporter: koert kuipers
>
> this works:
> {noformat}
> object SimpleSum extends Aggregator[Row, Int, Int] {
>   def zero: Int = 0
>   def reduce(b: Int, a: Row) = b + a.getInt(1)
>   def merge(b1: Int, b2: Int) = b1 + b2
>   def finish(b: Int) = b
>   def bufferEncoder: Encoder[Int] = Encoders.scalaInt
>   def outputEncoder: Encoder[Int] = Encoders.scalaInt
> }
> val df = List(("a", 1), ("a", 2), ("a", 3)).toDF("k", "v")
> 

[jira] [Created] (SPARK-15051) Aggregator with DataFrame does not allow Alias

2016-05-01 Thread koert kuipers (JIRA)
koert kuipers created SPARK-15051:
-

 Summary: Aggregator with DataFrame does not allow Alias
 Key: SPARK-15051
 URL: https://issues.apache.org/jira/browse/SPARK-15051
 Project: Spark
  Issue Type: Bug
  Components: SQL
 Environment: Spark 2.0.0-SNAPSHOT
Reporter: koert kuipers


this works:
{noformat}
object SimpleSum extends Aggregator[Row, Int, Int] {
  def zero: Int = 0
  def reduce(b: Int, a: Row) = b + a.getInt(1)
  def merge(b1: Int, b2: Int) = b1 + b2
  def finish(b: Int) = b
  def bufferEncoder: Encoder[Int] = Encoders.scalaInt
  def outputEncoder: Encoder[Int] = Encoders.scalaInt
}
val df = List(("a", 1), ("a", 2), ("a", 3)).toDF("k", "v")
df.groupBy("k").agg(SimpleSum.toColumn).show
{noformat}

but it breaks when I try to give the new column a name:
{noformat}
df.groupBy("k").agg(SimpleSum.toColumn as "b").show
{noformat}

the error is:
{noformat}
   org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate 
[k#192], [k#192,(SimpleSum(unknown),mode=Complete,isDistinct=false) AS b#200];
   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
   at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:54)
   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:270)
   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:51)
   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:51)
   at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:54)
   at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:61)
{noformat}

The reason it breaks is that Column.as(alias: String) returns a Column, not a 
TypedColumn, and as a result the method TypedColumn.withInputType does not get 
called.

P.S. The whole TypedColumn.withInputType mechanism actually seems rather fragile 
to me. I wish Aggregators simply kept the input encoder as well, so that the 
whole bit about dynamically trying to insert it could be removed.
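A short sketch of the typing issue described above, reusing SimpleSum and df 
from the snippet (types spelled out for clarity):

{code}
import org.apache.spark.sql.{Column, Row, TypedColumn}

val typed: TypedColumn[Row, Int] = SimpleSum.toColumn   // still carries the input type
val aliased: Column = typed.as("b")   // Column.as(alias: String) returns a plain Column,
                                      // so TypedColumn.withInputType is never applied
df.groupBy("k").agg(aliased).show     // fails to resolve, as in the stack trace above
{code}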



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14993) Inconsistent behavior of partitioning discovery

2016-05-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14993:


Assignee: Apache Spark

> Inconsistent behavior of partitioning discovery
> ---
>
> Key: SPARK-14993
> URL: https://issues.apache.org/jira/browse/SPARK-14993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>Priority: Critical
>
> When we load a dataset, if we set the path to {{/path/a=1}}, we will not take 
> a as the partitioning column. However, if we set the path to 
> {{/path/a=1/file.parquet}}, we take a as the partitioning column and it shows 
> up in the schema. We should make the behaviors of these two cases consistent 
> by not putting a into the schema for the second case.
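A sketch of the two cases described above, using parquet as a concrete format 
and the paths from the description:

{code}
// case 1: directory path -- 'a' is NOT treated as a partitioning column
val df1 = sqlContext.read.parquet("/path/a=1")

// case 2: file path -- 'a' currently IS picked up as a partitioning column
val df2 = sqlContext.read.parquet("/path/a=1/file.parquet")

df1.printSchema()   // no column 'a'
df2.printSchema()   // includes 'a' today; the proposal is to match case 1
{code}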



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14993) Inconsistent behavior of partitioning discovery

2016-05-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266073#comment-15266073
 ] 

Apache Spark commented on SPARK-14993:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/12828

> Inconsistent behavior of partitioning discovery
> ---
>
> Key: SPARK-14993
> URL: https://issues.apache.org/jira/browse/SPARK-14993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> When we load a dataset, if we set the path to {{/path/a=1}}, we will not take 
> a as the partitioning column. However, if we set the path to 
> {{/path/a=1/file.parquet}}, we take a as the partitioning column and it shows 
> up in the schema. We should make the behaviors of these two cases consistent 
> by not putting a into the schema for the second case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14993) Inconsistent behavior of partitioning discovery

2016-05-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14993:


Assignee: (was: Apache Spark)

> Inconsistent behavior of partitioning discovery
> ---
>
> Key: SPARK-14993
> URL: https://issues.apache.org/jira/browse/SPARK-14993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> When we load a dataset, if we set the path to {{/path/a=1}}, we will not take 
> a as the partitioning column. However, if we set the path to 
> {{/path/a=1/file.parquet}}, we take a as the partitioning column and it shows 
> up in the schema. We should make the behaviors of these two cases consistent 
> by not putting a into the schema for the second case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14495) Distinct aggregation cannot be used in the having clause

2016-05-01 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266069#comment-15266069
 ] 

Xin Wu edited comment on SPARK-14495 at 5/2/16 2:21 AM:


I can recreate it on branch-1.6, and another workaround is to use an alias for 
the aggregate expression:
{code}
scala> sqlContext.sql("SELECT date, count(distinct id) as cnt from (select 
'2010-01-01' as date, 1 as id) tmp group by date having cnt > 0").show
+----------+---+
|      date|cnt|
+----------+---+
|2010-01-01|  1|
+----------+---+
{code}




was (Author: xwu0226):
I can recreated it on branch-1.6. and another workaround is using alias for the 
aggregate expression
{code}
scala> sqlContext.sql("SELECT date, count(distinct id) as cnt from (select 
'2010-01-01' as date, 1 as id) tmp group by date having cnt > 0").show
+----------+---+
|      date|cnt|
+----------+---+
|2010-01-01|  1|
+----------+---+
{code}



> Distinct aggregation cannot be used in the having clause
> 
>
> Key: SPARK-14495
> URL: https://issues.apache.org/jira/browse/SPARK-14495
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yin Huai
>
> {code}
> select date, count(distinct id)
> from (select '2010-01-01' as date, 1 as id) tmp
> group by date
> having count(distinct id) > 0;
> org.apache.spark.sql.AnalysisException: resolved attribute(s) gid#558,id#559 
> missing from date#554,id#555 in operator !Expand [List(date#554, null, 0, if 
> ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)], 
> [date#554,id#561,gid#560,if ((gid = 1)) id else null#562];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:183)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:121)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14495) Distinct aggregation cannot be used in the having clause

2016-05-01 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266069#comment-15266069
 ] 

Xin Wu commented on SPARK-14495:


I can recreate it on branch-1.6, and another workaround is to use an alias for 
the aggregate expression:
{code}
scala> sqlContext.sql("SELECT date, count(distinct id) as cnt from (select 
'2010-01-01' as date, 1 as id) tmp group by date having cnt > 0").show
+----------+---+
|      date|cnt|
+----------+---+
|2010-01-01|  1|
+----------+---+
{code}



> Distinct aggregation cannot be used in the having clause
> 
>
> Key: SPARK-14495
> URL: https://issues.apache.org/jira/browse/SPARK-14495
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yin Huai
>
> {code}
> select date, count(distinct id)
> from (select '2010-01-01' as date, 1 as id) tmp
> group by date
> having count(distinct id) > 0;
> org.apache.spark.sql.AnalysisException: resolved attribute(s) gid#558,id#559 
> missing from date#554,id#555 in operator !Expand [List(date#554, null, 0, if 
> ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)], 
> [date#554,id#561,gid#560,if ((gid = 1)) id else null#562];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:183)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:121)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15045) Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable

2016-05-01 Thread Abhinav Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266057#comment-15266057
 ] 

Abhinav Gupta commented on SPARK-15045:
---

I would like to work on this issue. 
Any suggestions on how to proceed?

> Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable
> -
>
> Key: SPARK-15045
> URL: https://issues.apache.org/jira/browse/SPARK-15045
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> Unless my eyes trick me, {{TaskMemoryManager}} first clears {{pageTable}} 
> inside a synchronized block and then does it again right after the block. I 
> think the cleanup outside the block is dead code.
> See 
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L382-L397
>  with the relevant snippet pasted below:
> {code}
>   public long cleanUpAllAllocatedMemory() {
> synchronized (this) {
>   Arrays.fill(pageTable, null);
>   ...
> }
> for (MemoryBlock page : pageTable) {
>   if (page != null) {
> memoryManager.tungstenMemoryAllocator().free(page);
>   }
> }
> Arrays.fill(pageTable, null);
>...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15050) Put CSV options as Python csv function parameters

2016-05-01 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-15050:
---

 Summary: Put CSV options as Python csv function parameters
 Key: SPARK-15050
 URL: https://issues.apache.org/jira/browse/SPARK-15050
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Hyukjin Kwon






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13425) Documentation for CSV datasource options

2016-05-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13425.
-
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 2.0.0

> Documentation for CSV datasource options
> 
>
> Key: SPARK-13425
> URL: https://issues.apache.org/jira/browse/SPARK-13425
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.0.0
>
>
> As mentioned in https://github.com/apache/spark/pull/11262#discussion_r53508815,
> the CSV datasource is added in Spark 2.0.0 and therefore its options might have 
> to be added to the documentation.
> The options can be found 
> [here|https://issues.apache.org/jira/secure/attachment/12779313/Built-in%20CSV%20datasource%20in%20Spark.pdf]
>  in the Parsing Options section.
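For readers looking for those options, a minimal sketch of how CSV options are 
passed through the DataFrameReader; "header" and "inferSchema" are common 
options from the linked document, but treat the exact option set as an 
assumption:

{code}
val df = sqlContext.read
  .format("csv")
  .option("header", "true")        // first line contains column names
  .option("inferSchema", "true")   // infer column types by scanning the data
  .load("people.csv")
{code}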



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-05-01 Thread Niranjan Molkeri` (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266042#comment-15266042
 ] 

Niranjan Molkeri` commented on SPARK-15044:
---

I tried to reproduce the error in Spark 1.5.0 and there is no problem with the 
path there. Maybe we should also try to reproduce it in 2.0. Can someone suggest 
how to proceed with this bug? I will try to solve it.

> spark-sql will throw "input path does not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually
> -
>
> Key: SPARK-15044
> URL: https://issues.apache.org/jira/browse/SPARK-15044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: huangyu
>
> spark-sql will throw an "input path does not exist" exception if it handles a 
> partition that exists in the Hive table but whose path has been removed 
> manually. The situation is as follows:
> 1) Create a table "test": "create table test (n string) partitioned by (p 
> string)"
> 2) Load some data into partition (p='1')
> 3) Remove the path related to partition (p='1') of table test manually: "hadoop 
> fs -rmr /warehouse//test/p=1"
> 4) Run spark-sql: spark-sql -e "select n from test where p='1';"
> Then it throws an exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> ./test/p=1
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
> at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> {code}
> The bug is in Spark 1.6.1; if I use Spark 1.4.0, it is OK.
> I think spark-sql should ignore the path, just like Hive does (and as it did in 
> earlier versions), rather than throw an exception.
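
A compact repro sketch of the steps above (assuming a Hive-enabled spark-shell on 
1.6.x; the data file and warehouse path are illustrative):

{code}
// Steps 1-2: create the partitioned table and load one partition.
sqlContext.sql("CREATE TABLE test (n STRING) PARTITIONED BY (p STRING)")
sqlContext.sql(
  "LOAD DATA LOCAL INPATH '/tmp/data.txt' INTO TABLE test PARTITION (p='1')")

// Step 3 happens outside Spark, e.g.:
//   hadoop fs -rmr /user/hive/warehouse/test/p=1

// Step 4: on 1.6.1 this throws InvalidInputException instead of returning an
// empty result, as earlier versions (and Hive) do.
sqlContext.sql("SELECT n FROM test WHERE p = '1'").show()
{code}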



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14995) Add "since" tag in Roxygen documentation for SparkR API methods

2016-05-01 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266031#comment-15266031
 ] 

Sun Rui commented on SPARK-14995:
-

[~felixcheung]  I think there is no need to add "spark". Just something like
```
#' @note since 2.0.0
```


> Add "since" tag in Roxygen documentation for SparkR API methods
> ---
>
> Key: SPARK-14995
> URL: https://issues.apache.org/jira/browse/SPARK-14995
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Sun Rui
>
> This is a request to add something to the SparkR API like "versionadded" in the 
> PySpark API and "@since" in the Scala/Java API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14995) Add "since" tag in Roxygen documentation for SparkR API methods

2016-05-01 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266031#comment-15266031
 ] 

Sun Rui edited comment on SPARK-14995 at 5/2/16 1:07 AM:
-

[~felixcheung]  I think there is no need to add "spark". Just something like
{panel}
#' @note since 2.0.0
{panel}



was (Author: sunrui):
[~felixcheung]  I think no need to add "spark". Just something like
```
#' @note since 2.0.0
```


> Add "since" tag in Roxygen documentation for SparkR API methods
> ---
>
> Key: SPARK-14995
> URL: https://issues.apache.org/jira/browse/SPARK-14995
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Sun Rui
>
> This is a request to add something to the SparkR API like "versionadded" in the 
> PySpark API and "@since" in the Scala/Java API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14302) Python examples code merge and clean up

2016-05-01 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266029#comment-15266029
 ] 

Saikat Kanjilal commented on SPARK-14302:
-

Works for me, so what else can I help with?

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning that code, be sure not to disturb the existing 
> example on/off blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14864) [MLLIB] Implement Doc2Vec

2016-05-01 Thread Peter Mountanos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266024#comment-15266024
 ] 

Peter Mountanos edited comment on SPARK-14864 at 5/2/16 12:45 AM:
--

I will try to work out this feature if no one else has made any progress.


was (Author: peter.mounta...@nyu.edu):
I will try to work out this issue if no one else has made any progress.

> [MLLIB] Implement Doc2Vec
> -
>
> Key: SPARK-14864
> URL: https://issues.apache.org/jira/browse/SPARK-14864
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Peter Mountanos
>Priority: Minor
>
> It would be useful to implement Doc2Vec, as described in the paper 
> [Distributed Representations of Sentences and 
> Documents|https://cs.stanford.edu/~quocle/paragraph_vector.pdf]. Gensim has 
> an implementation [Deep learning with 
> paragraph2vec|https://radimrehurek.com/gensim/models/doc2vec.html]. 
> Le & Mikolov show that when aggregating Word2Vec vector representations for a 
> paragraph/document, it does not perform well for prediction tasks. Instead, 
> they propose the Paragraph Vector implementation, which provides 
> state-of-the-art results on several text classification and sentiment 
> analysis tasks.
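
For context, a minimal sketch of the averaging baseline that Le & Mikolov compare 
against, built on the existing MLlib Word2Vec (this is not a Doc2Vec 
implementation; the input path and vector size are illustrative):

{code}
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.mllib.linalg.Vectors

// Each document is a sequence of tokens; the path is a placeholder.
val docs = sc.textFile("/data/docs.txt").map(_.split(" ").toSeq)

val dim = 100
val model = new Word2Vec().setVectorSize(dim).fit(docs)
val wordVecs = model.getVectors   // Map[String, Array[Float]]

// Naive document embedding: the mean of the document's known word vectors.
val docVecs = docs.map { words =>
  val known = words.flatMap(wordVecs.get)
  val sum = new Array[Double](dim)
  known.foreach(v => (0 until dim).foreach(i => sum(i) += v(i)))
  Vectors.dense(if (known.isEmpty) sum else sum.map(_ / known.size))
}
{code}

Paragraph Vector would instead learn a dedicated vector per document jointly with 
the word vectors, which is what this issue proposes to add.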



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14864) [MLLIB] Implement Doc2Vec

2016-05-01 Thread Peter Mountanos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266024#comment-15266024
 ] 

Peter Mountanos commented on SPARK-14864:
-

I will try to work out this issue if no one else has made any progress.

> [MLLIB] Implement Doc2Vec
> -
>
> Key: SPARK-14864
> URL: https://issues.apache.org/jira/browse/SPARK-14864
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Peter Mountanos
>Priority: Minor
>
> It would be useful to implement Doc2Vec, as described in the paper 
> [Distributed Representations of Sentences and 
> Documents|https://cs.stanford.edu/~quocle/paragraph_vector.pdf]. Gensim has 
> an implementation [Deep learning with 
> paragraph2vec|https://radimrehurek.com/gensim/models/doc2vec.html]. 
> Le & Mikolov show that when aggregating Word2Vec vector representations for a 
> paragraph/document, it does not perform well for prediction tasks. Instead, 
> they propose the Paragraph Vector implementation, which provides 
> state-of-the-art results on several text classification and sentiment 
> analysis tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14302) Python examples code merge and clean up

2016-05-01 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266006#comment-15266006
 ] 

Xusen Yin commented on SPARK-14302:
---

[~kanjilal] Thanks for working on this. However, I checked the duplicated 
examples again and found that we should not delete all of them, as described 
below:

* python/ml
** None

* Unsure duplications, double check
** dataframe_example.py  --> serves as an example of DataFrame usage.
** kmeans_example.py  --> serves as an application.
** simple_params_example.py  --> serves as an example of params usage.
** simple_text_classification_pipeline.py  --> serves as an application.

* python/mllib
** gaussian_mixture_model.py  --> serves as an application.
** kmeans.py  --> ditto
** logistic_regression.py  --> ditto

* Unsure duplications, double check
** correlations.py  --> ditto
** random_rdd_generation.py  --> ditto
** sampled_rdds.py  --> ditto
** word2vec.py  --> ditto

So I think we can close this JIRA as Won't Fix. What do you think?

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning that code, be sure not to disturb the existing 
> example on/off blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15049) Rename NewAccumulator to AccumulatorV2

2016-05-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15049:


Assignee: Reynold Xin  (was: Apache Spark)

> Rename NewAccumulator to AccumulatorV2
> --
>
> Key: SPARK-15049
> URL: https://issues.apache.org/jira/browse/SPARK-15049
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15049) Rename NewAccumulator to AccumulatorV2

2016-05-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266004#comment-15266004
 ] 

Apache Spark commented on SPARK-15049:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12827

> Rename NewAccumulator to AccumulatorV2
> --
>
> Key: SPARK-15049
> URL: https://issues.apache.org/jira/browse/SPARK-15049
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15049) Rename NewAccumulator to AccumulatorV2

2016-05-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15049:


Assignee: Apache Spark  (was: Reynold Xin)

> Rename NewAccumulator to AccumulatorV2
> --
>
> Key: SPARK-15049
> URL: https://issues.apache.org/jira/browse/SPARK-15049
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15049) Rename NewAccumulator to AccumulatorV2

2016-05-01 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-15049:
---

 Summary: Rename NewAccumulator to AccumulatorV2
 Key: SPARK-15049
 URL: https://issues.apache.org/jira/browse/SPARK-15049
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15047) Cleanup SQLParser

2016-05-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15047:


Assignee: Apache Spark  (was: Herman van Hovell)

> Cleanup SQLParser
> -
>
> Key: SPARK-15047
> URL: https://issues.apache.org/jira/browse/SPARK-15047
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>Priority: Minor
>
> We have made major changes to the SQL parser recently. Some of the code in 
> the parser is currently dead code. We should remove it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15047) Cleanup SQLParser

2016-05-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266001#comment-15266001
 ] 

Apache Spark commented on SPARK-15047:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/12826

> Cleanup SQLParser
> -
>
> Key: SPARK-15047
> URL: https://issues.apache.org/jira/browse/SPARK-15047
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
>
> We have made major changes to the SQL parser recently. Some of the code in 
> the parser is currently dead code. We should remove it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15047) Cleanup SQLParser

2016-05-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15047:


Assignee: Herman van Hovell  (was: Apache Spark)

> Cleanup SQLParser
> -
>
> Key: SPARK-15047
> URL: https://issues.apache.org/jira/browse/SPARK-15047
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
>
> We have made major changes to the SQL parser recently. Some of the code in 
> the parser is currently dead code. We should remove it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7025) Create a Java-friendly input source API

2016-05-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-7025.
--
Resolution: Later

> Create a Java-friendly input source API
> ---
>
> Key: SPARK-7025
> URL: https://issues.apache.org/jira/browse/SPARK-7025
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> The goal of this ticket is to create a simple input source API that we can 
> maintain and support long term.
> Spark currently has two de facto input source APIs:
> 1. RDD
> 2. Hadoop MapReduce InputFormat
> Neither of the above is ideal:
> 1. RDD: It is hard for Java developers to implement RDD, given the implicit 
> class tags. In addition, the RDD API depends on Scala's runtime library, 
> which does not preserve binary compatibility across Scala versions. If a 
> developer chooses Java to implement an input source, it would be great if 
> that input source can be binary compatible in years to come.
> 2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For 
> example, it forces key-value semantics, and does not support running 
> arbitrary code on the driver side (an example of why this is useful is 
> broadcast). In addition, it is somewhat awkward to tell developers that in 
> order to implement an input source for Spark, they should learn the Hadoop 
> MapReduce API first.
> So here's the proposal: an InputSource is described by:
> * an array of InputPartition that specifies the data partitioning
> * a RecordReader that specifies how data on each partition can be read
> This interface would be similar to Hadoop's InputFormat, except that there is 
> no explicit key/value separation.
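
A purely illustrative sketch of the proposed shape (the ticket was closed as 
"Later", so none of these types exist in Spark and all names are hypothetical):

{code}
// Hypothetical interfaces only; not an existing Spark API.
trait InputPartition extends java.io.Serializable

trait RecordReader[T] extends java.io.Closeable {
  def next(): Boolean   // advance to the next record; false at end of input
  def get(): T          // the current record
}

trait InputSource[T] extends java.io.Serializable {
  def partitions(): Array[InputPartition]                       // data partitioning
  def createRecordReader(partition: InputPartition): RecordReader[T]
}
{code}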



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13830) Fetch large directly result from executor is very slow

2016-05-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13830.
-
   Resolution: Fixed
 Assignee: Davies Liu
Fix Version/s: 2.0.0

> Fetch large directly result from executor is very slow
> --
>
> Key: SPARK-13830
> URL: https://issues.apache.org/jira/browse/SPARK-13830
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> Given two tasks with a 100+ MB result each, it takes more than 50 seconds to 
> fetch the results.
> The RPC layer may not be designed to handle large blocks; we should use the 
> block manager for that. Currently the cutoff is based on 
> spark.rpc.message.maxSize, which is usually set very large (> 128 MB) to be 
> safe; that is too large for handling results.
> We also count the time to fetch the direct result (and deserialize it) as 
> scheduler delay, so it also makes sense to only fetch much smaller blocks via 
> DirectResult.
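
A small illustration of the configuration mentioned above (the value is only an 
example; results larger than this limit are the ones that should go through the 
block manager rather than the RPC layer):

{code}
// Sketch only: the RPC message size limit that direct task results are
// currently measured against.
val conf = new org.apache.spark.SparkConf()
  .set("spark.rpc.message.maxSize", "128")   // interpreted in MiB
{code}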



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14060) Move StringToColumn implicit class into SQLImplicits

2016-05-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14060.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Move StringToColumn implicit class into SQLImplicits
> 
>
> Key: SPARK-14060
> URL: https://issues.apache.org/jira/browse/SPARK-14060
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> It was kept in the SQLContext.implicits object for binary backward 
> compatibility in the Spark 1.x series. It makes more sense for this API to 
> be in SQLImplicits, since that is the single class that defines all the SQL 
> implicits.
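
For reference, this is the implicit that enables the $"..." column syntax; a 
small usage sketch (assuming a SparkSession named `spark` and a DataFrame `df` 
with a column "name"):

{code}
import spark.implicits._      // brings StringToColumn (among others) into scope

df.select($"name")            // $"name" is provided by the StringToColumn implicit
  .where($"name".isNotNull)
{code}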



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15043) Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr

2016-05-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15043:

Priority: Critical  (was: Blocker)

> Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr
> -
>
> Key: SPARK-15043
> URL: https://issues.apache.org/jira/browse/SPARK-15043
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Sean Owen
>Priority: Critical
>
> It looks like the {{mllib.stat.JavaStatisticsSuite.testCorr}} test has become 
> flaky:
> https://spark-tests.appspot.com/tests/org.apache.spark.mllib.stat.JavaStatisticsSuite/testCorr
> The first observed failure was in 
> https://spark-tests.appspot.com/builds/spark-master-test-maven-hadoop-2.6/816
> {code}
> java.lang.AssertionError: expected:<0.9986422261219262> but 
> was:<0.9986422261219272>
>   at 
> org.apache.spark.mllib.stat.JavaStatisticsSuite.testCorr(JavaStatisticsSuite.java:75)
> {code}
> I'm going to ignore this test now, but we need to come back and fix it.
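
Judging from the failure above, the test compares doubles exactly; a hedged 
sketch of the usual fix, comparing against a tolerance instead (the helper name 
and tolerance are illustrative, and the values are taken from the assertion 
message):

{code}
// Compare floating-point results up to an absolute tolerance instead of exactly.
def assertApproxEqual(expected: Double, actual: Double, tol: Double = 1e-10): Unit =
  assert(math.abs(expected - actual) <= tol, s"expected $expected, got $actual")

assertApproxEqual(0.9986422261219262, 0.9986422261219272)
{code}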



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15048) when running Thriftserver with yarn on a secure cluster it will pass the wrong keytab location.

2016-05-01 Thread Trystan Leftwich (JIRA)
Trystan Leftwich created SPARK-15048:


 Summary: when running Thriftserver with yarn on a secure cluster 
it will pass the wrong keytab location.
 Key: SPARK-15048
 URL: https://issues.apache.org/jira/browse/SPARK-15048
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Trystan Leftwich


When running hive-thriftserver with YARN on a secure cluster, it will pass the 
wrong keytab location.

{code}
16/05/01 19:33:52 INFO hive.HiveUtils: Initializing HiveMetastoreConnection 
version 1.2.1 using Spark classes.
Exception in thread "main" org.apache.spark.SparkException: Keytab file: 
test.keytab-e3754e07-c798-4e6a-8745-c5f9d3483507 specified in spark.yarn.keytab 
does not exist
at 
org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:111)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:364)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:268)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at 
org.apache.spark.sql.hive.HiveSessionState.metadataHive$lzycompute(HiveSessionState.scala:45)
at 
org.apache.spark.sql.hive.HiveSessionState.metadataHive(HiveSessionState.scala:45)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:60)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:81)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:726)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:122)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/05/01 19:33:52 INFO spark.SparkContext: Invoking stop() from shutdown hook
{code}

Note: You will need the patch from SPARK-15046 before you can encounter this 
bug.

It looks like this specific commit caused this issue, 
https://github.com/apache/spark/commit/8301fadd8d269da11e72870b7a889596e3337839#diff-6fd847124f8eae45ba2de1cf7d6296feL93

Re-adding that one line fixes the bug.
It's possible to "reset" the config before Hive needs it, i.e. by adding code 
similar to the diff below at the following location:
https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLEnv.scala#L57

{code}
diff --git 
a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLEnv.scala
 
b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLEnv.scala
index 665a44e..0e32b87 100644
--- 
a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLEnv.scala
+++ 
b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLEnv.scala
@@ -55,6 +55,12 @@ private[hive] object SparkSQLEnv extends Logging {
   maybeKryoReferenceTracking.getOrElse("false"))

   sparkContext = new SparkContext(sparkConf)
+  if (sparkConf.contains("spark.yarn.principal")) {
+  sparkContext.conf.set("spark.yarn.principal", 
sparkConf.get("spark.yarn.principal"))
+  }
+  if (sparkConf.contains("spark.yarn.keytab")) {
+  sparkContext.conf.set("spark.yarn.keytab", 
sparkConf.get("spark.yarn.keytab"))
+  }
   sqlContext = SparkSession.withHiveSupport(sparkContext).wrapped
   val sessionState = sqlContext.sessionState.asInstanceOf[HiveSessionState]
   sessionState.metadataHive.setOut(new PrintStream(System.out, true, 
"UTF-8"))
{code}





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Commented] (SPARK-14973) The CrossValidator and TrainValidationSplit miss the seed when saving and loading

2016-05-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265958#comment-15265958
 ] 

Apache Spark commented on SPARK-14973:
--

User 'yinxusen' has created a pull request for this issue:
https://github.com/apache/spark/pull/12825

> The CrossValidator and TrainValidationSplit miss the seed when saving and 
> loading
> -
>
> Key: SPARK-14973
> URL: https://issues.apache.org/jira/browse/SPARK-14973
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Xusen Yin
>
> The CrossValidator and TrainValidationSplit miss the seed when saving and 
> loading. We need to fix both the Spark-side code and test suite, plus the 
> PySpark-side code and test suite.
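
A hedged sketch of the round trip that should preserve the seed (the estimator, 
evaluator, and param grid below are only placeholders to make the CrossValidator 
persistable; the save path is illustrative):

{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val grid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.01, 0.1)).build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setSeed(42L)

cv.write.overwrite().save("/tmp/cv-demo")
val loaded = CrossValidator.load("/tmp/cv-demo")
assert(loaded.getSeed == cv.getSeed)   // fails while the seed is dropped on save/load
{code}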



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15047) Cleanup SQLParser

2016-05-01 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-15047:
-

 Summary: Cleanup SQLParser
 Key: SPARK-15047
 URL: https://issues.apache.org/jira/browse/SPARK-15047
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Herman van Hovell
Assignee: Herman van Hovell
Priority: Minor


We have made major changes to the SQL parser recently. Some of the code in the 
parser is currently dead code. We should remove it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15046) When running hive-thriftserver with yarn on a secure cluster the workers fail with java.lang.NumberFormatException

2016-05-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15046:


Assignee: (was: Apache Spark)

> When running hive-thriftserver with yarn on a secure cluster the workers fail 
> with java.lang.NumberFormatException
> --
>
> Key: SPARK-15046
> URL: https://issues.apache.org/jira/browse/SPARK-15046
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Trystan Leftwich
>
> When running hive-thriftserver with yarn on a secure cluster 
> (spark.yarn.principal and spark.yarn.keytab are set) the workers fail with 
> the following error.
> {code}
> 16/04/30 22:40:50 ERROR yarn.ApplicationMaster: Uncaught exception: 
> java.lang.NumberFormatException: For input string: "86400079ms"
>   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Long.parseLong(Long.java:441)
>   at java.lang.Long.parseLong(Long.java:483)
>   at 
> scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
>   at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
>   at 
> org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
>   at 
> org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
>   at scala.Option.map(Option.scala:146)
>   at org.apache.spark.SparkConf.getLong(SparkConf.scala:380)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.getTimeFromNowToRenewal(SparkHadoopUtil.scala:289)
>   at 
> org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.org$apache$spark$deploy$yarn$AMDelegationTokenRenewer$$scheduleRenewal$1(AMDelegationTokenRenewer.scala:89)
>   at 
> org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.scheduleLoginFromKeytab(AMDelegationTokenRenewer.scala:121)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:243)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:723)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:721)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:748)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15046) When running hive-thriftserver with yarn on a secure cluster the workers fail with java.lang.NumberFormatException

2016-05-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15046:


Assignee: Apache Spark

> When running hive-thriftserver with yarn on a secure cluster the workers fail 
> with java.lang.NumberFormatException
> --
>
> Key: SPARK-15046
> URL: https://issues.apache.org/jira/browse/SPARK-15046
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Trystan Leftwich
>Assignee: Apache Spark
>
> When running hive-thriftserver with yarn on a secure cluster 
> (spark.yarn.principal and spark.yarn.keytab are set) the workers fail with 
> the following error.
> {code}
> 16/04/30 22:40:50 ERROR yarn.ApplicationMaster: Uncaught exception: 
> java.lang.NumberFormatException: For input string: "86400079ms"
>   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Long.parseLong(Long.java:441)
>   at java.lang.Long.parseLong(Long.java:483)
>   at 
> scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
>   at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
>   at 
> org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
>   at 
> org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
>   at scala.Option.map(Option.scala:146)
>   at org.apache.spark.SparkConf.getLong(SparkConf.scala:380)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.getTimeFromNowToRenewal(SparkHadoopUtil.scala:289)
>   at 
> org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.org$apache$spark$deploy$yarn$AMDelegationTokenRenewer$$scheduleRenewal$1(AMDelegationTokenRenewer.scala:89)
>   at 
> org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.scheduleLoginFromKeytab(AMDelegationTokenRenewer.scala:121)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:243)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:723)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:721)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:748)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15046) When running hive-thriftserver with yarn on a secure cluster the workers fail with java.lang.NumberFormatException

2016-05-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265902#comment-15265902
 ] 

Apache Spark commented on SPARK-15046:
--

User 'trystanleftwich' has created a pull request for this issue:
https://github.com/apache/spark/pull/12824

> When running hive-thriftserver with yarn on a secure cluster the workers fail 
> with java.lang.NumberFormatException
> --
>
> Key: SPARK-15046
> URL: https://issues.apache.org/jira/browse/SPARK-15046
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Trystan Leftwich
>
> When running hive-thriftserver with yarn on a secure cluster 
> (spark.yarn.principal and spark.yarn.keytab are set) the workers fail with 
> the following error.
> {code}
> 16/04/30 22:40:50 ERROR yarn.ApplicationMaster: Uncaught exception: 
> java.lang.NumberFormatException: For input string: "86400079ms"
>   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Long.parseLong(Long.java:441)
>   at java.lang.Long.parseLong(Long.java:483)
>   at 
> scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
>   at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
>   at 
> org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
>   at 
> org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
>   at scala.Option.map(Option.scala:146)
>   at org.apache.spark.SparkConf.getLong(SparkConf.scala:380)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.getTimeFromNowToRenewal(SparkHadoopUtil.scala:289)
>   at 
> org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.org$apache$spark$deploy$yarn$AMDelegationTokenRenewer$$scheduleRenewal$1(AMDelegationTokenRenewer.scala:89)
>   at 
> org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.scheduleLoginFromKeytab(AMDelegationTokenRenewer.scala:121)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:243)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:723)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:721)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:748)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14931) Mismatched default Param values between pipelines in Spark and PySpark

2016-05-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-14931.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12816
[https://github.com/apache/spark/pull/12816]

> Mismatched default Param values between pipelines in Spark and PySpark
> --
>
> Key: SPARK-14931
> URL: https://issues.apache.org/jira/browse/SPARK-14931
> Project: Spark
>  Issue Type: Bug
>Reporter: Xusen Yin
>Assignee: Xusen Yin
>  Labels: ML, PySpark
> Fix For: 2.0.0
>
>
> Mismatched default values between pipelines in Spark and PySpark lead to 
> different pipelines in PySpark after saving and loading.
> Find generic ways to check JavaParams then fix them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15046) When running hive-thriftserver with yarn on a secure cluster the workers fail with java.lang.NumberFormatException

2016-05-01 Thread Trystan Leftwich (JIRA)
Trystan Leftwich created SPARK-15046:


 Summary: When running hive-thriftserver with yarn on a secure 
cluster the workers fail with java.lang.NumberFormatException
 Key: SPARK-15046
 URL: https://issues.apache.org/jira/browse/SPARK-15046
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Trystan Leftwich


When running hive-thriftserver with yarn on a secure cluster 
(spark.yarn.principal and spark.yarn.keytab are set) the workers fail with the 
following error.

{code}
16/04/30 22:40:50 ERROR yarn.ApplicationMaster: Uncaught exception: 
java.lang.NumberFormatException: For input string: "86400079ms"
at 
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:441)
at java.lang.Long.parseLong(Long.java:483)
at 
scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
at 
org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
at 
org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
at scala.Option.map(Option.scala:146)
at org.apache.spark.SparkConf.getLong(SparkConf.scala:380)
at 
org.apache.spark.deploy.SparkHadoopUtil.getTimeFromNowToRenewal(SparkHadoopUtil.scala:289)
at 
org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.org$apache$spark$deploy$yarn$AMDelegationTokenRenewer$$scheduleRenewal$1(AMDelegationTokenRenewer.scala:89)
at 
org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.scheduleLoginFromKeytab(AMDelegationTokenRenewer.scala:121)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:243)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:723)
at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at 
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:721)
at 
org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:748)
at 
org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
{code}
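
For context, a hedged illustration of why the parse fails: SparkConf.getLong goes 
through String.toLong, which cannot handle a duration suffix such as "ms", while 
the time-aware getter can (the key name below is only an example):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
conf.set("spark.example.renewal.interval", "86400079ms")   // illustrative key

conf.getTimeAsMs("spark.example.renewal.interval")   // ok: 86400079
conf.getLong("spark.example.renewal.interval", 0L)   // throws NumberFormatException
{code}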



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7898) pyspark merges stderr into stdout

2016-05-01 Thread Sam Steingold (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265863#comment-15265863
 ] 

Sam Steingold commented on SPARK-7898:
--

No, this is _NOT_ what I am talking about!
I (the pyspark driver) want to *read* from the {{stdout}} of my subprocess, and 
pyspark messes up that subprocess's {{stdout}} and {{stderr}}.


> pyspark merges stderr into stdout
> -
>
> Key: SPARK-7898
> URL: https://issues.apache.org/jira/browse/SPARK-7898
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.3.0
>Reporter: Sam Steingold
>
> When I type 
> {code}
> hadoop fs -text /foo/bar/baz.bz2 2>err 1>out
> {code}
> I get two non-empty files: {{err}} with 
> {code}
> 2015-05-26 15:33:49,786 INFO  [main] bzip2.Bzip2Factory 
> (Bzip2Factory.java:isNativeBzip2Loaded(70)) - Successfully loaded & 
> initialized native-bzip2 library system-native
> 2015-05-26 15:33:49,789 INFO  [main] compress.CodecPool 
> (CodecPool.java:getDecompressor(179)) - Got brand-new decompressor [.bz2]
> {code}
> and {{out}} with the content of the file (as expected).
> When I call the same command from Python (2.6):
> {code}
> from subprocess import Popen
> with open("out","w") as out:
> with open("err","w") as err:
> p = Popen(['hadoop','fs','-text',"/foo/bar/baz.bz2"],
>   stdin=None,stdout=out,stderr=err)
> print p.wait()
> {code}
> I get the exact same (correct) behavior.
> *However*, when I run the same code under *PySpark* (or using 
> {{spark-submit}}), I get an *empty* {{err}} file and the {{out}} file starts 
> with the log messages above (and then it contains the actual data).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-928) Add support for Unsafe-based serializer in Kryo 2.22

2016-05-01 Thread Sandeep Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265805#comment-15265805
 ] 

Sandeep Singh commented on SPARK-928:
-

[~joshrosen] I've started working on it.

I benchmarked the difference between unsafe Kryo and our current implementation, 
so that we can add a spark.kryo.useUnsafe flag as Matei has mentioned.

{code:title=Without Kryo UnSafe|borderStyle=solid}
Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.11.4
Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz

Serialize and then deserialize:    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
primitive:Long                             1 /    4      11223.1           0.1       1.0X
primitive:Double                           1 /    1      19409.0           0.1       1.7X
Array:Long                                38 /   49        412.4           2.4       0.0X
Array:Double                              25 /   35        631.4           1.6       0.1X
Map of string->Double                   2651 / 2766          5.9         168.6       0.0X
{code}

{code:title=With Kryo UnSafe|borderStyle=solid}
Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.11.4
Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz

Serialize and then deserialize:    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
primitive:Long                             1 /    3      15872.0           0.1       1.0X
primitive:Double                           1 /    1      17769.7           0.1       1.1X
Array:Long                                24 /   42        642.3           1.6       0.0X
Array:Double                              22 /   26        719.4           1.4       0.0X
Map of string->Double                   2560 / 2582          6.1         162.8       0.0X
{code}

You can find the benchmarking code here 
(https://github.com/techaddict/spark/commit/46fa44141c849ca15bbd6136cea2fa52bd927da2);
 it is very rough right now, but I will improve it (and add more benchmarks) 
before creating a PR.
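
For reference, a sketch of what such a flag could switch between, using Kryo's 
unsafe-based output stream (an assumption about Kryo 2.22's io package, not 
Spark's actual implementation; the buffer size and flag wiring are illustrative):

{code}
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.{Output, UnsafeOutput}

val kryo = new Kryo()
val useUnsafe = true   // would be driven by the proposed spark.kryo.useUnsafe setting

// The unsafe-based stream writes primitives via sun.misc.Unsafe, which is
// typically faster than the regular Output.
val out: Output = if (useUnsafe) new UnsafeOutput(4096) else new Output(4096)
kryo.writeClassAndObject(out, Array.fill(1000)(scala.util.Random.nextLong()))
out.close()
{code}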


> Add support for Unsafe-based serializer in Kryo 2.22
> 
>
> Key: SPARK-928
> URL: https://issues.apache.org/jira/browse/SPARK-928
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>  Labels: starter
>
> This can reportedly be quite a bit faster, but it also requires Chill to 
> update its Kryo dependency. Once that happens we should add a 
> spark.kryo.useUnsafe flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14985) Update LinearRegression, LogisticRegression summary internals to handle model copy

2016-05-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14985:


Assignee: Apache Spark

> Update LinearRegression, LogisticRegression summary internals to handle model 
> copy
> --
>
> Key: SPARK-14985
> URL: https://issues.apache.org/jira/browse/SPARK-14985
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> See parent JIRA + the PR for [SPARK-14852] for details.  The summaries should 
> handle creating an internal copy of the model.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14985) Update LinearRegression, LogisticRegression summary internals to handle model copy

2016-05-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265803#comment-15265803
 ] 

Apache Spark commented on SPARK-14985:
--

User 'BenFradet' has created a pull request for this issue:
https://github.com/apache/spark/pull/12823

> Update LinearRegression, LogisticRegression summary internals to handle model 
> copy
> --
>
> Key: SPARK-14985
> URL: https://issues.apache.org/jira/browse/SPARK-14985
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> See parent JIRA + the PR for [SPARK-14852] for details.  The summaries should 
> handle creating an internal copy of the model.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14985) Update LinearRegression, LogisticRegression summary internals to handle model copy

2016-05-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14985:


Assignee: (was: Apache Spark)

> Update LinearRegression, LogisticRegression summary internals to handle model 
> copy
> --
>
> Key: SPARK-14985
> URL: https://issues.apache.org/jira/browse/SPARK-14985
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> See parent JIRA + the PR for [SPARK-14852] for details.  The summaries should 
> handle creating an internal copy of the model.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13289) Word2Vec generate infinite distances when numIterations>5

2016-05-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13289:
--
Assignee: Junyang Shen  (was: Nick Pentreath)

> Word2Vec generate infinite distances when numIterations>5
> -
>
> Key: SPARK-13289
> URL: https://issues.apache.org/jira/browse/SPARK-13289
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0
> Environment: Linux, Scala
>Reporter: Qi Dai
>Assignee: Junyang Shen
>  Labels: features
> Fix For: 2.0.0
>
>
> I recently ran some word2vec experiments on a cluster with 50 executors on 
> a large text dataset, but found that when the number of iterations is larger 
> than 5, the distances between words are all infinite. My code looks like this:
> val text = sc.textFile("/project/NLP/1_biliion_words/train").map(_.split(" ").toSeq)
> import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
> val word2vec = new Word2Vec().setMinCount(25).setVectorSize(96).setNumPartitions(99).setNumIterations(10).setWindowSize(5)
> val model = word2vec.fit(text)
> val synonyms = model.findSynonyms("who", 40)
> for ((synonym, cosineSimilarity) <- synonyms) {
>   println(s"$synonym $cosineSimilarity")
> }
> The results are: 
> to Infinity
> and Infinity
> that Infinity
> with Infinity
> said Infinity
> it Infinity
> by Infinity
> be Infinity
> have Infinity
> he Infinity
> has Infinity
> his Infinity
> an Infinity
> ) Infinity
> not Infinity
> who Infinity
> I Infinity
> had Infinity
> their Infinity
> were Infinity
> they Infinity
> but Infinity
> been Infinity
> I tried many different datasets and different words for finding synonyms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14993) Inconsistent behavior of partitioning discovery

2016-05-01 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265794#comment-15265794
 ] 

Xiao Li commented on SPARK-14993:
-

Doing it now. Thanks!

> Inconsistent behavior of partitioning discovery
> ---
>
> Key: SPARK-14993
> URL: https://issues.apache.org/jira/browse/SPARK-14993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> When we load a dataset, if we set the path to {{/path/a=1}}, we will not take 
> a as the partitioning column. However, if we set the path to 
> {{/path/a=1/file.parquet}}, we take a as the partitioning column and it shows 
> up in the schema. We should make the behaviors of these two cases consistent 
> by not putting a into the schema for the second case.
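
An illustration of the two cases as described (the paths are the examples from 
the description; assumes a SQLContext named `sqlContext`):

{code}
// Case 1: directory path; 'a' is not treated as a partitioning column.
val df1 = sqlContext.read.parquet("/path/a=1")
df1.printSchema()

// Case 2: file path; 'a' currently shows up in the schema as a partition column.
val df2 = sqlContext.read.parquet("/path/a=1/file.parquet")
df2.printSchema()
{code}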



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14505) Creating two SparkContext objects in the same JVM, the first one cannot run any tasks!

2016-05-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14505:
--
  Assignee: The sea
Issue Type: Bug  (was: Improvement)

> Creating two SparkContext objects in the same JVM, the first one cannot run 
> any tasks!
> 
>
> Key: SPARK-14505
> URL: https://issues.apache.org/jira/browse/SPARK-14505
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: The sea
>Assignee: The sea
>Priority: Minor
> Fix For: 2.0.0
>
>
> Execute the code below in the spark shell:
> import org.apache.spark.SparkContext
> val sc = new SparkContext("local", "app")
> sc.range(1, 10).reduce(_ + _)
> The exception is :
> 16/04/09 15:40:01 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.0 
> (TID 3, 192.168.172.131): java.io.IOException: 
> org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of 
> broadcast_1
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
>   at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 
> of broadcast_1
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:137)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:120)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:175)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1219)
>   ... 11 more
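
For anyone hitting this before upgrading: a hedged workaround sketch, assuming the 
usual spark-shell setup where an `sc` already exists. Reuse the active context (or 
stop it explicitly) instead of constructing a second one; `SparkContext.getOrCreate` 
has been available since Spark 1.4, the other names below are illustrative only.

{code}
// Hedged sketch of a workaround: do not construct a second SparkContext next to the
// shell's existing `sc`; either reuse the active one or stop it first.
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the active context if one exists (SparkContext.getOrCreate, Spark 1.4+):
val ctx = SparkContext.getOrCreate(new SparkConf().setAppName("app").setMaster("local"))
ctx.range(1, 10).reduce(_ + _)

// ...or, if a fresh context is really required, stop the old one first:
// sc.stop()
// val ctx2 = new SparkContext("local", "app")
{code}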



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14505) Creating two SparkContext objects in the same JVM, the first one cannot run any tasks!

2016-05-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14505.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12273
[https://github.com/apache/spark/pull/12273]

> Creating two SparkContext objects in the same JVM, the first one cannot run 
> any tasks!
> 
>
> Key: SPARK-14505
> URL: https://issues.apache.org/jira/browse/SPARK-14505
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: The sea
>Priority: Minor
> Fix For: 2.0.0
>
>
> Execute code below in spark shell:
> import org.apache.spark.SparkContext
> val sc = new SparkContext("local", "app")
> sc.range(1, 10).reduce(_ + _)
> The exception is :
> 16/04/09 15:40:01 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.0 
> (TID 3, 192.168.172.131): java.io.IOException: 
> org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of 
> broadcast_1
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
>   at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 
> of broadcast_1
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:137)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:120)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:175)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1219)
>   ... 11 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-1239) Improve fetching of map output statuses

2016-05-01 Thread Chris Bannister (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Bannister updated SPARK-1239:
---
Comment: was deleted

(was: Im seeing frequent job failures where the executors are unable to 
communicate with driver using MapStatus fetches, is there any plan to backport 
this to the 1.6 branch?)

> Improve fetching of map output statuses
> ---
>
> Key: SPARK-1239
> URL: https://issues.apache.org/jira/browse/SPARK-1239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Patrick Wendell
>Assignee: Thomas Graves
>
> Instead we should modify the way we fetch map output statuses to take both a 
> mapper and a reducer - or we should just piggyback the statuses on each task. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1239) Improve fetching of map output statuses

2016-05-01 Thread Chris Bannister (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265756#comment-15265756
 ] 

Chris Bannister commented on SPARK-1239:


I'm seeing frequent job failures where the executors are unable to communicate 
with the driver using MapStatus fetches; is there any plan to backport this to the 
1.6 branch?

> Improve fetching of map output statuses
> ---
>
> Key: SPARK-1239
> URL: https://issues.apache.org/jira/browse/SPARK-1239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Patrick Wendell
>Assignee: Thomas Graves
>
> Instead we should modify the way we fetch map output statuses to take both a 
> mapper and a reducer - or we should just piggyback the statuses on each task. 
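
To make the proposal concrete, here is a minimal, hypothetical Scala sketch of a 
narrower fetch interface; the type and method names are illustrative stand-ins, not 
Spark's actual internal API.

{code}
// Hypothetical sketch only: fetch just the statuses one reducer (or one map/reduce
// pair) needs, instead of shipping the full map-output table to every executor.
final case class BlockManagerId(host: String, port: Int)
final case class MapStatus(location: BlockManagerId, sizeForReducer: Long)

trait MapOutputFetcher {
  /** Only the entries a single reducer needs, rather than the whole table. */
  def statusesFor(shuffleId: Int, reduceId: Int): Seq[MapStatus]

  /** Finer-grained still: the status for one (map, reduce) pair, as the ticket suggests. */
  def statusFor(shuffleId: Int, mapId: Int, reduceId: Int): MapStatus
}
{code}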



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14785) Support correlated scalar subquery

2016-05-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265731#comment-15265731
 ] 

Apache Spark commented on SPARK-14785:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/12822

> Support correlated scalar subquery
> --
>
> Key: SPARK-14785
> URL: https://issues.apache.org/jira/browse/SPARK-14785
> Project: Spark
>  Issue Type: New Feature
>Reporter: Davies Liu
>
> For example:
> {code}
> SELECT a from t where b > (select avg(c) from t2 where t.id = t2.id)
> {code}
> it could be rewritten as 
> {code}
> SELECT a FROM t JOIN (SELECT id, AVG(c) as avg_c FROM t2 GROUP by id) t3 ON 
> t3.id = t.id where b > avg_c
> {code}
> TPCDS Q92, Q81, and Q6 require this
> Update: TPCDS Q1 and Q30 also require correlated scalar subquery support.
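
The same rewrite expressed in the DataFrame API, as a hedged sketch: it assumes 
tables `t` and `t2` are registered and uses the 2.0-era SparkSession entry point; 
the inner join mirrors the SQL rewrite above.

{code}
// Hedged sketch: the correlated scalar subquery as a grouped aggregate joined back on id.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

val spark = SparkSession.builder().appName("scalar-subquery-rewrite").getOrCreate()
val t  = spark.table("t")
val t2 = spark.table("t2")

// (select avg(c) from t2 where t.id = t2.id)  ->  pre-aggregate t2, then join on id
val avgPerId  = t2.groupBy("id").agg(avg("c").as("avg_c"))
val rewritten = t.join(avgPerId, "id").where(col("b") > col("avg_c")).select("a")
{code}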



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15045) Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable

2016-05-01 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-15045:
---

 Summary: Remove dead code in 
TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable
 Key: SPARK-15045
 URL: https://issues.apache.org/jira/browse/SPARK-15045
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Jacek Laskowski
Priority: Trivial


Unless my eyes trick me, {{TaskMemoryManager}} first clears {{pageTable}} in a 
synchronized block and then does the same again right after the block. I think the 
cleanup outside the synchronized block is dead code.

See 
https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L382-L397
 with the relevant snippet pasted below:

{code}
  public long cleanUpAllAllocatedMemory() {
    synchronized (this) {
      // pageTable is already nulled out here, inside the synchronized block...
      Arrays.fill(pageTable, null);
      ...
    }

    // ...so unless the elided code re-populates pageTable, every entry below is
    // already null: the loop frees nothing and the second fill is redundant.
    for (MemoryBlock page : pageTable) {
      if (page != null) {
        memoryManager.tungstenMemoryAllocator().free(page);
      }
    }
    Arrays.fill(pageTable, null);
    ...
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15043) Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr

2016-05-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15043:


Assignee: Apache Spark  (was: Sean Owen)

> Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr
> -
>
> Key: SPARK-15043
> URL: https://issues.apache.org/jira/browse/SPARK-15043
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Blocker
>
> It looks like the {{mllib.stat.JavaStatisticsSuite.testCorr}} test has become 
> flaky:
> https://spark-tests.appspot.com/tests/org.apache.spark.mllib.stat.JavaStatisticsSuite/testCorr
> The first observed failure was in 
> https://spark-tests.appspot.com/builds/spark-master-test-maven-hadoop-2.6/816
> {code}
> java.lang.AssertionError: expected:<0.9986422261219262> but 
> was:<0.9986422261219272>
>   at 
> org.apache.spark.mllib.stat.JavaStatisticsSuite.testCorr(JavaStatisticsSuite.java:75)
> {code}
> I'm going to ignore this test now, but we need to come back and fix it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15043) Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr

2016-05-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265681#comment-15265681
 ] 

Apache Spark commented on SPARK-15043:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/12821

> Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr
> -
>
> Key: SPARK-15043
> URL: https://issues.apache.org/jira/browse/SPARK-15043
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Sean Owen
>Priority: Blocker
>
> It looks like the {{mllib.stat.JavaStatisticsSuite.testCorr}} test has become 
> flaky:
> https://spark-tests.appspot.com/tests/org.apache.spark.mllib.stat.JavaStatisticsSuite/testCorr
> The first observed failure was in 
> https://spark-tests.appspot.com/builds/spark-master-test-maven-hadoop-2.6/816
> {code}
> java.lang.AssertionError: expected:<0.9986422261219262> but 
> was:<0.9986422261219272>
>   at 
> org.apache.spark.mllib.stat.JavaStatisticsSuite.testCorr(JavaStatisticsSuite.java:75)
> {code}
> I'm going to ignore this test now, but we need to come back and fix it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15043) Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr

2016-05-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15043:


Assignee: Sean Owen  (was: Apache Spark)

> Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr
> -
>
> Key: SPARK-15043
> URL: https://issues.apache.org/jira/browse/SPARK-15043
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Sean Owen
>Priority: Blocker
>
> It looks like the {{mllib.stat.JavaStatisticsSuite.testCorr}} test has become 
> flaky:
> https://spark-tests.appspot.com/tests/org.apache.spark.mllib.stat.JavaStatisticsSuite/testCorr
> The first observed failure was in 
> https://spark-tests.appspot.com/builds/spark-master-test-maven-hadoop-2.6/816
> {code}
> java.lang.AssertionError: expected:<0.9986422261219262> but 
> was:<0.9986422261219272>
>   at 
> org.apache.spark.mllib.stat.JavaStatisticsSuite.testCorr(JavaStatisticsSuite.java:75)
> {code}
> I'm going to ignore this test now, but we need to come back and fix it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15043) Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr

2016-05-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-15043:
-

Assignee: Sean Owen

> Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr
> -
>
> Key: SPARK-15043
> URL: https://issues.apache.org/jira/browse/SPARK-15043
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Sean Owen
>Priority: Blocker
>
> It looks like the {{mllib.stat.JavaStatisticsSuite.testCorr}} test has become 
> flaky:
> https://spark-tests.appspot.com/tests/org.apache.spark.mllib.stat.JavaStatisticsSuite/testCorr
> The first observed failure was in 
> https://spark-tests.appspot.com/builds/spark-master-test-maven-hadoop-2.6/816
> {code}
> java.lang.AssertionError: expected:<0.9986422261219262> but 
> was:<0.9986422261219272>
>   at 
> org.apache.spark.mllib.stat.JavaStatisticsSuite.testCorr(JavaStatisticsSuite.java:75)
> {code}
> I'm going to ignore this test now, but we need to come back and fix it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-05-01 Thread huangyu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangyu updated SPARK-15044:

Description: 
spark-sql will throw an "input path does not exist" exception if it handles a 
partition that exists in the Hive table but whose path has been removed manually. 
The situation is as follows:

1) Create a table "test": "create table test (n string) partitioned by (p 
string)"
2) Load some data into partition(p='1')
3) Remove the path related to partition(p='1') of table test manually: "hadoop 
fs -rmr /warehouse//test/p=1"
4) Run spark-sql: spark-sql -e "select n from test where p='1';"

Then it throws an exception:

{code}
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
./test/p=1
at 
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
{code}
The bug is in Spark 1.6.1; with Spark 1.4.0 it is OK.
I think spark-sql should ignore the missing path, as Hive does (and as Spark did in 
earlier versions), rather than throw an exception.

  was:
spark-sql will throw "input path not exist" exception if it handles a partition 
which exists in hive table, but the path is removed manually.The situation is 
as follows:

1) Create a table "test". "create table test (n string) partitioned by (p 
string)"
2) Load some data into partition(p='1')
3)Remove the path related to partition(p='1') of table test manually. "hadoop 
fs -rmr /warehouse//test/p=1"
4)Run spark sql, spark-sql -e "select n from test where p='1';"

Then it throws exception:

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
./test/p=1
at 
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at 

[jira] [Assigned] (SPARK-14781) Support subquery in nested predicates

2016-05-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14781:


Assignee: Apache Spark  (was: Davies Liu)

> Support subquery in nested predicates
> -
>
> Key: SPARK-14781
> URL: https://issues.apache.org/jira/browse/SPARK-14781
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> Right now, we do not support nested IN/EXISTS subqueries, for example 
> EXISTS( x1) OR EXISTS( x2)
> In order to do that, we could use an internal-only join type, SemiPlus, which 
> would output every row from the left side plus an additional column holding the 
> result of the join condition. Then we could replace the EXISTS() or IN() with 
> that result column.
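
A hedged DataFrame sketch of the idea (not the actual planner change): keep every 
left-side row and carry a boolean column recording whether each semi-join condition 
matched, then filter on the disjunction. Table and column names (`t`, `x1`, `x2`, 
`key`) are illustrative only.

{code}
// Hedged sketch of a SemiPlus-style rewrite for EXISTS(x1) OR EXISTS(x2).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lit}

val spark = SparkSession.builder().appName("semiplus-sketch").getOrCreate()
val t  = spark.table("t")
val x1 = spark.table("x1").select(col("key")).distinct().withColumn("exists_x1", lit(true))
val x2 = spark.table("x2").select(col("key")).distinct().withColumn("exists_x2", lit(true))

// Every row of t survives the outer joins; the flag columns record whether a match existed.
val result = t
  .join(x1, Seq("key"), "left_outer")
  .join(x2, Seq("key"), "left_outer")
  .where(coalesce(col("exists_x1"), lit(false)) || coalesce(col("exists_x2"), lit(false)))
  .drop("exists_x1")
  .drop("exists_x2")
{code}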



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14781) Support subquery in nested predicates

2016-05-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265658#comment-15265658
 ] 

Apache Spark commented on SPARK-14781:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/12820

> Support subquery in nested predicates
> -
>
> Key: SPARK-14781
> URL: https://issues.apache.org/jira/browse/SPARK-14781
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Right now, we do not support nested IN/EXISTS subqueries, for example 
> EXISTS( x1) OR EXISTS( x2)
> In order to do that, we could use an internal-only join type, SemiPlus, which 
> would output every row from the left side plus an additional column holding the 
> result of the join condition. Then we could replace the EXISTS() or IN() with 
> that result column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14781) Support subquery in nested predicates

2016-05-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14781:


Assignee: Davies Liu  (was: Apache Spark)

> Support subquery in nested predicates
> -
>
> Key: SPARK-14781
> URL: https://issues.apache.org/jira/browse/SPARK-14781
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Right now, we do not support nested IN/EXISTS subqueries, for example 
> EXISTS( x1) OR EXISTS( x2)
> In order to do that, we could use an internal-only join type, SemiPlus, which 
> would output every row from the left side plus an additional column holding the 
> result of the join condition. Then we could replace the EXISTS() or IN() with 
> that result column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-05-01 Thread huangyu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangyu updated SPARK-15044:

Description: 
spark-sql will throw an "input path does not exist" exception if it handles a 
partition that exists in the Hive table but whose path has been removed manually. 
The situation is as follows:

1) Create a table "test": "create table test (n string) partitioned by (p 
string)"
2) Load some data into partition(p='1')
3) Remove the path related to partition(p='1') of table test manually: "hadoop 
fs -rmr /warehouse//test/p=1"
4) Run spark-sql: spark-sql -e "select n from test where p='1';"

Then it throws an exception:

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
./test/p=1
at 
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)

The bug is in Spark 1.6.1; with Spark 1.4.0 it is OK.
I think spark-sql should ignore the missing path, as Hive does (and as Spark did in 
earlier versions), rather than throw an exception.

  was:
spark-sql will throw "input path not exist" exception if it handles a partition 
which exists in hive table, but the path is removed manually.The situation is 
as follows:

1) Create a table "test". "create table test (n string) partitioned by (p 
string)"
2) Load some data into partition(p='1')
3)Remove the path related to partition(p='1') of table test manually. "hadoop 
fs -rmr /warehouse//test/p=1"
4)Run spark sql, spark-sql -e "select n from test where p='1';"

Then it throws exception:

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
./test/p=1
at 
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at 

[jira] [Created] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-05-01 Thread huangyu (JIRA)
huangyu created SPARK-15044:
---

 Summary: spark-sql will throw "input path does not exist" 
exception if it handles a partition which exists in hive table, but the path is 
removed manually
 Key: SPARK-15044
 URL: https://issues.apache.org/jira/browse/SPARK-15044
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.1
Reporter: huangyu


spark-sql will throw an "input path does not exist" exception if it handles a 
partition that exists in the Hive table but whose path has been removed manually. 
The situation is as follows:

1) Create a table "test": "create table test (n string) partitioned by (p 
string)"
2) Load some data into partition(p='1')
3) Remove the path related to partition(p='1') of table test manually: "hadoop 
fs -rmr /warehouse//test/p=1"
4) Run spark-sql: spark-sql -e "select n from test where p='1';"

Then it throws an exception:

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
./test/p=1
at 
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)

This happens in Spark 1.6.1; with Spark 1.4.0 it is OK.
I think spark-sql should ignore the missing path, as Hive does (and as Spark did in 
earlier versions), rather than throw an exception.
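
As a possible workaround until this is sorted out (hedged: it assumes the 
spark.sql.hive.verifyPartitionPath flag is available in your build), Spark can be 
asked to verify partition paths and skip the ones missing from the filesystem. A 
minimal sketch for a 1.6-style HiveContext setup:

{code}
// Hedged workaround sketch for Spark 1.6: skip partition directories that no longer exist.
// Assumes the spark.sql.hive.verifyPartitionPath flag is available in your build.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("missing-partition-workaround"))
val sqlContext = new HiveContext(sc)
sqlContext.setConf("spark.sql.hive.verifyPartitionPath", "true")
sqlContext.sql("select n from test where p='1'").show()
{code}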



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15043) Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr

2016-05-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265655#comment-15265655
 ] 

Sean Owen commented on SPARK-15043:
---

I'll take a look and try to fix it. I didn't see that one while testing, but it 
might be some odd order of evaluation in the RDD that makes this tiny difference.
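
One common way to make such a comparison robust is to assert within a tolerance 
rather than on exact equality; a minimal Scala sketch is below (the actual suite is 
Java/JUnit, where assertEquals accepts a delta for doubles).

{code}
// Minimal sketch: compare floating-point results within a tolerance, not exactly.
def assertApproxEquals(expected: Double, actual: Double, eps: Double = 1e-10): Unit =
  assert(math.abs(expected - actual) <= eps,
    s"expected $expected but was $actual (tolerance $eps)")

assertApproxEquals(0.9986422261219262, 0.9986422261219272)
{code}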

> Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr
> -
>
> Key: SPARK-15043
> URL: https://issues.apache.org/jira/browse/SPARK-15043
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> It looks like the {{mllib.stat.JavaStatisticsSuite.testCorr}} test has become 
> flaky:
> https://spark-tests.appspot.com/tests/org.apache.spark.mllib.stat.JavaStatisticsSuite/testCorr
> The first observed failure was in 
> https://spark-tests.appspot.com/builds/spark-master-test-maven-hadoop-2.6/816
> {code}
> java.lang.AssertionError: expected:<0.9986422261219262> but 
> was:<0.9986422261219272>
>   at 
> org.apache.spark.mllib.stat.JavaStatisticsSuite.testCorr(JavaStatisticsSuite.java:75)
> {code}
> I'm going to ignore this test now, but we need to come back and fix it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14781) Support subquery in nested predicates

2016-05-01 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-14781:
--

Assignee: Davies Liu

> Support subquery in nested predicates
> -
>
> Key: SPARK-14781
> URL: https://issues.apache.org/jira/browse/SPARK-14781
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Right now, we do not support nested IN/EXISTS subqueries, for example 
> EXISTS( x1) OR EXISTS( x2)
> In order to do that, we could use an internal-only join type, SemiPlus, which 
> would output every row from the left side plus an additional column holding the 
> result of the join condition. Then we could replace the EXISTS() or IN() with 
> that result column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14077) Support weighted instances in naive Bayes

2016-05-01 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265652#comment-15265652
 ] 

zhengruifeng commented on SPARK-14077:
--

OK, I will give it a try

> Support weighted instances in naive Bayes
> -
>
> Key: SPARK-14077
> URL: https://issues.apache.org/jira/browse/SPARK-14077
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>  Labels: naive-bayes
>
> In naive Bayes, we expect inputs to be individual observations. In practice, 
> people may have a frequency table instead. It is useful for us to support 
> instance weights to handle this case.
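
A hedged sketch of the use case: a frequency table whose `count` column would serve 
as the instance weight once naive Bayes supports one. Until then the table has to be 
expanded row by row, which is exactly the inefficiency the ticket wants to avoid. 
Column names here are illustrative only.

{code}
// Hedged sketch: a frequency table vs. the expanded per-observation form.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("nb-weights-sketch").getOrCreate()
import spark.implicits._

// (label, token, how many times this observation occurred)
val freqTable = Seq((0.0, "spam_word", 120L), (1.0, "ham_word", 980L))
  .toDF("label", "token", "count")

// Without weight support, the table must be blown up into individual observations.
val expanded = freqTable.as[(Double, String, Long)]
  .flatMap { case (label, token, count) => Seq.fill(count.toInt)((label, token)) }
  .toDF("label", "token")
{code}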



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14422) Improve handling of optional configs in SQLConf

2016-05-01 Thread Sandeep Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265631#comment-15265631
 ] 

Sandeep Singh commented on SPARK-14422:
---

:) Cool. For this particular one, do you already have anything in mind?

> Improve handling of optional configs in SQLConf
> ---
>
> Key: SPARK-14422
> URL: https://issues.apache.org/jira/browse/SPARK-14422
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> As Michael showed here: 
> https://github.com/apache/spark/pull/12119/files/69aa1a005cc7003ab62d6dfcdef42181b053eaed#r58634150
> Handling of optional configs in SQLConf is a little sub-optimal right now. We 
> should clean that up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org