[jira] [Updated] (SPARK-10771) Implement the shuffle encryption with AES-CTR crypto using JCE key provider.

2015-09-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10771:
--
Priority: Minor  (was: Major)

> Implement the shuffle encryption with AES-CTR crypto using JCE key provider.
> 
>
> Key: SPARK-10771
> URL: https://issues.apache.org/jira/browse/SPARK-10771
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Reporter: Ferdinand Xu
>Priority: Minor
>
> We will use the credentials stored in the user group information (UGI) to
> encrypt/decrypt shuffle data, and we will use the JCE key provider to
> implement the AES-CTR crypto.
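
As an illustration of the proposed mechanics, here is a minimal, hedged Scala sketch of AES-CTR through the JCE provider. The key bytes below are random stand-ins for the secret that would be pulled from the UGI credentials; the object name and stream wiring are illustrative, not Spark's actual configuration.

{code}
import java.io.ByteArrayOutputStream
import java.security.SecureRandom
import javax.crypto.{Cipher, CipherOutputStream}
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

object ShuffleCryptoSketch {
  def main(args: Array[String]): Unit = {
    val random   = new SecureRandom()
    val keyBytes = new Array[Byte](16); random.nextBytes(keyBytes) // stand-in for the UGI-provided key
    val iv       = new Array[Byte](16); random.nextBytes(iv)       // CTR counter block / IV

    val key     = new SecretKeySpec(keyBytes, "AES")
    val encrypt = Cipher.getInstance("AES/CTR/NoPadding")
    encrypt.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv))

    // Encrypt shuffle bytes as they are written to the underlying stream.
    val sink = new ByteArrayOutputStream()
    val out  = new CipherOutputStream(sink, encrypt)
    out.write("shuffle block bytes".getBytes("UTF-8"))
    out.close()

    // Decryption re-initializes the same transformation with the same key and IV.
    val decrypt = Cipher.getInstance("AES/CTR/NoPadding")
    decrypt.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv))
    println(new String(decrypt.doFinal(sink.toByteArray), "UTF-8"))
  }
}
{code}

Since CTR turns AES into a stream cipher, no padding is needed and the ciphertext has the same length as the shuffle bytes.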






[jira] [Resolved] (SPARK-10769) Fix o.a.s.streaming.CheckpointSuite.maintains rate controller

2015-09-23 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-10769.
---
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

> Fix o.a.s.streaming.CheckpointSuite.maintains rate controller
> -
>
> Key: SPARK-10769
> URL: https://issues.apache.org/jira/browse/SPARK-10769
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>  Labels: flaky-test
> Fix For: 1.6.0, 1.5.1
>
>
> Fixed the following failure in 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1787/testReport/junit/org.apache.spark.streaming/CheckpointSuite/recovery_maintains_rate_controller/
> {code}
> sbt.ForkMain$ForkError: The code passed to eventually never returned 
> normally. Attempted 660 times over 10.4439201 seconds. Last failure 
> message: 9223372036854775807 did not equal 200.
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:336)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
>   at 
> org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply$mcV$sp(CheckpointSuite.scala:413)
>   at 
> org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply(CheckpointSuite.scala:396)
>   at 
> org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply(CheckpointSuite.scala:396)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
> {code}






[jira] [Updated] (SPARK-10769) Fix o.a.s.streaming.CheckpointSuite.maintains rate controller

2015-09-23 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-10769:
--
Assignee: Shixiong Zhu

> Fix o.a.s.streaming.CheckpointSuite.maintains rate controller
> -
>
> Key: SPARK-10769
> URL: https://issues.apache.org/jira/browse/SPARK-10769
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>  Labels: flaky-test
> Fix For: 1.6.0, 1.5.1
>
>
> Fixed the following failure in 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1787/testReport/junit/org.apache.spark.streaming/CheckpointSuite/recovery_maintains_rate_controller/
> {code}
> sbt.ForkMain$ForkError: The code passed to eventually never returned 
> normally. Attempted 660 times over 10.4439201 seconds. Last failure 
> message: 9223372036854775807 did not equal 200.
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:336)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
>   at 
> org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply$mcV$sp(CheckpointSuite.scala:413)
>   at 
> org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply(CheckpointSuite.scala:396)
>   at 
> org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply(CheckpointSuite.scala:396)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
> {code}






[jira] [Comment Edited] (SPARK-9798) CrossValidatorModel Documentation Improvements

2015-09-23 Thread rerngvit yanggratoke (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14904075#comment-14904075
 ] 

rerngvit yanggratoke edited comment on SPARK-9798 at 9/23/15 8:15 AM:
--

Ok. I will work on this task then.
I am new here and would like to give it a try.


was (Author: rerngvit):
Ok, may I be assigned to this task then?
I am new here and would like to give it a try.

> CrossValidatorModel Documentation Improvements
> --
>
> Key: SPARK-9798
> URL: https://issues.apache.org/jira/browse/SPARK-9798
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>Priority: Minor
>  Labels: starter
>
> CrossValidatorModel's avgMetrics and bestModel need documentation.






[jira] [Commented] (SPARK-2737) ClassCastExceptions when collect()ing JavaRDDs' underlying Scala RDDs

2015-09-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14904141#comment-14904141
 ] 

Sean Owen commented on SPARK-2737:
--

[~glenn.stryc...@gmail.com] you can use JIRA to link issues if you're pretty 
sure they're related. It's more visible than in a comment.

> ClassCastExceptions when collect()ing JavaRDDs' underlying Scala RDDs
> -
>
> Key: SPARK-2737
> URL: https://issues.apache.org/jira/browse/SPARK-2737
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 0.8.0, 0.9.0, 1.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.1.0
>
>
> The Java API's use of fake ClassTags doesn't seem to cause any problems for 
> Java users, but it can lead to issues when passing JavaRDDs' underlying RDDs 
> to Scala code (e.g. in the MLlib Java API wrapper code).  If we call 
> {{collect()}} on a Scala RDD with an incorrect ClassTag, this causes 
> ClassCastExceptions when we try to allocate an array of the wrong type (for 
> example, see SPARK-2197).
> There are a few possible fixes here.  An API-breaking fix would be to 
> completely remove the fake ClassTags and require Java API users to pass 
> {{java.lang.Class}} instances to all {{parallelize()}} calls and add 
> {{returnClass}} fields to all {{Function}} implementations.  This would be 
> extremely verbose.
> Instead, I propose that we add internal APIs to "repair" a Scala RDD with an 
> incorrect ClassTag by wrapping it and overriding its ClassTag.  This should 
> be okay for cases where the Scala code that calls {{collect()}} knows what 
> type of array should be allocated, which is the case in the MLlib wrappers.
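
As a rough sketch of the "wrap and override the ClassTag" idea described above (hedged: RetaggedRDD is a hypothetical name, not necessarily the internal API that was ultimately added):

{code}
import scala.reflect.ClassTag

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Wrap an RDD that was created with a fake ClassTag and re-expose it with the
// correct element ClassTag, so collect() allocates an array of the right type.
class RetaggedRDD[T: ClassTag](prev: RDD[T]) extends RDD[T](prev) {
  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    prev.iterator(split, context)
  override protected def getPartitions: Array[Partition] = prev.partitions
}

// Usage sketch (MyType is hypothetical):
//   val repaired = new RetaggedRDD(javaRdd.rdd)(ClassTag(classOf[MyType]))
//   repaired.collect()  // allocates an array of MyType instead of Object
{code}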






[jira] [Resolved] (SPARK-10224) BlockGenerator may lose data in the last block

2015-09-23 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-10224.
---
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

> BlockGenerator may lose data in the last block
> --
>
> Key: SPARK-10224
> URL: https://issues.apache.org/jira/browse/SPARK-10224
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
>
> There is a race condition in BlockGenerator that can lose data in the last
> block. See my PR for details.






[jira] [Comment Edited] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-23 Thread Ian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905327#comment-14905327
 ] 

Ian edited comment on SPARK-10741 at 9/23/15 9:36 PM:
--

Yes, going through all rules when resolving Sort on Aggregate is a correct 
approach.

The main problem appears to be that the execute call at 
https://github.com/apache/spark/blob/v1.5.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L571
resolves to different attribute ids, causing confusion at 
https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L592-L611.

Just so I understand a bit more: the second approach you are proposing removes 
the confusion by changing how ids are resolved in Analyzer.scala#L571, right?





was (Author: ianlcsd):

Yes, going through all rules when resolve Sort on Aggregate is a correct 
approach.

The main problem appeared that the execute call at 
(hhttps://github.com/apache/spark/blob/v1.5.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L571)
 is resolving to different attribute ids, and causing confusion at  
https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L592-L611.

just for me to understand a bit more:
the second approach you are proposing is to remove the confusion by changing 
how ids are resolved in Analyzer.scala#L571, right? 




> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
>   def testParquetOrderBy() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedOrderBy =
>   """ SELECT c1, avg ( c2 ) c_avg
> | FROM test
> | GROUP BY c1
> | ORDER BY avg ( c2 )""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedOrderBy).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
> from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) 
> AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> {code}






[jira] [Commented] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-23 Thread Ian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905327#comment-14905327
 ] 

Ian commented on SPARK-10741:
-


Yes, going through all rules when resolving Sort on Aggregate is a correct 
approach.

The main problem appears to be that the execute call at 
https://github.com/apache/spark/blob/v1.5.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L571
resolves to different attribute ids, causing confusion at 
https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L592-L611.

Just so I understand a bit more: the second approach you are proposing removes 
the confusion by changing how ids are resolved in Analyzer.scala#L571, right?




> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
>   def testParquetOrderBy() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedOrderBy =
>   """ SELECT c1, avg ( c2 ) c_avg
> | FROM test
> | GROUP BY c1
> | ORDER BY avg ( c2 )""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedOrderBy).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
> from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) 
> AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> {code}






[jira] [Created] (SPARK-10784) Flaky Streaming ML test umbrella

2015-09-23 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-10784:
-

 Summary: Flaky Streaming ML test umbrella
 Key: SPARK-10784
 URL: https://issues.apache.org/jira/browse/SPARK-10784
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib, Streaming
Reporter: Joseph K. Bradley


This is an umbrella for collecting reports of flaky Streaming ML tests in Scala 
or Python. To report a failure, please first check the linked issues for 
duplicates. If it is a new failure, please create a JIRA and link or comment 
about it here.






[jira] [Updated] (SPARK-10086) Flaky StreamingKMeans test in PySpark

2015-09-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10086:
--
Description: 
Here's a report on investigating test failures in StreamingKMeans in PySpark. 
(See Jenkins links below.)

It is a StreamingKMeans test which trains on a DStream with 2 batches and then 
tests on those same 2 batches.  It fails here: 
[https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144]

I recreated the same test, with variants training on: (1) the original 2 
batches, (2a) just the first batch, (2b) just the second batch.

Here is a reproduction of the failure which avoids Streaming altogether. (It 
is the same as the code linked in the comment above, except that I run each 
step manually rather than going through streaming.)

{code}
from pyspark.mllib.clustering import StreamingKMeans, StreamingKMeansModel

batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]
batches = [sc.parallelize(batch) for batch in batches]

stkm = StreamingKMeans(decayFactor=0.0, k=2)
stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])

# Train
def update(rdd):
stkm._model.update(rdd, stkm._decayFactor, stkm._timeUnit)

# Remove one or both of these lines to test skipping batches.
update(batches[0])
update(batches[1])

# Test
def predict(rdd):
return stkm._model.predict(rdd)

predict(batches[0]).collect()
predict(batches[1]).collect()
{code}

*Results*:
{code}
### EXPECTED

[0, 1, 1]   
[1, 0, 1]

### Skip batch 0

[1, 0, 0]
[0, 1, 0]

### Skip batch 1

[0, 1, 1]
[1, 0, 1]

### Skip both batches  (This is what we see in the test 
failures.)

[0, 1, 1]
[0, 0, 0]
{code}

Skipping both batches reproduces the failure.  There is no randomness in the 
StreamingKMeans algorithm (since initial centers are fixed, not randomized).

CC: [~tdas] [~freeman-lab] [~mengxr]

Failure message:
{code}
==
FAIL: test_trainOn_predictOn (__main__.StreamingKMeansTest)
Test that prediction happens on the updated model.
--
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 1147, in test_trainOn_predictOn
self._eventually(condition, catch_assertions=True)
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 123, in _eventually
raise lastValue
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 114, in _eventually
lastValue = condition()
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 1144, in condition
self.assertEqual(predict_results, [[0, 1, 1], [1, 0, 1]])
AssertionError: Lists differ: [[0, 1, 1], [0, 0, 0]] != [[0, 1, 1], [1, 0, 1]]

First differing element 1:
[0, 0, 0]
[1, 0, 1]

- [[0, 1, 1], [0, 0, 0]]
? 

+ [[0, 1, 1], [1, 0, 1]]
?  +++   ^


--
Ran 62 tests in 164.188s
{code}


  was:
Here's a report on investigating this test failure:

[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41081/console]

It is a StreamingKMeans test which trains on a DStream with 2 batches and then 
tests on those same 2 batches.  It fails here: 
[https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144]

I recreated the same test, with variants training on: (1) the original 2 
batches, (2a) just the first batch, (2b) just the second batch.  Here's the 
code:
[https://github.com/jkbradley/spark/blob/d3eedb7773b9e15595cbc79c009fe932703c0b11/examples/src/main/python/mllib/streaming_kmeans.py]

Disturbingly, only (2b) produced the failure, indicating that batch 2 was 
processed and 1 was not.  [~tdas] says queueStream should have consistency 
guarantees and that should not happen.  There is no randomness in the 
StreamingKMeans algorithm (since initial centers are fixed, not randomized).

*Current status: Not sure what happened*

CC: [~tdas] [~freeman-lab] [~mengxr]

Failure message:
{code}
==
FAIL: test_trainOn_predictOn (__main__.StreamingKMeansTest)
Test that prediction happens on the updated model.
--
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 1147, in test_trainOn_predictOn
self._eventually(condition, catch_assertions=True)
  File 

[jira] [Updated] (SPARK-10765) use new aggregate interface for hive UDAF

2015-09-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10765:
-
Target Version/s: 1.6.0

> use new aggregate interface for hive UDAF
> -
>
> Key: SPARK-10765
> URL: https://issues.apache.org/jira/browse/SPARK-10765
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Commented] (SPARK-9325) Support `collect` on DataFrame columns

2015-09-23 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14904913#comment-14904913
 ] 

Narine Kokhlikyan commented on SPARK-9325:
--

Hi everyone,

How far along are you with this feature? It would be nice to have show(df$Age) 
like R does.

Thanks,
Narine

> Support `collect` on DataFrame columns
> --
>
> Key: SPARK-9325
> URL: https://issues.apache.org/jira/browse/SPARK-9325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> This is to support code of the form 
> ```
> ages <- collect(df$Age)
> ```
> Right now `df$Age` returns a Column, which has no functions supported.
> Similarly we might consider supporting `head(df$Age)` etc.






[jira] [Assigned] (SPARK-10763) Update Java MLLIB/ML tests to use simplified dataframe construction

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10763:


Assignee: Apache Spark

> Update Java MLLIB/ML tests to use simplified dataframe construction
> ---
>
> Key: SPARK-10763
> URL: https://issues.apache.org/jira/browse/SPARK-10763
> Project: Spark
>  Issue Type: Test
>  Components: ML, MLlib
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Minor
>
> As introduced in https://issues.apache.org/jira/browse/SPARK-10630, we now 
> have an easier way to create DataFrames from local Java lists. Let's update 
> the tests to use it.






[jira] [Updated] (SPARK-10728) Failed to set Jenkins Identity header on email.

2015-09-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10728:
--
Affects Version/s: (was: 1.6.0)

> Failed to set Jenkins Identity header on email.
> ---
>
> Key: SPARK-10728
> URL: https://issues.apache.org/jira/browse/SPARK-10728
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Xiangrui Meng
>Assignee: shane knapp
>Priority: Trivial
>
> Saw couple Jenkins build failures due to "Failed to set Jenkins Identity 
> header on email", e.g.,
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3572/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/consoleFull
> {code}
> [error] running 
> /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/build/sbt
>  -Phadoop-1 -Dhadoop.version=2.0.0-mr1-cdh4.1.1 -Phive -Pkinesis-asl 
> -Phive-thriftserver test ; received return code 143
> Build step 'Execute shell' marked build as failure
> Archiving artifacts
> Recording test results
> ERROR: Failed to set Jenkins Identity header on email.
> java.lang.NullPointerException
>   at 
> org.jenkinsci.main.modules.instance_identity.InstanceIdentity.get(InstanceIdentity.java:126)
>   at 
> jenkins.plugins.mailer.tasks.MimeMessageBuilder.setJenkinsInstanceIdent(MimeMessageBuilder.java:188)
>   at 
> jenkins.plugins.mailer.tasks.MimeMessageBuilder.buildMimeMessage(MimeMessageBuilder.java:166)
>   at hudson.tasks.MailSender.createEmptyMail(MailSender.java:391)
>   at hudson.tasks.MailSender.createFailureMail(MailSender.java:260)
>   at hudson.tasks.MailSender.createMail(MailSender.java:178)
>   at hudson.tasks.MailSender.run(MailSender.java:107)
>   at hudson.tasks.Mailer.perform(Mailer.java:141)
>   at 
> hudson.tasks.BuildStepCompatibilityLayer.perform(BuildStepCompatibilityLayer.java:75)
>   at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:779)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:726)
>   at hudson.model.Build$BuildExecution.post2(Build.java:185)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:671)
>   at hudson.model.Run.execute(Run.java:1766)
>   at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
>   at hudson.model.ResourceController.execute(ResourceController.java:98)
>   at hudson.model.Executor.run(Executor.java:408)
> Sending e-mails to: spark-bu...@databricks.com rosenvi...@gmail.com
> Finished: FAILURE
> {code}
> The workaround documented on 
> https://issues.jenkins-ci.org/browse/JENKINS-26740 is to downgrade mailer to 
> 1.12.






[jira] [Commented] (SPARK-10728) Failed to set Jenkins Identity header on email.

2015-09-23 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14904979#comment-14904979
 ] 

Xiangrui Meng commented on SPARK-10728:
---

This is still an issue, though not high-priority. We should keep this JIRA open 
if we want to fix it sometime.

> Failed to set Jenkins Identity header on email.
> ---
>
> Key: SPARK-10728
> URL: https://issues.apache.org/jira/browse/SPARK-10728
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Xiangrui Meng
>Assignee: shane knapp
>Priority: Trivial
>
> Saw couple Jenkins build failures due to "Failed to set Jenkins Identity 
> header on email", e.g.,
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3572/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/consoleFull
> {code}
> [error] running 
> /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/build/sbt
>  -Phadoop-1 -Dhadoop.version=2.0.0-mr1-cdh4.1.1 -Phive -Pkinesis-asl 
> -Phive-thriftserver test ; received return code 143
> Build step 'Execute shell' marked build as failure
> Archiving artifacts
> Recording test results
> ERROR: Failed to set Jenkins Identity header on email.
> java.lang.NullPointerException
>   at 
> org.jenkinsci.main.modules.instance_identity.InstanceIdentity.get(InstanceIdentity.java:126)
>   at 
> jenkins.plugins.mailer.tasks.MimeMessageBuilder.setJenkinsInstanceIdent(MimeMessageBuilder.java:188)
>   at 
> jenkins.plugins.mailer.tasks.MimeMessageBuilder.buildMimeMessage(MimeMessageBuilder.java:166)
>   at hudson.tasks.MailSender.createEmptyMail(MailSender.java:391)
>   at hudson.tasks.MailSender.createFailureMail(MailSender.java:260)
>   at hudson.tasks.MailSender.createMail(MailSender.java:178)
>   at hudson.tasks.MailSender.run(MailSender.java:107)
>   at hudson.tasks.Mailer.perform(Mailer.java:141)
>   at 
> hudson.tasks.BuildStepCompatibilityLayer.perform(BuildStepCompatibilityLayer.java:75)
>   at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:779)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:726)
>   at hudson.model.Build$BuildExecution.post2(Build.java:185)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:671)
>   at hudson.model.Run.execute(Run.java:1766)
>   at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
>   at hudson.model.ResourceController.execute(ResourceController.java:98)
>   at hudson.model.Executor.run(Executor.java:408)
> Sending e-mails to: spark-bu...@databricks.com rosenvi...@gmail.com
> Finished: FAILURE
> {code}
> The workaround documented on 
> https://issues.jenkins-ci.org/browse/JENKINS-26740 is to downgrade mailer to 
> 1.12.






[jira] [Created] (SPARK-10780) Set initialModel in KMeans in Pipelines API

2015-09-23 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-10780:
-

 Summary: Set initialModel in KMeans in Pipelines API
 Key: SPARK-10780
 URL: https://issues.apache.org/jira/browse/SPARK-10780
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Joseph K. Bradley









[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

2015-09-23 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905279#comment-14905279
 ] 

Andrew Or commented on SPARK-10474:
---

Re-opening this because I found the real cause for this issue (which turns out 
to be the same as SPARK-10733).

In TungstenAggregate's prepare method, we only reserve a page. However, when we 
switch to sort-based aggregation, we try to acquire a page AND a pointer array. 
This means that even if we spill the page we currently hold, there is a 
reasonable chance that we still cannot allocate the pointer array.

The temporary solution for 1.5.1 would be to simply not track the pointer array 
(we already don't track it in some other places), which should be safe because 
of the generous shuffle safety fraction of 20%.

I'll submit another patch shortly.
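
To make the arithmetic concrete, here is a toy Scala illustration with made-up numbers and a made-up Pool class (not Spark's memory manager): reserving only the page up front gives no guarantee that the pointer array will fit later, even after spilling our own page.

{code}
// Toy model of a shared memory pool; hypothetical, for illustration only.
final case class Pool(total: Long, var used: Long) {
  def tryAcquire(bytes: Long): Boolean =
    if (used + bytes <= total) { used += bytes; true } else false
}

object SortSwitchSketch extends App {
  val pool     = Pool(total = 64L << 20, used = 60L << 20) // other tasks hold most of the memory
  val pageSize = 4L << 20

  assert(pool.tryAcquire(pageSize))      // prepare(): our page is reserved, pool is now full
  pool.used -= pageSize                  // spill our page before switching to sort-based

  val pointerArray = 8L << 20            // the switch needs a page AND a pointer array
  println(pool.tryAcquire(pageSize))     // true: a page fits again
  println(pool.tryAcquire(pointerArray)) // false: the pointer array still cannot be allocated
}
{code}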

> TungstenAggregation cannot acquire memory for pointer array after switching 
> to sort-based
> -
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In an aggregation case, a lost task happened with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 

[jira] [Assigned] (SPARK-10622) Race condition between scheduler and YARN executor status update

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10622:


Assignee: (was: Apache Spark)

> Race condition between scheduler and YARN executor status update
> 
>
> Key: SPARK-10622
> URL: https://issues.apache.org/jira/browse/SPARK-10622
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Priority: Critical
>
> This is a follow up to SPARK-8167. From the comment left in the code:
> {quote}
> TODO there's a race condition where while we are querying the 
> ApplicationMaster for the executor loss reason, there is the potential that 
> tasks will be scheduled on the executor that failed. We should fix this by 
> having this onDisconnected event also "blacklist" executors so that tasks are 
> not assigned to them.
> {quote}
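
For illustration, a minimal Scala sketch of the "blacklist until the loss reason arrives" idea from the TODO (the class and method names are hypothetical, not the scheduler's actual API):

{code}
import scala.collection.mutable

class PendingLossBlacklistSketch {
  // Executors that have disconnected but whose loss reason is still being
  // fetched from the ApplicationMaster; no new tasks should land on them.
  private val pendingLoss = mutable.Set.empty[String]

  def onDisconnected(executorId: String): Unit = pendingLoss += executorId
  def onLossReasonReceived(executorId: String): Unit = pendingLoss -= executorId

  /** Drop resource offers from blacklisted executors before scheduling tasks. */
  def usableOffers(offers: Seq[(String, Int)]): Seq[(String, Int)] =
    offers.filterNot { case (executorId, _) => pendingLoss(executorId) }
}
{code}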






[jira] [Assigned] (SPARK-10622) Race condition between scheduler and YARN executor status update

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10622:


Assignee: Apache Spark

> Race condition between scheduler and YARN executor status update
> 
>
> Key: SPARK-10622
> URL: https://issues.apache.org/jira/browse/SPARK-10622
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Critical
>
> This is a follow up to SPARK-8167. From the comment left in the code:
> {quote}
> TODO there's a race condition where while we are querying the 
> ApplicationMaster for the executor loss reason, there is the potential that 
> tasks will be scheduled on the executor that failed. We should fix this by 
> having this onDisconnected event also "blacklist" executors so that tasks are 
> not assigned to them.
> {quote}






[jira] [Commented] (SPARK-10622) Race condition between scheduler and YARN executor status update

2015-09-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905285#comment-14905285
 ] 

Apache Spark commented on SPARK-10622:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/8887

> Race condition between scheduler and YARN executor status update
> 
>
> Key: SPARK-10622
> URL: https://issues.apache.org/jira/browse/SPARK-10622
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Priority: Critical
>
> This is a follow up to SPARK-8167. From the comment left in the code:
> {quote}
> TODO there's a race condition where while we are querying the 
> ApplicationMaster for the executor loss reason, there is the potential that 
> tasks will be scheduled on the executor that failed. We should fix this by 
> having this onDisconnected event also "blacklist" executors so that tasks are 
> not assigned to them.
> {quote}






[jira] [Updated] (SPARK-10767) Make pyspark shared params codegen more consistent

2015-09-23 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-10767:

Description: Namely "." shows up in some places in the template when using 
the param docstring and not in others

> Make pyspark shared params codegen more consistent 
> ---
>
> Key: SPARK-10767
> URL: https://issues.apache.org/jira/browse/SPARK-10767
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>
> Namely "." shows up in some places in the template when using the param 
> docstring and not in others






[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

2015-09-23 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905299#comment-14905299
 ] 

Andrew Or commented on SPARK-10474:
---

Alright, I think this should fix it for real:
https://github.com/apache/spark/pull/

[~jameszhouyi] [~xjrk] please test it out one last time. Thanks for following 
through on this issue!

> TungstenAggregation cannot acquire memory for pointer array after switching 
> to sort-based
> -
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In an aggregation case, a lost task happened with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.




[jira] [Commented] (SPARK-8616) SQLContext doesn't handle tricky column names when loading from JDBC

2015-09-23 Thread Rick Hillegas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905409#comment-14905409
 ] 

Rick Hillegas commented on SPARK-8616:
--

The following email thread may be useful for understanding this issue: 
http://apache-spark-developers-list.1001551.n3.nabble.com/column-identifiers-in-Spark-SQL-td14280.html

Thanks,
-Rick

> SQLContext doesn't handle tricky column names when loading from JDBC
> 
>
> Key: SPARK-8616
> URL: https://issues.apache.org/jira/browse/SPARK-8616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Ubuntu 14.04, Sqlite 3.8.7, Spark 1.4.0
>Reporter: Gergely Svigruha
>
> Reproduce:
>  - create a table in a relational database (in my case sqlite) with a column 
> name containing a space:
>  CREATE TABLE my_table (id INTEGER, "tricky column" TEXT);
>  - try to create a DataFrame using that table:
> sqlContext.read.format("jdbc").options(Map(
>   "url" -> "jdbs:sqlite:...",
>   "dbtable" -> "my_table")).load()
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> column: tricky)
> According to the SQL spec this should be valid:
> http://savage.net.au/SQL/sql-99.bnf.html#delimited%20identifier






[jira] [Updated] (SPARK-10668) Use WeightedLeastSquares in LinearRegression with L2 regularization if the number of features is small

2015-09-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10668:
--
Shepherd: Xiangrui Meng

> Use WeightedLeastSquares in LinearRegression with L2 regularization if the 
> number of features is small
> --
>
> Key: SPARK-10668
> URL: https://issues.apache.org/jira/browse/SPARK-10668
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Kai Sasaki
>Priority: Critical
>
> If the number of features is small (<=4096) and the regularization is L2, we 
> should use WeightedLeastSquares to solve the problem rather than L-BFGS. The 
> former requires only one pass over the data.
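
For concreteness, a hedged Scala sketch of the proposed selection rule (the names and return values are illustrative, not the actual LinearRegression internals; elasticNetParam == 0.0 is taken to mean pure L2):

{code}
object SolverSelectionSketch {
  // Prefer the one-pass normal-equation solver only for small feature counts
  // with pure L2 regularization; otherwise fall back to the iterative solver.
  def chooseSolver(numFeatures: Int, elasticNetParam: Double): String =
    if (numFeatures <= 4096 && elasticNetParam == 0.0) "WeightedLeastSquares"
    else "L-BFGS"

  def main(args: Array[String]): Unit = {
    println(chooseSolver(numFeatures = 100, elasticNetParam = 0.0))    // WeightedLeastSquares
    println(chooseSolver(numFeatures = 100000, elasticNetParam = 0.0)) // L-BFGS
  }
}
{code}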






[jira] [Resolved] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based

2015-09-23 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10733.
---
Resolution: Duplicate

Looks like this is a duplicate of SPARK-10474 after all. I'm closing this...

> TungstenAggregation cannot acquire page after switching to sort-based
> -
>
> Key: SPARK-10733
> URL: https://issues.apache.org/jira/browse/SPARK-10733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> This was uncovered after fixing SPARK-10474. Stack trace:
> {code}
> 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 
> 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 
> bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Reopened] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

2015-09-23 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reopened SPARK-10474:
---

> TungstenAggregation cannot acquire memory for pointer array after switching 
> to sort-based
> -
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In an aggregation case, a lost task happened with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.




[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

2015-09-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905291#comment-14905291
 ] 

Apache Spark commented on SPARK-10474:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/
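
While that fix is pending, one interim mitigation that is sometimes suggested is to give each task less aggregation state and more shuffle memory. This is only an untested sketch, not the change in the pull request, and every value below is illustrative rather than a recommendation:

{code}
# Untested mitigation sketch; config values are illustrative, not recommendations.
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = (SparkConf()
        .setAppName("tungsten-agg-mitigation-sketch")
        .set("spark.shuffle.memoryFraction", "0.5")   # more room for Tungsten pages in 1.5.x
        .set("spark.executor.memory", "8g"))          # illustrative executor size
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)

# More, smaller shuffle partitions reduce the per-task aggregation state,
# which lowers the chance of failing to acquire memory when spilling.
sqlContext.setConf("spark.sql.shuffle.partitions", "800")
{code}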

> TungstenAggregation cannot acquire memory for pointer array after switching 
> to sort-based
> -
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-10767) Make pyspark shared params codegen more consistent

2015-09-23 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905303#comment-14905303
 ] 

holdenk commented on SPARK-10767:
-

Updated the description, sorry about that. This comes from 
https://github.com/apache/spark/pull/8214#issuecomment-142486825 

> Make pyspark shared params codegen more consistent 
> ---
>
> Key: SPARK-10767
> URL: https://issues.apache.org/jira/browse/SPARK-10767
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>
> Namely "." shows up in some places in the template when using the param 
> docstring and not in others



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10767) Make pyspark shared params codegen more consistent

2015-09-23 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905322#comment-14905322
 ] 

Joseph K. Bradley commented on SPARK-10767:
---

Oh, yeah, that is annoying.  + 1

> Make pyspark shared params codegen more consistent 
> ---
>
> Key: SPARK-10767
> URL: https://issues.apache.org/jira/browse/SPARK-10767
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>
> Namely "." shows up in some places in the template when using the param 
> docstring and not in others



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10767) Make pyspark shared params codegen more consistent

2015-09-23 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905329#comment-14905329
 ] 

holdenk commented on SPARK-10767:
-

My plan was to wait for that PR to go in and then do this as a quick cleanup 
afterwards (since, as pointed out in the original PR, it's a pretty unrelated change).

> Make pyspark shared params codegen more consistent 
> ---
>
> Key: SPARK-10767
> URL: https://issues.apache.org/jira/browse/SPARK-10767
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>
> Namely "." shows up in some places in the template when using the param 
> docstring and not in others



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4885) Enable fetched blocks to exceed 2 GB

2015-09-23 Thread Sai Nishanth Parepally (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905338#comment-14905338
 ] 

Sai Nishanth Parepally commented on SPARK-4885:
---

I am using Spark 1.4.1 and facing the same issue; what is the workaround for 
the 2 GB limitation?

Here is the error:
15/09/23 13:19:29 WARN TransportChannelHandler: Exception in connection from 
XX.XX.XX.XX:X
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at 
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at 
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:312)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:570)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at 
org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:148)
at 
org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:153)
at 
org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1189)
at 
org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1198)
at org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:190)
at 
org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:480)
at 
org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:302)
at 
org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
at 
org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
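
A commonly suggested user-side workaround, sketched below, is to use more (and therefore smaller) partitions so that no single shuffle block or cached partition approaches 2 GB. This is only a hedged sketch and does not remove the underlying limitation; the input path and partition count are made up:

{code}
# Hedged workaround sketch: keep individual partitions/blocks well under 2 GB.
# The input path, partition count, and storage level are illustrative only.
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[4]", "block-size-workaround-sketch")
rdd = sc.textFile("hdfs:///path/to/large/input")   # hypothetical input path

# More partitions => smaller blocks serialized and fetched at a time.
smaller_blocks = rdd.repartition(2000)

# Serialized storage also tends to produce more compact blocks.
smaller_blocks.persist(StorageLevel.MEMORY_AND_DISK_SER)
print(smaller_blocks.count())
{code}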

> Enable fetched blocks to exceed 2 GB
> 
>
> Key: SPARK-4885
> URL: https://issues.apache.org/jira/browse/SPARK-4885
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>
> {code}
> 14/12/18 09:53:13 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught 
> exception in thread Thread[handle-message-executor-12,5,main]
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:2271)
> at 
> java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at 
> java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at 
> java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
> at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
> at 
> com.esotericsoftware.kryo.io.Output.require(Output.java:135)
> at 
> com.esotericsoftware.kryo.io.Output.writeLong(Output.java:477)
> at 
> com.esotericsoftware.kryo.io.Output.writeDouble(Output.java:596)
> at 
> 

[jira] [Created] (SPARK-10783) Do track the pointer array in UnsafeInMemorySorter

2015-09-23 Thread Andrew Or (JIRA)
Andrew Or created SPARK-10783:
-

 Summary: Do track the pointer array in UnsafeInMemorySorter
 Key: SPARK-10783
 URL: https://issues.apache.org/jira/browse/SPARK-10783
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0, 1.5.1
Reporter: Andrew Or
Priority: Blocker


SPARK-10474 (https://github.com/apache/spark/pull/) removed the pointer 
array tracking because `TungstenAggregate` would fail under memory pressure. 
However, this is somewhat of a hack that we should fix in the right way in 
1.6.0 to ensure we don't OOM because of this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-23 Thread Ian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905327#comment-14905327
 ] 

Ian edited comment on SPARK-10741 at 9/23/15 9:45 PM:
--

Yes, going through all rules when resolving Sort on Aggregate is a correct 
approach.

The main problem appeared to be that the execute call at 
(https://github.com/apache/spark/blob/v1.5.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L571)
 is resolving to different attribute ids, causing confusion at 
https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L592-L611.

Just so I understand a bit more: the second approach you are proposing is to 
remove the confusion by changing how ids are resolved in Analyzer.scala#L571, right?





was (Author: ianlcsd):
Yes, going through all rules when resolve Sort on Aggregate is a correct 
approach.

The main problem appeared that the execute call at 
(https://github.com/apache/spark/blob/v1.5.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L571)
 is resolving to different attribute ids, and causing confusion at  
https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L592-L611.

just for me to understand a bit more:
the second approach you are proposing is to remove the confusion by changing 
how ids are resolved in Analyzer.scala#L571, right? 




> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
>   def testParquetOrderBy() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedOrderBy =
>   """ SELECT c1, avg ( c2 ) c_avg
> | FROM test
> | GROUP BY c1
> | ORDER BY avg ( c2 )""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedOrderBy).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
> from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) 
> AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> {code}
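
For reference, below is an untested sketch of a possible user-side rewrite while the analyzer fix is discussed: it pre-aggregates in a subquery and applies the filter and ordering on the alias. Whether this actually sidesteps the resolution problem above has not been verified here, and the table matches the DDL from the description:

{code}
# Untested sketch of a possible user-side rewrite; not a fix for the analyzer issue.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext("local[2]", "having-orderby-rewrite-sketch")
sqlContext = HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS test (c1 string, c2 int) STORED AS PARQUET")

# Pre-aggregate in a subquery, then filter/order on the alias instead of
# using HAVING / ORDER BY directly on the aggregate expression.
rewritten = sqlContext.sql("""
    SELECT c1, c_avg
    FROM (SELECT c1, avg(c2) AS c_avg FROM test GROUP BY c1) t
    WHERE c_avg > 5
    ORDER BY c1
""")
rewritten.show()
{code}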



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9715) Store numFeatures in all ML PredictionModel types

2015-09-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-9715.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8675
[https://github.com/apache/spark/pull/8675]

> Store numFeatures in all ML PredictionModel types
> -
>
> Key: SPARK-9715
> URL: https://issues.apache.org/jira/browse/SPARK-9715
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Seth Hendrickson
>Priority: Minor
> Fix For: 1.6.0
>
>
> The PredictionModel abstraction should store numFeatures.  Currently, only 
> RandomForest* types do this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10686) Add quantileCol to AFTSurvivalRegression

2015-09-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10686.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8836
[https://github.com/apache/spark/pull/8836]

> Add quantileCol to AFTSurvivalRegression
> 
>
> Key: SPARK-10686
> URL: https://issues.apache.org/jira/browse/SPARK-10686
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
> Fix For: 1.6.0
>
>
> By default `quantileCol` should be empty. If both `quantileProbabilities` and 
> `quantileCol` are set, we should append quantiles as a new column (of type 
> Vector).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10086) Flaky StreamingKMeans test in PySpark

2015-09-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10086:
--
Description: 
Here's a report on investigating test failures in StreamingKMeans in PySpark. 
(See Jenkins links below.)

It is a StreamingKMeans test which trains on a DStream with 2 batches and then 
tests on those same 2 batches.  It fails here: 
[https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144]

I recreated the same test, with variants training on: (1) the original 2 
batches, (2) just the first batch, (3) just the second batch, and (4) neither 
batch.  Here is a reproduction of the failure which avoids Streaming altogether.

{code}
from pyspark.mllib.clustering import StreamingKMeans, StreamingKMeansModel

batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]
batches = [sc.parallelize(batch) for batch in batches]

stkm = StreamingKMeans(decayFactor=0.0, k=2)
stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])

# Train
def update(rdd):
stkm._model.update(rdd, stkm._decayFactor, stkm._timeUnit)

# Remove one or both of these lines to test skipping batches.
update(batches[0])
update(batches[1])

# Test
def predict(rdd):
return stkm._model.predict(rdd)

predict(batches[0]).collect()
predict(batches[1]).collect()
{code}

*Results*:
{code}
### EXPECTED

[0, 1, 1]   
[1, 0, 1]

### Skip batch 0

[1, 0, 0]
[0, 1, 0]

### Skip batch 1

[0, 1, 1]
[1, 0, 1]

### Skip both batches  (This is what we see in the test 
failures.)

[0, 1, 1]
[0, 0, 0]
{code}

Skipping both batches reproduces the failure.  There is no randomness in the 
StreamingKMeans algorithm (since initial centers are fixed, not randomized).

CC: [~tdas] [~freeman-lab] [~mengxr]

Failure message:
{code}
==
FAIL: test_trainOn_predictOn (__main__.StreamingKMeansTest)
Test that prediction happens on the updated model.
--
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 1147, in test_trainOn_predictOn
self._eventually(condition, catch_assertions=True)
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 123, in _eventually
raise lastValue
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 114, in _eventually
lastValue = condition()
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 1144, in condition
self.assertEqual(predict_results, [[0, 1, 1], [1, 0, 1]])
AssertionError: Lists differ: [[0, 1, 1], [0, 0, 0]] != [[0, 1, 1], [1, 0, 1]]

First differing element 1:
[0, 0, 0]
[1, 0, 1]

- [[0, 1, 1], [0, 0, 0]]
? 

+ [[0, 1, 1], [1, 0, 1]]
?  +++   ^


--
Ran 62 tests in 164.188s
{code}


  was:
Here's a report on investigating test failures in StreamingKMeans in PySpark. 
(See Jenkins links below.)

It is a StreamingKMeans test which trains on a DStream with 2 batches and then 
tests on those same 2 batches.  It fails here: 
[https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144]

I recreated the same test, with variants training on: (1) the original 2 
batches, (2a) just the first batch, (2b) just the second batch.  Here's the 
code:

Here is a reproduction of the failure which avoids Streaming altogether.  (It 
is the same as the code I have linked in the comment above, except that I run 
each step manually rather than going through streaming.)

{code}
from pyspark.mllib.clustering import StreamingKMeans, StreamingKMeansModel

batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]
batches = [sc.parallelize(batch) for batch in batches]

stkm = StreamingKMeans(decayFactor=0.0, k=2)
stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])

# Train
def update(rdd):
stkm._model.update(rdd, stkm._decayFactor, stkm._timeUnit)

# Remove one or both of these lines to test skipping batches.
update(batches[0])
update(batches[1])

# Test
def predict(rdd):
return stkm._model.predict(rdd)

predict(batches[0]).collect()
predict(batches[1]).collect()
{code}

*Results*:
{code}
### EXPECTED

[0, 1, 1]   
[1, 0, 1]

### Skip batch 0

[1, 0, 0]
[0, 1, 0]

### Skip batch 1

[0, 1, 1]
[1, 0, 1]

### Skip both batches  (This is what we see in the test 
failures.)

[0, 1, 1]

[jira] [Updated] (SPARK-10086) Flaky StreamingKMeans test in PySpark

2015-09-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10086:
--
Description: 
Here's a report on investigating test failures in StreamingKMeans in PySpark. 
(See Jenkins links below.)

It is a StreamingKMeans test which trains on a DStream with 2 batches and then 
tests on those same 2 batches.  It fails here: 
[https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144]

I recreated the same test, with variants training on: (1) the original 2 
batches, (2) just the first batch, (3) just the second batch, and (4) neither 
batch.  Here is code which avoids Streaming altogether to identify what batches 
were processed.

{code}
from pyspark.mllib.clustering import StreamingKMeans, StreamingKMeansModel

batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]
batches = [sc.parallelize(batch) for batch in batches]

stkm = StreamingKMeans(decayFactor=0.0, k=2)
stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])

# Train
def update(rdd):
stkm._model.update(rdd, stkm._decayFactor, stkm._timeUnit)

# Remove one or both of these lines to test skipping batches.
update(batches[0])
update(batches[1])

# Test
def predict(rdd):
return stkm._model.predict(rdd)

predict(batches[0]).collect()
predict(batches[1]).collect()
{code}

*Results*:
{code}
### EXPECTED

[0, 1, 1]   
[1, 0, 1]

### Skip batch 0

[1, 0, 0]
[0, 1, 0]

### Skip batch 1

[0, 1, 1]
[1, 0, 1]

### Skip both batches  (This is what we see in the test 
failures.)

[0, 1, 1]
[0, 0, 0]
{code}

Skipping both batches reproduces the failure.  There is no randomness in the 
StreamingKMeans algorithm (since initial centers are fixed, not randomized).

CC: [~tdas] [~freeman-lab] [~mengxr]

Failure message:
{code}
==
FAIL: test_trainOn_predictOn (__main__.StreamingKMeansTest)
Test that prediction happens on the updated model.
--
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 1147, in test_trainOn_predictOn
self._eventually(condition, catch_assertions=True)
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 123, in _eventually
raise lastValue
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 114, in _eventually
lastValue = condition()
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
 line 1144, in condition
self.assertEqual(predict_results, [[0, 1, 1], [1, 0, 1]])
AssertionError: Lists differ: [[0, 1, 1], [0, 0, 0]] != [[0, 1, 1], [1, 0, 1]]

First differing element 1:
[0, 0, 0]
[1, 0, 1]

- [[0, 1, 1], [0, 0, 0]]
? 

+ [[0, 1, 1], [1, 0, 1]]
?  +++   ^


--
Ran 62 tests in 164.188s
{code}


  was:
Here's a report on investigating test failures in StreamingKMeans in PySpark. 
(See Jenkins links below.)

It is a StreamingKMeans test which trains on a DStream with 2 batches and then 
tests on those same 2 batches.  It fails here: 
[https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144]

I recreated the same test, with variants training on: (1) the original 2 
batches, (2) just the first batch, (3) just the second batch, and (4) neither 
batch.  Here is a reproduction of the failure which avoids Streaming altogether.

{code}
from pyspark.mllib.clustering import StreamingKMeans, StreamingKMeansModel

batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]
batches = [sc.parallelize(batch) for batch in batches]

stkm = StreamingKMeans(decayFactor=0.0, k=2)
stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])

# Train
def update(rdd):
stkm._model.update(rdd, stkm._decayFactor, stkm._timeUnit)

# Remove one or both of these lines to test skipping batches.
update(batches[0])
update(batches[1])

# Test
def predict(rdd):
return stkm._model.predict(rdd)

predict(batches[0]).collect()
predict(batches[1]).collect()
{code}

*Results*:
{code}
### EXPECTED

[0, 1, 1]   
[1, 0, 1]

### Skip batch 0

[1, 0, 0]
[0, 1, 0]

### Skip batch 1

[0, 1, 1]
[1, 0, 1]

### Skip both batches  (This is what we see in the test 
failures.)

[0, 1, 1]
[0, 0, 0]
{code}

Skipping both batches reproduces the failure.  There is no randomness in the 
StreamingKMeans algorithm 

[jira] [Created] (SPARK-10781) Allow certain number of failed tasks and allow job to succeed

2015-09-23 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-10781:
-

 Summary: Allow certain number of failed tasks and allow job to 
succeed
 Key: SPARK-10781
 URL: https://issues.apache.org/jira/browse/SPARK-10781
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.5.0
Reporter: Thomas Graves


MapReduce has this config mapreduce.map.failures.maxpercent and 
mapreduce.reduce.failures.maxpercent which allows for a certain percent of 
tasks to fail but the job to still succeed.  

This could be a useful feature in Spark also if a job doesn't need all the 
tasks to be successful.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-23 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905166#comment-14905166
 ] 

Yin Huai commented on SPARK-10741:
--

The second option sounds better.

> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
>   def testParquetOrderBy() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedOrderBy =
>   """ SELECT c1, avg ( c2 ) c_avg
> | FROM test
> | GROUP BY c1
> | ORDER BY avg ( c2 )""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedOrderBy).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
> from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) 
> AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10782) Duplicate examples for drop_duplicates and DropDuplicates

2015-09-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905180#comment-14905180
 ] 

Sean Owen commented on SPARK-10782:
---

Looks like you're right; feel free to make a PR with the correct example for 
drop_duplicates.
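
For reference, a rough sketch of what a distinct drop_duplicates example might look like (unverified, and the data and column choices are illustrative only, not the final doc text):

{code}
# Illustrative sketch of a drop_duplicates example; data is made up for the demo.
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext("local[2]", "drop-duplicates-doc-sketch")
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame([
    Row(name='Alice', age=5, height=80),
    Row(name='Alice', age=5, height=80),
    Row(name='Alice', age=10, height=80),
])

df.drop_duplicates().show()                    # removes fully duplicated rows
df.drop_duplicates(['name', 'height']).show()  # considers only the given columns
{code}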

> Duplicate examples for drop_duplicates and DropDuplicates
> -
>
> Key: SPARK-10782
> URL: https://issues.apache.org/jira/browse/SPARK-10782
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Asoka Diggs
>Priority: Trivial
>
> In documentation for pyspark.sql, the source code examples for DropDuplicates 
> and drop_duplicates are identical with each other.  It appears that the 
> example for DropDuplicates was copy/pasted for drop_duplicates and not edited.
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dropDuplicates



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10767) Make pyspark shared params codegen more consistent

2015-09-23 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905196#comment-14905196
 ] 

Joseph K. Bradley commented on SPARK-10767:
---

What issues specifically?

> Make pyspark shared params codegen more consistent 
> ---
>
> Key: SPARK-10767
> URL: https://issues.apache.org/jira/browse/SPARK-10767
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10413) Model should support prediction on single instance

2015-09-23 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905230#comment-14905230
 ] 

Joseph K. Bradley commented on SPARK-10413:
---

For API, I think my main question is whether predict() should take strong types 
(Vector, etc.) and/or Rows.  I prefer supporting strong types first (as you are 
doing) since we could add support for Rows later on (although there could be 
difficult questions about missing schema for Scala/Java).

For raw & probability, I would again vote for just making those public.  But 
that could be done at a later time.
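
To make the discussion concrete, here is a purely hypothetical usage sketch of the proposed strongly-typed API. None of the single-instance calls below exist as public API in Spark 1.5; the names simply mirror the methods being discussed:

{code}
# Hypothetical sketch only -- the single-instance methods are the proposal,
# not an existing API; today only transform(DataFrame) is public.
from pyspark.mllib.linalg import Vectors

features = Vectors.dense([0.0, 1.2, -0.5])

# Proposed strongly-typed calls (do not exist yet as public API):
# label = model.predict(features)               # single-instance prediction
# raw = model.predictRaw(features)              # if raw scores are made public
# prob = model.predictProbability(features)     # if probabilities are made public
{code}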

> Model should support prediction on single instance
> --
>
> Key: SPARK-10413
> URL: https://issues.apache.org/jira/browse/SPARK-10413
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> Currently models in the pipeline API only implement transform(DataFrame). It 
> would be quite useful to support prediction on single instance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10782) Duplicate examples for drop_duplicates and DropDuplicates

2015-09-23 Thread Asoka Diggs (JIRA)
Asoka Diggs created SPARK-10782:
---

 Summary: Duplicate examples for drop_duplicates and DropDuplicates
 Key: SPARK-10782
 URL: https://issues.apache.org/jira/browse/SPARK-10782
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.5.0
Reporter: Asoka Diggs
Priority: Trivial


In documentation for pyspark.sql, the source code examples for DropDuplicates 
and drop_duplicates are identical with each other.  It appears that the example 
for DropDuplicates was copy/pasted for drop_duplicates and not edited.

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dropDuplicates




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9836) Provide R-like summary statistics for ordinary least squares via normal equation solver

2015-09-23 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14904471#comment-14904471
 ] 

Mohamed Baddar edited comment on SPARK-9836 at 9/23/15 8:39 PM:


Thanks a lot [~mengxr], I will try one of the starter tasks, but it seems they 
are all taken; if so, what should I do next?


was (Author: mbaddar):
Thanks a lot , i will try one of the starter tasks , but seems they are all 
taken , if so , what should i do next ?

> Provide R-like summary statistics for ordinary least squares via normal 
> equation solver
> ---
>
> Key: SPARK-9836
> URL: https://issues.apache.org/jira/browse/SPARK-9836
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>
> In R, model fitting comes with summary statistics. We can provide most of 
> those via normal equation solver (SPARK-9834). If some statistics requires 
> additional passes to the dataset, we can expose an option to let users select 
> desired statistics before model fitting. 
> {code}
> > summary(model)
> Call:
> glm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
> Deviance Residuals: 
>  Min1QMedian3Q   Max  
> -1.30711  -0.25713  -0.05325   0.19542   1.41253  
> Coefficients:
>   Estimate Std. Error t value Pr(>|t|)
> (Intercept) 2.2514 0.3698   6.089 9.57e-09 ***
> Sepal.Width 0.8036 0.1063   7.557 4.19e-12 ***
> Speciesversicolor   1.4587 0.1121  13.012  < 2e-16 ***
> Speciesvirginica1.9468 0.1000  19.465  < 2e-16 ***
> ---
> Signif. codes:  
> 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> (Dispersion parameter for gaussian family taken to be 0.1918059)
> Null deviance: 102.168  on 149  degrees of freedom
> Residual deviance:  28.004  on 146  degrees of freedom
> AIC: 183.94
> Number of Fisher Scoring iterations: 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10787) Reset ObjectOutputStream more often to prevent OOME

2015-09-23 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-10787:
---
Description: 
In the thread, Spark ClosureCleaner or java serializer OOM when trying to grow 
(http://search-hadoop.com/m/q3RTtAr5X543dNn), Jay Luan reported that 
ClosureCleaner#ensureSerializable() resulted in OOME.

The cause was that ObjectOutputStream keeps a strong reference of every object 
that was written to it.

This issue tries to avoid OOME by calling reset() more often.

  was:
In the thread, Spark ClosureCleaner or java serializer OOM when trying to grow, 
Jay Luan reported that ClosureCleaner#ensureSerializable() resulted in OOME.

The cause was that ObjectOutputStream keeps a strong reference of every object 
that was written to it.

This issue tries to avoid OOME by calling reset() more often.


> Reset ObjectOutputStream more often to prevent OOME
> ---
>
> Key: SPARK-10787
> URL: https://issues.apache.org/jira/browse/SPARK-10787
> Project: Spark
>  Issue Type: Bug
>Reporter: Ted Yu
>
> In the thread, Spark ClosureCleaner or java serializer OOM when trying to 
> grow (http://search-hadoop.com/m/q3RTtAr5X543dNn), Jay Luan reported that 
> ClosureCleaner#ensureSerializable() resulted in OOME.
> The cause was that ObjectOutputStream keeps a strong reference of every 
> object that was written to it.
> This issue tries to avoid OOME by calling reset() more often.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10413) Model should support prediction on single instance

2015-09-23 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14904276#comment-14904276
 ] 

Yanbo Liang commented on SPARK-10413:
-

[~mengxr] 
I think supporting prediction on a single instance for PredictionModel and its 
subclasses is not complex; we just need to make predict() public and add test cases.
For other Model subclasses we should add a predict function for single-instance 
prediction, and the transform function will then use the predict functions as UDFs.
One issue we should discuss is that, after we make predict public, do we also 
need to make other functions such as predictRaw, predictProbability, etc. public 
for single instances?
 

> Model should support prediction on single instance
> --
>
> Key: SPARK-10413
> URL: https://issues.apache.org/jira/browse/SPARK-10413
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> Currently models in the pipeline API only implement transform(DataFrame). It 
> would be quite useful to support prediction on single instance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10786) SparkSQLCLIDriver should take the whole statement to generate the CommandProcessor

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10786:


Assignee: (was: Apache Spark)

> SparkSQLCLIDriver should take the whole statement to generate the 
> CommandProcessor
> --
>
> Key: SPARK-10786
> URL: https://issues.apache.org/jira/browse/SPARK-10786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: SaintBacchus
>Priority: Minor
>
> In the current implementation of SparkSQLCLIDriver.scala: 
> *val proc: CommandProcessor = CommandProcessorFactory.get(Array(tokens(0)), 
> hconf)*
> *CommandProcessorFactory* only takes the first token of the statement, which 
> makes it hard to distinguish the statements *delete jar xxx* and *delete from 
> xxx*.
> So it may be better to pass the whole statement to the 
> *CommandProcessorFactory*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10786) SparkSQLCLIDriver should take the whole statement to generate the CommandProcessor

2015-09-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905692#comment-14905692
 ] 

Apache Spark commented on SPARK-10786:
--

User 'SaintBacchus' has created a pull request for this issue:
https://github.com/apache/spark/pull/8895

> SparkSQLCLIDriver should take the whole statement to generate the 
> CommandProcessor
> --
>
> Key: SPARK-10786
> URL: https://issues.apache.org/jira/browse/SPARK-10786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: SaintBacchus
>Priority: Minor
>
> In the current implementation of SparkSQLCLIDriver.scala: 
> *val proc: CommandProcessor = CommandProcessorFactory.get(Array(tokens(0)), 
> hconf)*
> *CommandProcessorFactory* only takes the first token of the statement, which 
> makes it hard to distinguish the statements *delete jar xxx* and *delete from 
> xxx*.
> So it may be better to pass the whole statement to the 
> *CommandProcessorFactory*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10786) SparkSQLCLIDriver should take the whole statement to generate the CommandProcessor

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10786:


Assignee: Apache Spark

> SparkSQLCLIDriver should take the whole statement to generate the 
> CommandProcessor
> --
>
> Key: SPARK-10786
> URL: https://issues.apache.org/jira/browse/SPARK-10786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: SaintBacchus
>Assignee: Apache Spark
>Priority: Minor
>
> In the current implementation of SparkSQLCLIDriver.scala: 
> *val proc: CommandProcessor = CommandProcessorFactory.get(Array(tokens(0)), 
> hconf)*
> *CommandProcessorFactory* only takes the first token of the statement, which 
> makes it hard to distinguish the statements *delete jar xxx* and *delete from 
> xxx*.
> So it may be better to pass the whole statement to the 
> *CommandProcessorFactory*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

2015-09-23 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-10474.
-
Resolution: Fixed

> TungstenAggregation cannot acquire memory for pointer array after switching 
> to sort-based
> -
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.5.1, 1.6.0
>
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-10692) Failed batches are never reported through the StreamingListener interface

2015-09-23 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-10692.
---
   Resolution: Fixed
Fix Version/s: 1.6.0
   1.5.1

> Failed batches are never reported through the StreamingListener interface
> -
>
> Key: SPARK-10692
> URL: https://issues.apache.org/jira/browse/SPARK-10692
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Tathagata Das
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 1.5.1, 1.6.0
>
>
> If an output operation fails, then corresponding batch is never marked as 
> completed, as the data structure are not updated properly.
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L183



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10086) Flaky StreamingKMeans test in PySpark

2015-09-23 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905729#comment-14905729
 ] 

Tathagata Das commented on SPARK-10086:
---

You could maintain a counter for the number of batches completed; that is, a 
foreachRDD call can increment a counter, and the testing code should wait for the 
counter to reach 2 before checking the model. Alternatively, the check should be 
in an eventually loop.
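
A minimal sketch of that batch-counter idea, avoiding the MLlib pieces entirely: a queueStream stands in for the test's DStream, and the counter, helper names, and timeout are made up here.

{code}
# Minimal sketch of the batch-counter idea; names and timeouts are illustrative.
import time
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "batch-counter-sketch")
ssc = StreamingContext(sc, 1)

batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]
stream = ssc.queueStream([sc.parallelize(b) for b in batches])

completed = {'n': 0}

def count_batch(rdd):
    # foreachRDD runs on the driver once per batch interval, so this counts
    # processed batches (a real test would ignore empty trailing batches).
    completed['n'] += 1

stream.foreachRDD(count_batch)
ssc.start()

# Wait until both training batches have actually been processed (or time out)
# before checking the model, instead of asserting right away.
deadline = time.time() + 30
while completed['n'] < 2 and time.time() < deadline:
    time.sleep(0.1)

ssc.stop()
{code}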

> Flaky StreamingKMeans test in PySpark
> -
>
> Key: SPARK-10086
> URL: https://issues.apache.org/jira/browse/SPARK-10086
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark, Streaming, Tests
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>
> Here's a report on investigating test failures in StreamingKMeans in PySpark. 
> (See Jenkins links below.)
> It is a StreamingKMeans test which trains on a DStream with 2 batches and 
> then tests on those same 2 batches.  It fails here: 
> [https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144]
> I recreated the same test, with variants training on: (1) the original 2 
> batches, (2) just the first batch, (3) just the second batch, and (4) neither 
> batch.  Here is code which avoids Streaming altogether to identify what 
> batches were processed.
> {code}
> from pyspark.mllib.clustering import StreamingKMeans, StreamingKMeansModel
> batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]
> batches = [sc.parallelize(batch) for batch in batches]
> stkm = StreamingKMeans(decayFactor=0.0, k=2)
> stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])
> # Train
> def update(rdd):
> stkm._model.update(rdd, stkm._decayFactor, stkm._timeUnit)
> # Remove one or both of these lines to test skipping batches.
> update(batches[0])
> update(batches[1])
> # Test
> def predict(rdd):
> return stkm._model.predict(rdd)
> predict(batches[0]).collect()
> predict(batches[1]).collect()
> {code}
> *Results*:
> {code}
> ### EXPECTED
> [0, 1, 1] 
>   
> [1, 0, 1]
> ### Skip batch 0
> [1, 0, 0]
> [0, 1, 0]
> ### Skip batch 1
> [0, 1, 1]
> [1, 0, 1]
> ### Skip both batches  (This is what we see in the test 
> failures.)
> [0, 1, 1]
> [0, 0, 0]
> {code}
> Skipping both batches reproduces the failure.  There is no randomness in the 
> StreamingKMeans algorithm (since initial centers are fixed, not randomized).
> CC: [~tdas] [~freeman-lab] [~mengxr]
> Failure message:
> {code}
> ==
> FAIL: test_trainOn_predictOn (__main__.StreamingKMeansTest)
> Test that prediction happens on the updated model.
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
>  line 1147, in test_trainOn_predictOn
> self._eventually(condition, catch_assertions=True)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
>  line 123, in _eventually
> raise lastValue
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
>  line 114, in _eventually
> lastValue = condition()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
>  line 1144, in condition
> self.assertEqual(predict_results, [[0, 1, 1], [1, 0, 1]])
> AssertionError: Lists differ: [[0, 1, 1], [0, 0, 0]] != [[0, 1, 1], [1, 0, 1]]
> First differing element 1:
> [0, 0, 0]
> [1, 0, 1]
> - [[0, 1, 1], [0, 0, 0]]
> ? 
> + [[0, 1, 1], [1, 0, 1]]
> ?  +++   ^
> --
> Ran 62 tests in 164.188s
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10086) Flaky StreamingKMeans test in PySpark

2015-09-23 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905731#comment-14905731
 ] 

Tathagata Das commented on SPARK-10086:
---

Actually, never mind, it's already in an eventually loop. The default timeout is 30 
seconds, so I don't get why this is failing. Could it be a thread race condition / 
visibility issue?

> Flaky StreamingKMeans test in PySpark
> -
>
> Key: SPARK-10086
> URL: https://issues.apache.org/jira/browse/SPARK-10086
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark, Streaming, Tests
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>
> Here's a report on investigating test failures in StreamingKMeans in PySpark. 
> (See Jenkins links below.)
> It is a StreamingKMeans test which trains on a DStream with 2 batches and 
> then tests on those same 2 batches.  It fails here: 
> [https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144]
> I recreated the same test, with variants training on: (1) the original 2 
> batches, (2) just the first batch, (3) just the second batch, and (4) neither 
> batch.  Here is code which avoids Streaming altogether to identify what 
> batches were processed.
> {code}
> from pyspark.mllib.clustering import StreamingKMeans, StreamingKMeansModel
> batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]
> batches = [sc.parallelize(batch) for batch in batches]
> stkm = StreamingKMeans(decayFactor=0.0, k=2)
> stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])
> # Train
> def update(rdd):
> stkm._model.update(rdd, stkm._decayFactor, stkm._timeUnit)
> # Remove one or both of these lines to test skipping batches.
> update(batches[0])
> update(batches[1])
> # Test
> def predict(rdd):
> return stkm._model.predict(rdd)
> predict(batches[0]).collect()
> predict(batches[1]).collect()
> {code}
> *Results*:
> {code}
> ### EXPECTED
> [0, 1, 1] 
>   
> [1, 0, 1]
> ### Skip batch 0
> [1, 0, 0]
> [0, 1, 0]
> ### Skip batch 1
> [0, 1, 1]
> [1, 0, 1]
> ### Skip both batches  (This is what we see in the test 
> failures.)
> [0, 1, 1]
> [0, 0, 0]
> {code}
> Skipping both batches reproduces the failure.  There is no randomness in the 
> StreamingKMeans algorithm (since initial centers are fixed, not randomized).
> CC: [~tdas] [~freeman-lab] [~mengxr]
> Failure message:
> {code}
> ==
> FAIL: test_trainOn_predictOn (__main__.StreamingKMeansTest)
> Test that prediction happens on the updated model.
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
>  line 1147, in test_trainOn_predictOn
> self._eventually(condition, catch_assertions=True)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
>  line 123, in _eventually
> raise lastValue
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
>  line 114, in _eventually
> lastValue = condition()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
>  line 1144, in condition
> self.assertEqual(predict_results, [[0, 1, 1], [1, 0, 1]])
> AssertionError: Lists differ: [[0, 1, 1], [0, 0, 0]] != [[0, 1, 1], [1, 0, 1]]
> First differing element 1:
> [0, 0, 0]
> [1, 0, 1]
> - [[0, 1, 1], [0, 0, 0]]
> ? 
> + [[0, 1, 1], [1, 0, 1]]
> ?  +++   ^
> --
> Ran 62 tests in 164.188s
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10787) Reset ObjectOutputStream more often to prevent OOME

2015-09-23 Thread Ted Yu (JIRA)
Ted Yu created SPARK-10787:
--

 Summary: Reset ObjectOutputStream more often to prevent OOME
 Key: SPARK-10787
 URL: https://issues.apache.org/jira/browse/SPARK-10787
 Project: Spark
  Issue Type: Bug
Reporter: Ted Yu


In the mailing list thread "Spark ClosureCleaner or java serializer OOM when 
trying to grow", Jay Luan reported that ClosureCleaner#ensureSerializable() 
resulted in an OOME.

The cause was that ObjectOutputStream keeps a strong reference to every object 
that was written to it.

This issue tries to avoid the OOME by calling reset() more often.
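
A minimal sketch of the intended pattern, against the plain java.io API (the 
reset interval below is a hypothetical knob for illustration, not an existing 
Spark setting):

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// ObjectOutputStream keeps a handle table with a strong reference to every
// object it has written (so it can emit back-references). For a long-lived
// stream this pins all of those objects in memory; reset() clears the table.
def writeAll(objects: Iterator[AnyRef], resetInterval: Int = 10000): Array[Byte] = {
  val bytes = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bytes)
  var count = 0
  objects.foreach { obj =>
    oos.writeObject(obj)
    count += 1
    if (count % resetInterval == 0) oos.reset()  // release references written so far
  }
  oos.close()
  bytes.toByteArray
}
{code}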



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10787) Reset ObjectOutputStream more often to prevent OOME

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10787:


Assignee: (was: Apache Spark)

> Reset ObjectOutputStream more often to prevent OOME
> ---
>
> Key: SPARK-10787
> URL: https://issues.apache.org/jira/browse/SPARK-10787
> Project: Spark
>  Issue Type: Bug
>Reporter: Ted Yu
>
> In the mailing list thread "Spark ClosureCleaner or java serializer OOM when 
> trying to grow", Jay Luan reported that ClosureCleaner#ensureSerializable() 
> resulted in an OOME.
> The cause was that ObjectOutputStream keeps a strong reference to every 
> object that was written to it.
> This issue tries to avoid the OOME by calling reset() more often.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10787) Reset ObjectOutputStream more often to prevent OOME

2015-09-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905769#comment-14905769
 ] 

Apache Spark commented on SPARK-10787:
--

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/8896

> Reset ObjectOutputStream more often to prevent OOME
> ---
>
> Key: SPARK-10787
> URL: https://issues.apache.org/jira/browse/SPARK-10787
> Project: Spark
>  Issue Type: Bug
>Reporter: Ted Yu
>
> In the mailing list thread "Spark ClosureCleaner or java serializer OOM when 
> trying to grow", Jay Luan reported that ClosureCleaner#ensureSerializable() 
> resulted in an OOME.
> The cause was that ObjectOutputStream keeps a strong reference to every 
> object that was written to it.
> This issue tries to avoid the OOME by calling reset() more often.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10787) Reset ObjectOutputStream more often to prevent OOME

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10787:


Assignee: Apache Spark

> Reset ObjectOutputStream more often to prevent OOME
> ---
>
> Key: SPARK-10787
> URL: https://issues.apache.org/jira/browse/SPARK-10787
> Project: Spark
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Apache Spark
>
> In the mailing list thread "Spark ClosureCleaner or java serializer OOM when 
> trying to grow", Jay Luan reported that ClosureCleaner#ensureSerializable() 
> resulted in an OOME.
> The cause was that ObjectOutputStream keeps a strong reference to every 
> object that was written to it.
> This issue tries to avoid the OOME by calling reset() more often.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9798) CrossValidatorModel Documentation Improvements

2015-09-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9798:
-
Shepherd: Joseph K. Bradley
Target Version/s: 1.6.0

> CrossValidatorModel Documentation Improvements
> --
>
> Key: SPARK-9798
> URL: https://issues.apache.org/jira/browse/SPARK-9798
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>Priority: Minor
>  Labels: starter
>
> CrossValidatorModel's avgMetrics and bestModel need documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10731) The head() implementation of dataframe is very slow

2015-09-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10731.
-
   Resolution: Fixed
 Assignee: Reynold Xin
Fix Version/s: 1.5.1
   1.6.0

> The head() implementation of dataframe is very slow
> ---
>
> Key: SPARK-10731
> URL: https://issues.apache.org/jira/browse/SPARK-10731
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Jerry Lam
>Assignee: Reynold Xin
>  Labels: pyspark
> Fix For: 1.6.0, 1.5.1
>
>
> {code}
> df=sqlContext.read.parquet("someparquetfiles")
> df.head()
> {code}
> The above lines take over 15 minutes. It seems the dataframe requires 3 
> stages to return the first row. It reads all data (which is about 1 billion 
> rows) and run Limit twice. The take(1) implementation in the RDD performs 
> much better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9798) CrossValidatorModel Documentation Improvements

2015-09-23 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905573#comment-14905573
 ] 

Joseph K. Bradley commented on SPARK-9798:
--

This is a very small task, so I don't think it should be split.  Perhaps 
another doc or starter task?  Thanks!

> CrossValidatorModel Documentation Improvements
> --
>
> Key: SPARK-9798
> URL: https://issues.apache.org/jira/browse/SPARK-9798
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>Priority: Minor
>  Labels: starter
>
> CrossValidatorModel's avgMetrics and bestModel need documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10741:


Assignee: (was: Apache Spark)

> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
>   def testParquetOrderBy() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedOrderBy =
>   """ SELECT c1, avg ( c2 ) c_avg
> | FROM test
> | GROUP BY c1
> | ORDER BY avg ( c2 )""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedOrderBy).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
> from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) 
> AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905479#comment-14905479
 ] 

Apache Spark commented on SPARK-10741:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/8889

> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
>   def testParquetOrderBy() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedOrderBy =
>   """ SELECT c1, avg ( c2 ) c_avg
> | FROM test
> | GROUP BY c1
> | ORDER BY avg ( c2 )""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedOrderBy).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
> from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) 
> AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10741:


Assignee: Apache Spark

> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>Assignee: Apache Spark
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
>   def testParquetOrderBy() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedOrderBy =
>   """ SELECT c1, avg ( c2 ) c_avg
> | FROM test
> | GROUP BY c1
> | ORDER BY avg ( c2 )""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedOrderBy).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
> from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) 
> AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10699) Support checkpointInterval can be disabled

2015-09-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-10699.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8820
[https://github.com/apache/spark/pull/8820]

> Support checkpointInterval can be disabled
> --
>
> Key: SPARK-10699
> URL: https://issues.apache.org/jira/browse/SPARK-10699
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 1.6.0
>
>
> Currently users can set checkpointInterval to specify how often the cache 
> should be checkpointed. But users also need a way to disable it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10785) Scale QuantileDiscretizer using distributed binning

2015-09-23 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-10785:
-

 Summary: Scale QuantileDiscretizer using distributed binning
 Key: SPARK-10785
 URL: https://issues.apache.org/jira/browse/SPARK-10785
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley


[SPARK-10064] improves binning in decision trees by distributing the 
computation.  QuantileDiscretizer should do the same.
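
As a rough sketch of what distributed binning could look like for a single 
continuous column (illustrative only; the helper name and strategy are not the 
spark.ml implementation, and a full sort is used purely for clarity):

{code}
import org.apache.spark.rdd.RDD

// Pick numBins - 1 cut points at evenly spaced ranks without collecting the
// column to the driver: sort the column across the cluster, attach ranks, and
// keep only the values whose rank matches a target quantile position.
def distributedBinBoundaries(col: RDD[Double], numBins: Int): Array[Double] = {
  val n = col.count()
  require(n > 0, "empty column")
  val ranked = col.sortBy(identity).zipWithIndex()              // (value, rank)
  val targetRanks = (1 until numBins).map(i => i * n / numBins).toSet
  ranked.filter { case (_, rank) => targetRanks.contains(rank) }
        .map(_._1)
        .collect()
        .sorted
        .distinct
}
{code}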



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5890) Add QuantileDiscretizer

2015-09-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5890:
-
Description: 
A `QuantileDiscretizer` takes a column with continuous features and outputs a 
column with binned categorical features.

{code}
val fd = new QuantileDiscretizer()
  .setInputCol("age")
  .setNumBins(32)
  .setOutputCol("ageBins")
{code}

This should be an automatic feature discretizer, which uses a simple algorithm 
like approximate quantiles to discretize features. It should set the ML 
attribute correctly in the output column.

  was:
A `FeatureDiscretizer` takes a column with continuous features and outputs a 
column with binned categorical features.

{code}
val fd = new FeatureDiscretizer()
  .setInputCol("age")
  .setNumBins(32)
  .setOutputCol("ageBins")
{code}

This should an automatic feature discretizer, which uses a simple algorithm 
like approximate quantiles to discretize features. It should set the ML 
attribute correctly in the output column.


> Add QuantileDiscretizer
> ---
>
> Key: SPARK-5890
> URL: https://issues.apache.org/jira/browse/SPARK-5890
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>
> A `QuantileDiscretizer` takes a column with continuous features and outputs a 
> column with binned categorical features.
> {code}
> val fd = new QuantileDiscretizer()
>   .setInputCol("age")
>   .setNumBins(32)
>   .setOutputCol("ageBins")
> {code}
> This should be an automatic feature discretizer, which uses a simple algorithm 
> like approximate quantiles to discretize features. It should set the ML 
> attribute correctly in the output column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5890) Add QuantileDiscretizer

2015-09-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5890:
-
Summary: Add QuantileDiscretizer  (was: Add FeatureDiscretizer)

> Add QuantileDiscretizer
> ---
>
> Key: SPARK-5890
> URL: https://issues.apache.org/jira/browse/SPARK-5890
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>
> A `FeatureDiscretizer` takes a column with continuous features and outputs a 
> column with binned categorical features.
> {code}
> val fd = new FeatureDiscretizer()
>   .setInputCol("age")
>   .setNumBins(32)
>   .setOutputCol("ageBins")
> {code}
> This should be an automatic feature discretizer, which uses a simple algorithm 
> like approximate quantiles to discretize features. It should set the ML 
> attribute correctly in the output column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9841) Params.clear needs to be public

2015-09-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9841:
-
Shepherd: Joseph K. Bradley
Assignee: holdenk

> Params.clear needs to be public
> ---
>
> Key: SPARK-9841
> URL: https://issues.apache.org/jira/browse/SPARK-9841
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: holdenk
>
> It is currently impossible to clear Param values once set.  It would be 
> helpful to be able to.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8115) Remove TestData

2015-09-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8115:
---
Target Version/s: 1.6.0  (was: 1.6.0, 1.5.1)

> Remove TestData
> ---
>
> Key: SPARK-8115
> URL: https://issues.apache.org/jira/browse/SPARK-8115
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Andrew Or
>Priority: Minor
>
> TestData was from the era when we didn't have easy ways to generate test 
> datasets. Now that we have implicits on Seq + toDF, it makes more sense to put 
> the test datasets closer to the test suites.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10538) java.lang.NegativeArraySizeException during join

2015-09-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10538:

Target Version/s: 1.5.2  (was: 1.5.1)

> java.lang.NegativeArraySizeException during join
> 
>
> Key: SPARK-10538
> URL: https://issues.apache.org/jira/browse/SPARK-10538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Maciej BryƄski
>Assignee: Davies Liu
> Attachments: screenshot-1.png
>
>
> Hi,
> I've got a problem joining tables in PySpark (in my example, 20 of them).
> I can observe that during the calculation of the first partition (on one of 
> the consecutive joins) there is a large shuffle read size (294.7 MB / 146 
> records) vs. the other partitions (approx. 272.5 KB / 113 records).
> I can also observe that just before the crash the Python process grows to a 
> few GB of RAM.
> After some time there is an exception:
> {code}
> java.lang.NegativeArraySizeException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90)
>   at 
> org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> I'm running this on a 2-node cluster (12 cores, 64 GB RAM)
> Config:
> {code}
> spark.driver.memory  10g
> spark.executor.extraJavaOptions -XX:-UseGCOverheadLimit -XX:+UseParallelGC 
> -Dfile.encoding=UTF8
> spark.executor.memory   60g
> spark.storage.memoryFraction0.05
> spark.shuffle.memoryFraction0.75
> spark.driver.maxResultSize  10g  
> spark.cores.max 24
> spark.kryoserializer.buffer.max 1g
> spark.default.parallelism   200
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10692) Failed batches are never reported through the StreamingListener interface

2015-09-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905653#comment-14905653
 ] 

Apache Spark commented on SPARK-10692:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/8892

> Failed batches are never reported through the StreamingListener interface
> -
>
> Key: SPARK-10692
> URL: https://issues.apache.org/jira/browse/SPARK-10692
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Tathagata Das
>Assignee: Shixiong Zhu
>Priority: Critical
>
> If an output operation fails, then the corresponding batch is never marked as 
> completed, as the data structures are not updated properly.
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L183



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10724) SQL's floor() returns DOUBLE

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10724:


Assignee: Apache Spark

> SQL's floor() returns DOUBLE
> 
>
> Key: SPARK-10724
> URL: https://issues.apache.org/jira/browse/SPARK-10724
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Assignee: Apache Spark
>Priority: Critical
>  Labels: sql
>
> This is a change in behavior from 1.4.1 where {{floor}} returns a BIGINT. 
> {code}
> scala> sql("select floor(1)").printSchema
> root
>  |-- _c0: double (nullable = true)
> {code}
> In the [Hive Language 
> Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF] 
> {{floor}} is defined to return BIGINT.
> This is a significant issue because it changes the DataFrame schema.
> I wonder what caused this and whether other SQL functions are affected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10724) SQL's floor() returns DOUBLE

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10724:


Assignee: (was: Apache Spark)

> SQL's floor() returns DOUBLE
> 
>
> Key: SPARK-10724
> URL: https://issues.apache.org/jira/browse/SPARK-10724
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Priority: Critical
>  Labels: sql
>
> This is a change in behavior from 1.4.1 where {{floor}} returns a BIGINT. 
> {code}
> scala> sql("select floor(1)").printSchema
> root
>  |-- _c0: double (nullable = true)
> {code}
> In the [Hive Language 
> Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF] 
> {{floor}} is defined to return BIGINT.
> This is a significant issue because it changes the DataFrame schema.
> I wonder what caused this and whether other SQL functions are affected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10724) SQL's floor() returns DOUBLE

2015-09-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905658#comment-14905658
 ] 

Apache Spark commented on SPARK-10724:
--

User 'navis' has created a pull request for this issue:
https://github.com/apache/spark/pull/8893

> SQL's floor() returns DOUBLE
> 
>
> Key: SPARK-10724
> URL: https://issues.apache.org/jira/browse/SPARK-10724
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Priority: Critical
>  Labels: sql
>
> This is a change in behavior from 1.4.1 where {{floor}} returns a BIGINT. 
> {code}
> scala> sql("select floor(1)").printSchema
> root
>  |-- _c0: double (nullable = true)
> {code}
> In the [Hive Language 
> Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF] 
> {{floor}} is defined to return BIGINT.
> This is a significant issue because it changes the DataFrame schema.
> I wonder what caused this and whether other SQL functions are affected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module

2015-09-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-6028.

   Resolution: Fixed
Fix Version/s: 1.6.0

> Provide an alternative RPC implementation based on the network transport 
> module
> ---
>
> Key: SPARK-6028
> URL: https://issues.apache.org/jira/browse/SPARK-6028
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 1.6.0
>
>
> Network transport module implements a low level RPC interface. We can build a 
> new RPC implementation on top of that to replace Akka's.
> Design document: 
> https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10692) Failed batches are never reported through the StreamingListener interface

2015-09-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10692:

Priority: Critical  (was: Blocker)

> Failed batches are never reported through the StreamingListener interface
> -
>
> Key: SPARK-10692
> URL: https://issues.apache.org/jira/browse/SPARK-10692
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Tathagata Das
>Assignee: Shixiong Zhu
>Priority: Critical
>
> If an output operation fails, then the corresponding batch is never marked as 
> completed, as the data structures are not updated properly.
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L183



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10043) Add window functions into SparkR

2015-09-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10043:

Target Version/s: 1.6.0  (was: 1.5.1, 1.6.0)

> Add window functions into SparkR
> 
>
> Key: SPARK-10043
> URL: https://issues.apache.org/jira/browse/SPARK-10043
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yu Ishikawa
>
> Add window functions as follows in SparkR. I think we should also improve the 
> {{collect}} function in SparkR.
> - lead
> - cumuDist
> - denseRank
> - lag
> - ntile
> - percentRank
> - rank
> - rowNumber



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10741:
-
Assignee: Wenchen Fan

> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>Assignee: Wenchen Fan
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
>   def testParquetOrderBy() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedOrderBy =
>   """ SELECT c1, avg ( c2 ) c_avg
> | FROM test
> | GROUP BY c1
> | ORDER BY avg ( c2 )""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedOrderBy).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
> from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) 
> AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10786) SparkSQLCLIDriver should take the whole statement to generate the CommandProcessor

2015-09-23 Thread SaintBacchus (JIRA)
SaintBacchus created SPARK-10786:


 Summary: SparkSQLCLIDriver should take the whole statement to 
generate the CommandProcessor
 Key: SPARK-10786
 URL: https://issues.apache.org/jira/browse/SPARK-10786
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: SaintBacchus
Priority: Minor


In the current implementation of SparkSQLCLIDriver.scala: 
*val proc: CommandProcessor = CommandProcessorFactory.get(Array(tokens(0)), 
hconf)*

*CommandProcessorFactory* only receives the first token of the statement, which 
makes it hard to distinguish the statements *delete jar xxx* and *delete from xxx*.
So it may be better to pass the whole statement to the 
*CommandProcessorFactory*.
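
A toy illustration of the ambiguity (the dispatch below is hypothetical and the 
returned labels are placeholders, not the actual Hive CommandProcessor classes):

{code}
// Keyed on the first token alone, both statements look identical.
def firstToken(stmt: String): String = stmt.trim.split("\\s+")(0).toLowerCase

firstToken("delete jar my_udfs.jar") == firstToken("delete from my_table")  // true

// Looking at the whole statement (here just the first two tokens) removes the
// ambiguity.
def dispatch(stmt: String): String =
  stmt.trim.split("\\s+").map(_.toLowerCase).take(2).toList match {
    case "delete" :: "jar" :: _  => "resource processor"
    case "delete" :: "from" :: _ => "sql processor"
    case _                       => "sql processor"
  }

dispatch("delete jar my_udfs.jar")   // "resource processor"
dispatch("delete from my_table")     // "sql processor"
{code}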



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10709) When loading a json dataset as a data frame, if the input path is wrong, the error message is very confusing

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10709:


Assignee: (was: Apache Spark)

> When loading a json dataset as a data frame, if the input path is wrong, the 
> error message is very confusing
> 
>
> Key: SPARK-10709
> URL: https://issues.apache.org/jira/browse/SPARK-10709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> If you do something like {{sqlContext.read.json("a wrong path")}}, when we 
> actually read data, the error message is 
> {code}
> java.io.IOException: No input paths specified in job
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:198)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:91)
>   at 
> org.apache.spark.sql.execution.ShuffledRowRDD.getDependencies(ShuffledRowRDD.scala:59)
>   at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:226)
>   at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:224)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.dependencies(RDD.scala:224)
>   at 
> org.apache.spark.scheduler.DAGScheduler.visit$2(DAGScheduler.scala:427)
>   at 
> org.apache.spark.scheduler.DAGScheduler.getAncestorShuffleDependencies(DAGScheduler.scala:442)
>   at 
> 

[jira] [Commented] (SPARK-10709) When loading a json dataset as a data frame, if the input path is wrong, the error message is very confusing

2015-09-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905803#comment-14905803
 ] 

Apache Spark commented on SPARK-10709:
--

User 'navis' has created a pull request for this issue:
https://github.com/apache/spark/pull/8899

> When loading a json dataset as a data frame, if the input path is wrong, the 
> error message is very confusing
> 
>
> Key: SPARK-10709
> URL: https://issues.apache.org/jira/browse/SPARK-10709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> If you do something like {{sqlContext.read.json("a wrong path")}}, when we 
> actually read data, the error message is 
> {code}
> java.io.IOException: No input paths specified in job
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:198)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:91)
>   at 
> org.apache.spark.sql.execution.ShuffledRowRDD.getDependencies(ShuffledRowRDD.scala:59)
>   at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:226)
>   at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:224)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.dependencies(RDD.scala:224)
>   at 
> org.apache.spark.scheduler.DAGScheduler.visit$2(DAGScheduler.scala:427)
>   at 
> 

[jira] [Assigned] (SPARK-10709) When loading a json dataset as a data frame, if the input path is wrong, the error message is very confusing

2015-09-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10709:


Assignee: Apache Spark

> When loading a json dataset as a data frame, if the input path is wrong, the 
> error message is very confusing
> 
>
> Key: SPARK-10709
> URL: https://issues.apache.org/jira/browse/SPARK-10709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> If you do something like {{sqlContext.read.json("a wrong path")}}, when we 
> actually read data, the error message is 
> {code}
> java.io.IOException: No input paths specified in job
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:198)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:91)
>   at 
> org.apache.spark.sql.execution.ShuffledRowRDD.getDependencies(ShuffledRowRDD.scala:59)
>   at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:226)
>   at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:224)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.dependencies(RDD.scala:224)
>   at 
> org.apache.spark.scheduler.DAGScheduler.visit$2(DAGScheduler.scala:427)
>   at 
> 

[jira] [Commented] (SPARK-10770) SparkPlan.executeCollect/executeTake should return InternalRow rather than external Row

2015-09-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905817#comment-14905817
 ] 

Apache Spark commented on SPARK-10770:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/8900

> SparkPlan.executeCollect/executeTake should return InternalRow rather than 
> external Row
> ---
>
> Key: SPARK-10770
> URL: https://issues.apache.org/jira/browse/SPARK-10770
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features

2015-09-23 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-10788:
-

 Summary: Decision Tree duplicates bins for unordered categorical 
features
 Key: SPARK-10788
 URL: https://issues.apache.org/jira/browse/SPARK-10788
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley


Decision trees in spark.ml (RandomForest.scala) effectively create a second 
copy of each split for unordered categorical features. E.g., if there are 3 
categories A, B, C, then we should consider 3 splits:
* A vs. B, C
* A, B vs. C
* A, C vs. B

Currently, we also consider the 3 flipped splits:
* B,C vs. A
* C vs. A, B
* B vs. A, C

This means we communicate twice as much data as needed for these features.

We should eliminate these duplicate splits within the spark.ml implementation 
since the spark.mllib implementation will be removed before long (and will 
instead call into spark.ml).
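
A small sketch of the intended enumeration (illustrative only, not the code in 
RandomForest.scala): pinning one category to the left-hand side yields each 
2-way partition exactly once, i.e. 2^(k-1) - 1 splits instead of 2^k - 2.

{code}
// Enumerate each 2-way split of k unordered categories exactly once.
// Category 0 is always placed in the left subset, so a split and its mirror
// image are never both generated.
def distinctSplits(k: Int): Seq[(Set[Int], Set[Int])] = {
  val all = (0 until k).toSet
  val numSplits = (1 << (k - 1)) - 1
  (0 until numSplits).map { mask =>
    // bit (i - 1) of mask decides whether category i joins category 0 on the left
    val left = Set(0) ++ (1 until k).filter(i => ((mask >> (i - 1)) & 1) == 1)
    (left, all -- left)
  }
}

// distinctSplits(3) gives exactly the 3 splits above (with A=0, B=1, C=2):
//   {A} vs {B, C}, {A, B} vs {C}, {A, C} vs {B}
{code}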



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


