[jira] [Commented] (SPARK-10644) Applications wait even if free executors are available

2015-09-21 Thread Madhusudanan Kandasamy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900277#comment-14900277
 ] 

Madhusudanan Kandasamy commented on SPARK-10644:


The standalone cluster manager launches executors during an application's 
registration. So when you say the number of executors is 63, do you mean the total 
number of cores available across all the workers?

Can you give the values of spark.executor.cores and spark.cores.max?
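
For reference, these are the two properties in question; a minimal sketch of how an application could cap itself for this scenario (the property names are real, the values are only the ones from this report):

{code}
// Hypothetical per-application settings: each of the 4 applications caps
// itself at 10 cores so the scheduler can serve all of them.
val conf = new org.apache.spark.SparkConf()
  .set("spark.cores.max", "10")       // total cores this application may use
  .set("spark.executor.cores", "1")   // cores per executor on each worker
{code}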

> Applications wait even if free executors are available
> --
>
> Key: SPARK-10644
> URL: https://issues.apache.org/jira/browse/SPARK-10644
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.5.0
> Environment: RHEL 6.5 64 bit
>Reporter: Balagopal Nair
>
> Number of workers: 21
> Number of executors: 63
> Steps to reproduce:
> 1. Run 4 jobs each with max cores set to 10
> 2. The first 3 jobs run with 10 each. (30 executors consumed so far)
> 3. The 4th job waits even though there are 33 idle executors.
> The reason is that a job will not get executors unless 
> the total number of EXECUTORS in use < the number of WORKERS
> If there are executors available, resources should be allocated to the 
> pending job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10723) Add RDD.reduceOption method

2015-09-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900362#comment-14900362
 ] 

Sean Owen commented on SPARK-10723:
---

You can call {{RDD.isEmpty}} to check if it's empty. I don't think this is 
worth adding another API method for.
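
As a reference point, Option-style behavior is available today without a new API method; a minimal sketch (the helper name is made up for illustration):

{code}
import org.apache.spark.rdd.RDD

// Hypothetical helper: None for an empty RDD, Some(reduced value) otherwise.
// Note that isEmpty and reduce each trigger a job, so cache the RDD if it is
// expensive to recompute.
def reduceOption[T](rdd: RDD[T])(f: (T, T) => T): Option[T] =
  if (rdd.isEmpty()) None else Some(rdd.reduce(f))
{code}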

> Add RDD.reduceOption method
> ---
>
> Key: SPARK-10723
> URL: https://issues.apache.org/jira/browse/SPARK-10723
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Reporter: Tatsuya Atsumi
>Priority: Minor
>
> h2. Problem
> RDD.reduce throws an exception if the RDD is empty.
> This is appropriate behavior if the RDD is expected to be non-empty, but when it 
> is not known until runtime whether the RDD is empty, the call to reduce must be 
> wrapped in a try-catch to be safe. 
> Example Code
> {code}
> // This RDD may be empty or not
> val rdd: RDD[Int] = originalRdd.filter(_ > 10)
> val reduced: Option[Int] = try {
>   Some(rdd.reduce(_ + _))
> } catch {
>   // if rdd is empty return None.
>   case e:UnsupportedOperationException => None
> }
> {code}
> h2. Improvement idea
> Scala’s List has a reduceOption method, which returns None if the List is empty.
> If RDD had a similar reduceOption API, the above case would become easy to 
> handle.
> Example Code
> {code}
> val reduced: Option[Int] = originalRdd.filter(_ > 10).reduceOption(_ + _)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10711) Do not assume spark.submit.deployMode is always set

2015-09-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10711.
-
   Resolution: Fixed
 Assignee: Hossein Falaki
Fix Version/s: 1.6.0

> Do not assume spark.submit.deployMode is always set
> ---
>
> Key: SPARK-10711
> URL: https://issues.apache.org/jira/browse/SPARK-10711
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.5.0
>Reporter: Hossein Falaki
>Assignee: Hossein Falaki
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
>
> In RRDD.createRProcess() we call RUtils.sparkRPackagePath(), which assumes 
> "... that Spark properties `spark.master` and `spark.submit.deployMode` are 
> set."
> It is better to assume safe defaults if they are not set.
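
A minimal sketch of the "safe defaults" idea, assuming a SparkConf named conf; the fallback values chosen here ("local", "client") are illustrative, not necessarily what the fix uses:

{code}
// Fall back to defaults instead of assuming the properties are always set.
val master     = conf.get("spark.master", "local")
val deployMode = conf.get("spark.submit.deployMode", "client")
{code}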



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10740) handle nondeterministic expressions correctly for set operations

2015-09-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-10740:

Description: 
We should only push down deterministic filter conditions to set operators.
For Union, say we apply a non-deterministic filter to 1...5 union 1...5, and 
we may get 1,3 for the left side and 2,4 for the right side; the result 
should then be 1,3,2,4. If we push down this filter, we get 1,3 for both sides and 
the result would be 1,3,1,3.
For Intersect, say there is a non-deterministic condition with a 0.5 
probability of accepting a row, and a row that is present on both sides of an 
Intersect. Once we push down this condition, the probability of accepting that row 
drops to 0.25.
For Except, say there is a row that is present on both sides of an Except. 
This row should not be in the final output. However, if we push down a 
non-deterministic condition, it is possible that this row is rejected from one 
side, and then we output a row that should not be part of the result.

 We should only push down deterministic projections to Union.
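
A small illustration of the Union case, assuming a Spark shell where sqlContext is available; whether the filter is evaluated above the Union or pushed into each side determines how many times the random condition is sampled:

{code}
import org.apache.spark.sql.functions.rand

val left  = sqlContext.range(1, 6)   // rows 1..5
val right = sqlContext.range(1, 6)   // rows 1..5

// Evaluated above the Union: the combined 10 rows are sampled once.
val filteredAfterUnion = left.unionAll(right).filter(rand() < 0.5)

// If the optimizer pushed the same non-deterministic filter into each side,
// each side would be sampled independently and the two samples could coincide,
// e.g. producing 1,3,1,3 instead of 1,3,2,4.
{code}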


> handle nondeterministic expressions correctly for set operations
> 
>
> Key: SPARK-10740
> URL: https://issues.apache.org/jira/browse/SPARK-10740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>
> We should only push down deterministic filter conditions to set operators.
> For Union, say we apply a non-deterministic filter to 1...5 union 1...5, 
> and we may get 1,3 for the left side and 2,4 for the right side; the 
> result should then be 1,3,2,4. If we push down this filter, we get 1,3 for both 
> sides and the result would be 1,3,1,3.
> For Intersect, say there is a non-deterministic condition with a 0.5 
> probability of accepting a row, and a row that is present on both sides of 
> an Intersect. Once we push down this condition, the probability of accepting 
> that row drops to 0.25.
> For Except, say there is a row that is present on both sides of an 
> Except. This row should not be in the final output. However, if we push down a 
> non-deterministic condition, it is possible that this row is rejected from 
> one side, and then we output a row that should not be part of the result.
>  We should only push down deterministic projections to Union.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-21 Thread Ian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian updated SPARK-10741:

Description: 
Failed Query with Having Clause
{code}
  def testParquetHaving() {
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val failedHaving =
  """ SELECT c1, avg ( c2 ) as c_avg
| FROM test
| GROUP BY c1
| HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
TestHive.sql(ddl)
TestHive.sql(failedHaving).collect
  }

org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
bigint)) > cast(5 as double)) as boolean) AS 
havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)

{code}

Failed Query with OrderBy
{code}
  @Test
  def testParquetOrderBy() {
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val failedOrderBy =
  """ SELECT c1, avg ( c2 ) c_avg
| FROM test
| GROUP BY c1
| ORDER BY avg ( c2 )""".stripMargin
TestHive.sql(ddl)
TestHive.sql(failedOrderBy).collect
  }

org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) AS 
aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
{code}

  was:
Failed Query with Having Clause
{code}
  def testParquetHaving() {
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val failedHaving =
  """ SELECT c1, avg ( c2 ) as c_avg
| FROM test
| GROUP BY c1
| HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
TestHive.sql(ddl)
TestHive.sql(failedHaving).collect
  }

org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
bigint)) > cast(5 as double)) as boolean) AS 
havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)

{code}

Failed Query with OrderBy
{code}
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val filedOrderBy =
  """ SELECT c1, avg ( c2 ) c_avg
| FROM test
| GROUP BY c1
| ORDER BY avg ( c2 )""".stripMargin
TestHive.sql(ddl)
TestHive.sql(filedOrderBy).collect

org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) AS 
aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
{code}


> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 

[jira] [Updated] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-21 Thread Ian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian updated SPARK-10741:

Description: 
Failed Query with Having Clause
{code}
  def testParquetHaving() {
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val failedHaving =
  """ SELECT c1, avg ( c2 ) as c_avg
| FROM test
| GROUP BY c1
| HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
TestHive.sql(ddl)
TestHive.sql(failedHaving).collect
  }

org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
bigint)) > cast(5 as double)) as boolean) AS 
havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)

{code}

Failed Query with OrderBy
{code}
  def testParquetOrderBy() {
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val failedOrderBy =
  """ SELECT c1, avg ( c2 ) c_avg
| FROM test
| GROUP BY c1
| ORDER BY avg ( c2 )""".stripMargin
TestHive.sql(ddl)
TestHive.sql(failedOrderBy).collect
  }

org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) AS 
aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
{code}

  was:
Failed Query with Having Clause
{code}
  def testParquetHaving() {
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val failedHaving =
  """ SELECT c1, avg ( c2 ) as c_avg
| FROM test
| GROUP BY c1
| HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
TestHive.sql(ddl)
TestHive.sql(failedHaving).collect
  }

org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
bigint)) > cast(5 as double)) as boolean) AS 
havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)

{code}

Failed Query with OrderBy
{code}
  @Test
  def testParquetOrderBy() {
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val failedOrderBy =
  """ SELECT c1, avg ( c2 ) c_avg
| FROM test
| GROUP BY c1
| ORDER BY avg ( c2 )""".stripMargin
TestHive.sql(ddl)
TestHive.sql(failedOrderBy).collect
  }

org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) AS 
aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
{code}


> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], 

[jira] [Assigned] (SPARK-10742) Add the ability to embed HTML relative links in job descriptions

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10742:


Assignee: Apache Spark  (was: Tathagata Das)

> Add the ability to embed HTML relative links in job descriptions
> 
>
> Key: SPARK-10742
> URL: https://issues.apache.org/jira/browse/SPARK-10742
> Project: Spark
>  Issue Type: Improvement
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Minor
>
> This is to allow embedding links to other Spark UI tabs within the job 
> description. For example, streaming jobs could set descriptions with links 
> pointing to the corresponding details page of the batch that the job belongs 
> to.
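
For context, SparkContext already has a setJobDescription method; the proposal is about letting the UI render relative links placed inside that description. A hypothetical usage, assuming a SparkContext named sc (the markup and URL path are illustrative, not the final format):

{code}
// Hypothetical: a description containing a relative link back to a streaming
// batch details page. Today this string is rendered as plain text in the UI.
sc.setJobDescription(
  """Streaming job for batch <a href="/streaming/batch/?id=1442786040000">1442786040000</a>""")
{code}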



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10743) keep the name of the expression if possible when doing a cast

2015-09-21 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-10743:
---

 Summary: keep the name of the expression if possible when doing a cast
 Key: SPARK-10743
 URL: https://issues.apache.org/jira/browse/SPARK-10743
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10739) Add attempt window for long running Spark application on Yarn

2015-09-21 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901901#comment-14901901
 ] 

Sandy Ryza commented on SPARK-10739:


I recall there was a JIRA similar to this that avoided killing the application 
when we reached a certain number of executor failures.  However, IIUC, this is 
about something different: deciding whether to have YARN restart the 
application when it fails.

> Add attempt window for long running Spark application on Yarn
> -
>
> Key: SPARK-10739
> URL: https://issues.apache.org/jira/browse/SPARK-10739
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently Spark on Yarn uses max attempts to control the failure count: if an 
> application's failure count reaches the max attempts, the application will 
> not be recovered by the RM. This is not very effective for long running 
> applications, since they can easily exceed the max number over a long time 
> period, and setting a very large max attempts will hide the real problem.
> So here we introduce an attempt window to control the application attempt count, 
> which ignores attempts outside the window. It is introduced in Hadoop 2.6+ 
> to support long running applications, and it is quite useful for Spark Streaming 
> and Spark shell like applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based

2015-09-21 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901959#comment-14901959
 ] 

Yi Zhou commented on SPARK-10733:
-

1) I have never set `spark.task.cpus` and only use the default.
2) scale factor=1000 (1TB data set)
3) Spark conf like below
spark.shuffle.manager=sort
spark.sql.hive.metastore.version=1.1.0
spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
spark.executor.extraClassPath=/etc/hive/conf
spark.driver.extraClassPath=/etc/hive/conf
spark.serializer=org.apache.spark.serializer.KryoSerializer



> TungstenAggregation cannot acquire page after switching to sort-based
> -
>
> Key: SPARK-10733
> URL: https://issues.apache.org/jira/browse/SPARK-10733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> This is uncovered after fixing SPARK-10474. Stack trace:
> {code}
> 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 
> 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 
> bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based

2015-09-21 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901959#comment-14901959
 ] 

Yi Zhou edited comment on SPARK-10733 at 9/22/15 5:18 AM:
--

1) I have never set `spark.task.cpus` and only use the default.
2) scale factor=1000 (1TB data set)
3) Spark conf like below
spark.shuffle.manager=sort
spark.sql.hive.metastore.version=1.1.0
spark.sql.hive.metastore.jars=/usr/lib/hive/lib/\*:/usr/lib/hadoop/client/\*
spark.executor.extraClassPath=/etc/hive/conf
spark.driver.extraClassPath=/etc/hive/conf
spark.serializer=org.apache.spark.serializer.KryoSerializer




was (Author: jameszhouyi):
1) I have never set `spark.task.cpus` and only use the default.
2) scale factor=1000 (1TB data set)
3) Spark conf like below
spark.shuffle.manager=sort
spark.sql.hive.metastore.version=1.1.0
spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
spark.executor.extraClassPath=/etc/hive/conf
spark.driver.extraClassPath=/etc/hive/conf
spark.serializer=org.apache.spark.serializer.KryoSerializer



> TungstenAggregation cannot acquire page after switching to sort-based
> -
>
> Key: SPARK-10733
> URL: https://issues.apache.org/jira/browse/SPARK-10733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> This is uncovered after fixing SPARK-10474. Stack trace:
> {code}
> 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 
> 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 
> bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10731) The head() implementation of dataframe is very slow

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10731:


Assignee: Apache Spark

> The head() implementation of dataframe is very slow
> ---
>
> Key: SPARK-10731
> URL: https://issues.apache.org/jira/browse/SPARK-10731
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Jerry Lam
>Assignee: Apache Spark
>  Labels: pyspark
>
> {code}
> df=sqlContext.read.parquet("someparquetfiles")
> df.head()
> {code}
> The above lines take over 15 minutes. It seems the dataframe requires 3 
> stages to return the first row. It reads all data (which is about 1 billion 
> rows) and runs Limit twice. The take(1) implementation in the RDD performs 
> much better.
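
As a point of comparison while this is open, the RDD route mentioned in the description can be used directly; a minimal sketch in Scala (the parquet path is the placeholder from the report):

{code}
// Go through the RDD API, which only scans as many partitions as needed to
// return one row, instead of the multi-stage Limit plan behind DataFrame.head().
val df = sqlContext.read.parquet("someparquetfiles")
val firstRow = df.rdd.take(1)
{code}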



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10731) The head() implementation of dataframe is very slow

2015-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901983#comment-14901983
 ] 

Apache Spark commented on SPARK-10731:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/8852

> The head() implementation of dataframe is very slow
> ---
>
> Key: SPARK-10731
> URL: https://issues.apache.org/jira/browse/SPARK-10731
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Jerry Lam
>  Labels: pyspark
>
> {code}
> df=sqlContext.read.parquet("someparquetfiles")
> df.head()
> {code}
> The above lines take over 15 minutes. It seems the dataframe requires 3 
> stages to return the first row. It reads all data (which is about 1 billion 
> rows) and runs Limit twice. The take(1) implementation in the RDD performs 
> much better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10731) The head() implementation of dataframe is very slow

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10731:


Assignee: (was: Apache Spark)

> The head() implementation of dataframe is very slow
> ---
>
> Key: SPARK-10731
> URL: https://issues.apache.org/jira/browse/SPARK-10731
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Jerry Lam
>  Labels: pyspark
>
> {code}
> df=sqlContext.read.parquet("someparquetfiles")
> df.head()
> {code}
> The above lines take over 15 minutes. It seems the dataframe requires 3 
> stages to return the first row. It reads all data (which is about 1 billion 
> rows) and runs Limit twice. The take(1) implementation in the RDD performs 
> much better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10739) Add attempt window for long running Spark application on Yarn

2015-09-21 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901944#comment-14901944
 ] 

Saisai Shao commented on SPARK-10739:
-

[~srowen] I'm not sure whether the previous JIRA you mentioned is exactly the 
same thing as what I mentioned here. 

This JIRA is about letting YARN know that this is a long running service and 
ignore out-of-window failures (YARN-611), which is probably different from the 
thing you mentioned.
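
For reference, the YARN-611 hook being discussed is the attempt-failures validity interval on the application submission context in Hadoop 2.6+; a minimal sketch of the client-side call (the window and attempt values are illustrative):

{code}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext

// Attempts that failed more than `windowMs` ago no longer count against the
// maximum, which is what makes long running services practical.
def configureAttemptWindow(appContext: ApplicationSubmissionContext, windowMs: Long): Unit = {
  appContext.setMaxAppAttempts(4)                         // illustrative value
  appContext.setAttemptFailuresValidityInterval(windowMs)
}
{code}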

> Add attempt window for long running Spark application on Yarn
> -
>
> Key: SPARK-10739
> URL: https://issues.apache.org/jira/browse/SPARK-10739
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently Spark on Yarn uses max attempts to control the failure count: if an 
> application's failure count reaches the max attempts, the application will 
> not be recovered by the RM. This is not very effective for long running 
> applications, since they can easily exceed the max number over a long time 
> period, and setting a very large max attempts will hide the real problem.
> So here we introduce an attempt window to control the application attempt count, 
> which ignores attempts outside the window. It is introduced in Hadoop 2.6+ 
> to support long running applications, and it is quite useful for Spark Streaming 
> and Spark shell like applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9595) Adding API to SparkConf for kryo serializers registration

2015-09-21 Thread John Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901979#comment-14901979
 ] 

John Chen commented on SPARK-9595:
--

Yes, thanks a lot.

> Adding API to SparkConf for kryo serializers registration
> -
>
> Key: SPARK-9595
> URL: https://issues.apache.org/jira/browse/SPARK-9595
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1
>Reporter: John Chen
>Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Currently SparkConf has a registerKryoClasses API for kryo registration. 
> However, this only works when you register classes. If you want to register 
> customized kryo serializers, you have to extend the KryoSerializer class 
> and write some code.
> This is not only very inconvenient, but also requires the registration to be 
> done at compile time, which is not always possible. Thus, I suggest adding 
> another API to SparkConf for registering customized kryo serializers. It could 
> look like this:
> def registerKryoSerializers(serializers: Map[Class[_], Serializer]): SparkConf
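
For comparison, registering a custom serializer today goes through a registrator compiled into the application; a minimal sketch (MyClass and its serializer are hypothetical placeholders defined here only for illustration):

{code}
import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical data type and custom serializer, for illustration only.
case class MyClass(value: Int)

class MyClassSerializer extends Serializer[MyClass] {
  override def write(kryo: Kryo, output: Output, obj: MyClass): Unit =
    output.writeInt(obj.value)
  override def read(kryo: Kryo, input: Input, clazz: Class[MyClass]): MyClass =
    MyClass(input.readInt())
}

// Today's route: a registrator compiled into the application and named in the
// configuration, e.g. conf.set("spark.kryo.registrator", classOf[MyRegistrator].getName)
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit =
    kryo.register(classOf[MyClass], new MyClassSerializer)
}
{code}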



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10739) Add attempt window for long running Spark application on Yarn

2015-09-21 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-10739:
---

 Summary: Add attempt window for long running Spark application on 
Yarn
 Key: SPARK-10739
 URL: https://issues.apache.org/jira/browse/SPARK-10739
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Saisai Shao
Priority: Minor


Currently Spark on Yarn uses max attempts to control the failure count: if an 
application's failure count reaches the max attempts, the application will not 
be recovered by the RM. This is not very effective for long running applications, 
since they can easily exceed the max number, and setting a very large max attempts 
will hide the real problem.

So here we introduce an attempt window to control the application attempt count, 
which ignores attempts outside the window. It is introduced in Hadoop 2.6+ to 
support long running applications, and it is quite useful for Spark Streaming and 
Spark shell like applications.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10739) Add attempt window for long running Spark application on Yarn

2015-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901656#comment-14901656
 ] 

Apache Spark commented on SPARK-10739:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/8857

> Add attempt window for long running Spark application on Yarn
> -
>
> Key: SPARK-10739
> URL: https://issues.apache.org/jira/browse/SPARK-10739
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently Spark on Yarn uses max attempts to control the failure count: if an 
> application's failure count reaches the max attempts, the application will 
> not be recovered by the RM. This is not very effective for long running 
> applications, since they can easily exceed the max number over a long time 
> period, and setting a very large max attempts will hide the real problem.
> So here we introduce an attempt window to control the application attempt count, 
> which ignores attempts outside the window. It is introduced in Hadoop 2.6+ 
> to support long running applications, and it is quite useful for Spark Streaming 
> and Spark shell like applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10739) Add attempt window for long running Spark application on Yarn

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10739:


Assignee: (was: Apache Spark)

> Add attempt window for long running Spark application on Yarn
> -
>
> Key: SPARK-10739
> URL: https://issues.apache.org/jira/browse/SPARK-10739
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently Spark on Yarn uses max attempts to control the failure count: if an 
> application's failure count reaches the max attempts, the application will 
> not be recovered by the RM. This is not very effective for long running 
> applications, since they can easily exceed the max number over a long time 
> period, and setting a very large max attempts will hide the real problem.
> So here we introduce an attempt window to control the application attempt count, 
> which ignores attempts outside the window. It is introduced in Hadoop 2.6+ 
> to support long running applications, and it is quite useful for Spark Streaming 
> and Spark shell like applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10740) handle nondeterministic expressions correctly for set operations

2015-09-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-10740:

Description: 
We should only push down deterministic filter conditions to set operators.
For Union, say we apply a non-deterministic filter to 1...5 union 1...5, and 
we may get 1,3 for the left side and 2,4 for the right side; the result 
should then be 1,3,2,4. If we push down this filter, we get 1,3 for both sides (we 
create a new random object with the same seed on each side) and the result would be 
1,3,1,3.
For Intersect, say there is a non-deterministic condition with a 0.5 
probability of accepting a row, and a row that is present on both sides of an 
Intersect. Once we push down this condition, the probability of accepting that row 
drops to 0.25.
For Except, say there is a row that is present on both sides of an Except. 
This row should not be in the final output. However, if we push down a 
non-deterministic condition, it is possible that this row is rejected from one 
side, and then we output a row that should not be part of the result.

 We should only push down deterministic projections to Union.


  was:
We should only push down deterministic filter conditions to set operators.
For Union, say we apply a non-deterministic filter to 1...5 union 1...5, and 
we may get 1,3 for the left side and 2,4 for the right side; the result 
should then be 1,3,2,4. If we push down this filter, we get 1,3 for both sides and 
the result would be 1,3,1,3.
For Intersect, say there is a non-deterministic condition with a 0.5 
probability of accepting a row, and a row that is present on both sides of an 
Intersect. Once we push down this condition, the probability of accepting that row 
drops to 0.25.
For Except, say there is a row that is present on both sides of an Except. 
This row should not be in the final output. However, if we push down a 
non-deterministic condition, it is possible that this row is rejected from one 
side, and then we output a row that should not be part of the result.

 We should only push down deterministic projections to Union.



> handle nondeterministic expressions correctly for set operations
> 
>
> Key: SPARK-10740
> URL: https://issues.apache.org/jira/browse/SPARK-10740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>
> We should only push down deterministic filter conditions to set operators.
> For Union, say we apply a non-deterministic filter to 1...5 union 1...5, 
> and we may get 1,3 for the left side and 2,4 for the right side; the 
> result should then be 1,3,2,4. If we push down this filter, we get 1,3 for both 
> sides (we create a new random object with the same seed on each side) and the 
> result would be 1,3,1,3.
> For Intersect, say there is a non-deterministic condition with a 0.5 
> probability of accepting a row, and a row that is present on both sides of 
> an Intersect. Once we push down this condition, the probability of accepting 
> that row drops to 0.25.
> For Except, say there is a row that is present on both sides of an 
> Except. This row should not be in the final output. However, if we push down a 
> non-deterministic condition, it is possible that this row is rejected from 
> one side, and then we output a row that should not be part of the result.
>  We should only push down deterministic projections to Union.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10310) [Spark SQL] All result records will be populated into ONE line during the script transform due to missing the correct line/field delimiter

2015-09-21 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-10310:
---
Assignee: zhichao-li

> [Spark SQL] All result records will be populated into ONE line during the 
> script transform due to missing the correct line/field delimiter
> --
>
> Key: SPARK-10310
> URL: https://issues.apache.org/jira/browse/SPARK-10310
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yi Zhou
>Assignee: zhichao-li
>Priority: Critical
>
> There is a real case using a python stream script in a Spark SQL query. We found 
> that all result records were written in ONE line as input from the "select" 
> pipeline to the python script, so the script could not identify each 
> record. Also, the field separator in Spark SQL will be '^A' or '\001', which is 
> inconsistent/incompatible with the '\t' in the Hive implementation.
> Key query:
> {code:sql}
> CREATE VIEW temp1 AS
> SELECT *
> FROM
> (
>   FROM
>   (
> SELECT
>   c.wcs_user_sk,
>   w.wp_type,
>   (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
> FROM web_clickstreams c, web_page w
> WHERE c.wcs_web_page_sk = w.wp_web_page_sk
> AND   c.wcs_web_page_sk IS NOT NULL
> AND   c.wcs_user_sk IS NOT NULL
> AND   c.wcs_sales_sk IS NULL --abandoned implies: no sale
> DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
>   ) clicksAnWebPageType
>   REDUCE
> wcs_user_sk,
> tstamp_inSec,
> wp_type
>   USING 'python sessionize.py 3600'
>   AS (
> wp_type STRING,
> tstamp BIGINT, 
> sessionid STRING)
> ) sessionized
> {code}
> Key Python script:
> {noformat}
> for line in sys.stdin:
>  user_sk,  tstamp_str, value  = line.strip().split("\t")
> {noformat}
> Sample SELECT result:
> {noformat}
> ^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
> {noformat}
> Expected result:
> {noformat}
> 31   3237764860   feedback
> 31   3237769106   dynamic
> 31   3237779027   review
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10727) Dataframe count is zero after 'except' operation

2015-09-21 Thread Satish Kolli (JIRA)
Satish Kolli created SPARK-10727:


 Summary: Dataframe count is zero after 'except' operation
 Key: SPARK-10727
 URL: https://issues.apache.org/jira/browse/SPARK-10727
 Project: Spark
  Issue Type: Bug
Reporter: Satish Kolli


Data frame count after the except operation is always returning zero even when 
there is data in the resulting data frame.

{code}
scala> val df1 = sc.parallelize(1 to 5).toDF("V1")
df1: org.apache.spark.sql.DataFrame = [V1: int]

scala> val df2 = sc.parallelize(2 to 5).toDF("V2")
df2: org.apache.spark.sql.DataFrame = [V2: int]

scala> df1.except(df2).show()
+---+
| V1|
+---+
|  1|
+---+


scala> df1.except(df2).count()
res4: Long = 0

scala> df1.except(df2).rdd.count()
res5: Long = 1

scala>
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-21 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900562#comment-14900562
 ] 

Yi Zhou commented on SPARK-10474:
-

Hi [~andrewor14]
I found a new error like the one below after applying this PR. Not sure if someone 
has already reported it or if I missed another PR related to the issue below.

15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 152.0 
(TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 bytes of 
memory
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488)
at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In an aggregation case, a lost task happened with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> 

[jira] [Commented] (SPARK-10726) Using dynamic executor allocation, when jobs are submitted in parallel, executors will be removed before tasks finish

2015-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900583#comment-14900583
 ] 

Apache Spark commented on SPARK-10726:
--

User 'KaiXinXiaoLei' has created a pull request for this issue:
https://github.com/apache/spark/pull/8848

> Using dynamic executor allocation, when jobs are submitted in parallel, 
> executors will be removed before tasks finish
> ---
>
> Key: SPARK-10726
> URL: https://issues.apache.org/jira/browse/SPARK-10726
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: KaiXinXIaoLei
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9883) Distance to each cluster given a point

2015-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900648#comment-14900648
 ] 

Apache Spark commented on SPARK-9883:
-

User 'BertrandDechoux' has created a pull request for this issue:
https://github.com/apache/spark/pull/8849

> Distance to each cluster given a point
> --
>
> Key: SPARK-9883
> URL: https://issues.apache.org/jira/browse/SPARK-9883
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Bertrand Dechoux
>Priority: Minor
>
> Right now KMeansModel provides only a 'predict' method, which returns the 
> index of the closest cluster.
> https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/KMeansModel.html#predict(org.apache.spark.mllib.linalg.Vector)
> It would be nice to have a method giving the distance to all clusters.
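
Until such a method exists, the distances can be computed from the model's public cluster centers; a minimal sketch using Euclidean distance (the helper name is made up for illustration):

{code}
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical helper: distance from one point to every cluster center,
// in the same order as model.clusterCenters.
def distancesToCenters(model: KMeansModel, point: Vector): Array[Double] =
  model.clusterCenters.map(center => math.sqrt(Vectors.sqdist(center, point)))
{code}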



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9883) Distance to each cluster given a point

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9883:
---

Assignee: (was: Apache Spark)

> Distance to each cluster given a point
> --
>
> Key: SPARK-9883
> URL: https://issues.apache.org/jira/browse/SPARK-9883
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Bertrand Dechoux
>Priority: Minor
>
> Right now KMeansModel provides only a 'predict' method, which returns the 
> index of the closest cluster.
> https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/KMeansModel.html#predict(org.apache.spark.mllib.linalg.Vector)
> It would be nice to have a method giving the distance to all clusters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10727) Dataframe count is zero after 'except' operation

2015-09-21 Thread Satish Kolli (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Kolli updated SPARK-10727:
-
Affects Version/s: 1.5.0

> Dataframe count is zero after 'except' operation
> 
>
> Key: SPARK-10727
> URL: https://issues.apache.org/jira/browse/SPARK-10727
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Satish Kolli
>
> Data frame count after the except operation is always returning zero even 
> when there is data in the resulting data frame.
> {code}
> scala> val df1 = sc.parallelize(1 to 5).toDF("V1")
> df1: org.apache.spark.sql.DataFrame = [V1: int]
> scala> val df2 = sc.parallelize(2 to 5).toDF("V2")
> df2: org.apache.spark.sql.DataFrame = [V2: int]
> scala> df1.except(df2).show()
> +---+
> | V1|
> +---+
> |  1|
> +---+
> scala> df1.except(df2).count()
> res4: Long = 0
> scala> df1.except(df2).rdd.count()
> res5: Long = 1
> scala>
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10726) Using dynamic executor allocation, when jobs are submitted in parallel, executors will be removed before tasks finish

2015-09-21 Thread KaiXinXIaoLei (JIRA)
KaiXinXIaoLei created SPARK-10726:
-

 Summary: Using dynamic executor allocation, when jobs are submitted 
in parallel, executors will be removed before tasks finish
 Key: SPARK-10726
 URL: https://issues.apache.org/jira/browse/SPARK-10726
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.0
Reporter: KaiXinXIaoLei
 Fix For: 1.6.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9835) Iteratively reweighted least squares solver for GLMs

2015-09-21 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900694#comment-14900694
 ] 

Mohamed Baddar commented on SPARK-9835:
---

Can I work on this issue?
Thanks

> Iteratively reweighted least squares solver for GLMs
> 
>
> Key: SPARK-9835
> URL: https://issues.apache.org/jira/browse/SPARK-9835
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> After SPARK-9834, we can implement an iteratively reweighted least squares 
> (IRLS) solver for GLMs with other families and link functions. It could 
> provide R-like summary statistics after training, but the number of features 
> cannot be very large (e.g., not more than 4096).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10725) HiveSparkSubmitSuite fails with NoClassDefFoundError

2015-09-21 Thread Josiah Samuel Sathiadass (JIRA)
Josiah Samuel Sathiadass created SPARK-10725:


 Summary: HiveSparkSubmitSuite fails with NoClassDefFoundError
 Key: SPARK-10725
 URL: https://issues.apache.org/jira/browse/SPARK-10725
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
 Environment: IBM Power Servers running Ubuntu 14.04 LE with IBM J9 jdk
Reporter: Josiah Samuel Sathiadass
Priority: Minor


HiveSparkSubmitSuite fails with NoClassDefFoundError as follows,

Exception in thread "main" java.lang.NoClassDefFoundError: 
org.apache.spark.sql.hive.test.TestHiveContext

The TestHiveContext class is part of the sql/hive jar. Since this jar is not in 
the list of dependencies, we hit this error.

Adding the respective jar to the dependency list should allow all the tests under 
HiveSparkSubmitSuite to run successfully.
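
As an illustration only, pulling the Hive support classes onto a test classpath from an sbt-style build would look roughly like the snippet below; the actual fix is a change to Spark's own build for this suite, so treat the module name and scope here as assumptions.

{code}
// Hypothetical sketch: make org.apache.spark.sql.hive.test.TestHiveContext
// visible to tests. The real change may touch Spark's Maven build instead.
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.0" % "test"
{code}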



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10725) HiveSparkSubmitSuite fails with NoClassDefFoundError

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10725:


Assignee: (was: Apache Spark)

> HiveSparkSubmitSuite fails with NoClassDefFoundError
> 
>
> Key: SPARK-10725
> URL: https://issues.apache.org/jira/browse/SPARK-10725
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: IBM Power Servers running Ubuntu 14.04 LE with IBM J9 jdk
>Reporter: Josiah Samuel Sathiadass
>Priority: Minor
>
> HiveSparkSubmitSuite fails with NoClassDefFoundError as follows,
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org.apache.spark.sql.hive.test.TestHiveContext
> The TestHiveContext class is part of the sql/hive jar. Since this jar is not in 
> the list of dependencies, we hit this error.
> Adding the respective jar to the dependency list should allow all the tests under 
> HiveSparkSubmitSuite to run successfully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10725) HiveSparkSubmitSuite fails with NoClassDefFoundError

2015-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900497#comment-14900497
 ] 

Apache Spark commented on SPARK-10725:
--

User 'josiahsams' has created a pull request for this issue:
https://github.com/apache/spark/pull/8847

> HiveSparkSubmitSuite fails with NoClassDefFoundError
> 
>
> Key: SPARK-10725
> URL: https://issues.apache.org/jira/browse/SPARK-10725
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: IBM Power Servers running Ubuntu 14.04 LE with IBM J9 jdk
>Reporter: Josiah Samuel Sathiadass
>Priority: Minor
>
> HiveSparkSubmitSuite fails with NoClassDefFoundError as follows,
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org.apache.spark.sql.hive.test.TestHiveContext
> The TestHiveContext class is part of the sql/hive jar. Since this jar is not in 
> the list of dependencies, we hit this error.
> Adding the respective jar to the dependency list should allow all the tests under 
> HiveSparkSubmitSuite to run successfully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10725) HiveSparkSubmitSuite fails with NoClassDefFoundError

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10725:


Assignee: Apache Spark

> HiveSparkSubmitSuite fails with NoClassDefFoundError
> 
>
> Key: SPARK-10725
> URL: https://issues.apache.org/jira/browse/SPARK-10725
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: IBM Power Servers running Ubuntu 14.04 LE with IBM J9 jdk
>Reporter: Josiah Samuel Sathiadass
>Assignee: Apache Spark
>Priority: Minor
>
> HiveSparkSubmitSuite fails with NoClassDefFoundError as follows,
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org.apache.spark.sql.hive.test.TestHiveContext
> The TestHiveContext class is part of the sql/hive jar. Since this jar is not in 
> the list of dependencies, we hit this error.
> Adding the respective jar to the dependency list should allow all the tests under 
> HiveSparkSubmitSuite to run successfully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10726) Using dynamic-executor-allocation, when jobs are submitted in parallel, executors will be removed before tasks finish

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10726:


Assignee: Apache Spark

> Using dynamic-executor-allocation, when jobs are submitted in parallel, 
> executors will be removed before tasks finish
> ---
>
> Key: SPARK-10726
> URL: https://issues.apache.org/jira/browse/SPARK-10726
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: KaiXinXIaoLei
>Assignee: Apache Spark
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10726) Using dynamic-executor-allocation, when jobs are submitted in parallel, executors will be removed before tasks finish

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10726:


Assignee: (was: Apache Spark)

> Using dynamic-executor-allocation, when jobs are submitted in parallel, 
> executors will be removed before tasks finish
> ---
>
> Key: SPARK-10726
> URL: https://issues.apache.org/jira/browse/SPARK-10726
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: KaiXinXIaoLei
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly

2015-09-21 Thread Jon Buffington (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901609#comment-14901609
 ] 

Jon Buffington commented on SPARK-5569:
---

Cody,

We were able to work around the issue by destructuring the OffsetRange type 
into its parts (i.e., string, long, ints). Below is our flow for reference. Our 
only concern is whether the transform operation runs at the driver, as we use 
a singleton on the driver to persist the last captured offsets. Can you 
confirm that the stream transform runs at the driver?

{noformat}
// Recover or create the stream.
val ssc = StreamingContext.getOrCreate(checkpointPath, () => {
  createStreamingContext(checkpointPath)
})
...
def createStreamingContext(checkpointPath: String): StreamingContext = {
  val ssc = new StreamingContext(SparkConf,  Minutes(1))
  ssc.checkpoint(checkpointPath)
  createStream(ssc)
  ssc
}
...

def createStream(ssc: StreamingContext): Unit = {
...
  KafkaUtils.createDirectStream[K, V, KD, VD, R](ssc, kafkaParams, fromOffsets, 
messageHandler)
// "Note that the typecast to HasOffsetRanges will only succeed if it is
// done in the first method called on the directKafkaStream, not later down
// a chain of methods. You can use transform() instead of foreachRDD() as
// your first method call in order to access offsets, then call further
// Spark methods. However, be aware that the one-to-one mapping between RDD
// partition and Kafka partition does not remain after any methods that
// shuffle or repartition, e.g. reduceByKey() or window()."
.transform { rdd =>
  ...
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  Ranger.captureOffsetRanges(offsetRanges) // De-structure the OffsetRange 
type into its parts and save in a singleton.
  ...
  rdd
}
...
}
{noformat}
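
For readers following along, below is a minimal sketch of what "de-structure the OffsetRange type into its parts and save in a singleton" could look like; the {{Ranger}} object is the poster's own helper, so this version is an assumption rather than their actual code. (For what it's worth, the transform function itself is evaluated on the driver for every batch; only the operations on the RDD it returns run on the executors.)

{code}
import org.apache.spark.streaming.kafka.OffsetRange

// Hypothetical stand-in for the Ranger singleton mentioned above: keep only
// plain (topic, partition, fromOffset, untilOffset) tuples so nothing
// Kafka-specific needs to be serialized into the checkpoint.
object Ranger {
  @volatile private var last: Array[(String, Int, Long, Long)] = Array.empty

  def captureOffsetRanges(ranges: Array[OffsetRange]): Unit = {
    last = ranges.map(r => (r.topic, r.partition, r.fromOffset, r.untilOffset))
  }

  def lastOffsets: Array[(String, Int, Long, Long)] = last
}
{code}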


> Checkpoints cannot reference classes defined outside of Spark's assembly
> 
>
> Key: SPARK-5569
> URL: https://issues.apache.org/jira/browse/SPARK-5569
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Patrick Wendell
>
> Not sure if this is a bug or a feature, but it's not obvious, so wanted to 
> create a JIRA to make sure we document this behavior.
> First documented by Cody Koeninger:
> https://gist.github.com/koeninger/561a61482cd1b5b3600c
> {code}
> 15/01/12 16:07:07 INFO CheckpointReader: Attempting to load checkpoint from 
> file file:/var/tmp/cp/checkpoint-142110041.bk
> 15/01/12 16:07:07 WARN CheckpointReader: Error reading checkpoint from file 
> file:/var/tmp/cp/checkpoint-142110041.bk
> java.io.IOException: java.lang.ClassNotFoundException: 
> org.apache.spark.rdd.kafka.KafkaRDDPartition
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1043)
> at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData.readObject(DStreamCheckpointData.scala:146)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$readObject$1.apply$mcV$sp(DStreamGraph.scala:180)
> at 

[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

2015-09-21 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901632#comment-14901632
 ] 

Yin Huai commented on SPARK-10474:
--

[~jameszhouyi] [~xjrk] What are your workloads? How large is the data (if the 
data is generated, can you share the script for data generation)? What is your 
query? And did you set any Spark conf?
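
If it helps with gathering that information, one convenient way to dump every explicitly-set Spark configuration entry from the running application is sketched below (nothing is assumed here beyond SparkConf.getAll):

{code}
// Print all configuration entries that were explicitly set for this application.
sc.getConf.getAll.sortBy(_._1).foreach { case (k, v) => println(s"$k=$v") }
{code}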

> TungstenAggregation cannot acquire memory for pointer array after switching 
> to sort-based
> -
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.



[jira] [Commented] (SPARK-10310) [Spark SQL] All result records will be populated into ONE line during the script transform due to missing the correct line/field delimiter

2015-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901774#comment-14901774
 ] 

Apache Spark commented on SPARK-10310:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/8860

> [Spark SQL] All result records will be populated into ONE line during the 
> script transform due to missing the correct line/field delimiter
> --
>
> Key: SPARK-10310
> URL: https://issues.apache.org/jira/browse/SPARK-10310
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yi Zhou
>Assignee: zhichao-li
>Priority: Critical
>
> There is a real case using a Python stream script in a Spark SQL query. We found 
> that all result records were written in ONE line as the input from the "select" 
> pipeline to the Python script, so the script cannot identify each 
> record. Also, the field separator in Spark SQL will be '^A' or '\001', which is 
> inconsistent/incompatible with the '\t' used in the Hive implementation.
> Key query:
> {code:sql}
> CREATE VIEW temp1 AS
> SELECT *
> FROM
> (
>   FROM
>   (
> SELECT
>   c.wcs_user_sk,
>   w.wp_type,
>   (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
> FROM web_clickstreams c, web_page w
> WHERE c.wcs_web_page_sk = w.wp_web_page_sk
> AND   c.wcs_web_page_sk IS NOT NULL
> AND   c.wcs_user_sk IS NOT NULL
> AND   c.wcs_sales_skIS NULL --abandoned implies: no sale
> DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
>   ) clicksAnWebPageType
>   REDUCE
> wcs_user_sk,
> tstamp_inSec,
> wp_type
>   USING 'python sessionize.py 3600'
>   AS (
> wp_type STRING,
> tstamp BIGINT, 
> sessionid STRING)
> ) sessionized
> {code}
> Key Python script:
> {noformat}
> for line in sys.stdin:
>  user_sk,  tstamp_str, value  = line.strip().split("\t")
> {noformat}
> Sample SELECT result:
> {noformat}
> ^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
> {noformat}
> Expected result:
> {noformat}
> 31   3237764860   feedback
> 31   3237769106   dynamic
> 31   3237779027   review
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10744) parser error (constant * column is null interpreted as constant * boolean)

2015-09-21 Thread N Campbell (JIRA)
N Campbell created SPARK-10744:
--

 Summary: parser error (constant * column is null interpreted as 
constant * boolean)
 Key: SPARK-10744
 URL: https://issues.apache.org/jira/browse/SPARK-10744
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: N Campbell
Priority: Minor


Spark SQL inherits the same defect as Hive, where this statement will not 
parse/execute. See HIVE-9530.

 select c1 from t1 where 1 * cnnull is null 
-vs- 
 select c1 from t1 where (1 * cnnull) is null 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid

2015-09-21 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901548#comment-14901548
 ] 

Yin Huai commented on SPARK-10304:
--

I dropped 1.5.1 as the target version since this issue does not prevent users 
from creating a df from a dataset with a valid dir structure (the issue here 
is that users can create a df from a dir with an invalid structure). 

> Partition discovery does not throw an exception if the dir structure is 
> invalid
> ---
>
> Key: SPARK-10304
> URL: https://issues.apache.org/jira/browse/SPARK-10304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Zhan Zhang
>
> I have a dir structure like {{/path/table1/partition_column=1/}}. When I try 
> to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if 
> it is stored as ORC, there will be the following NPE. But, if it is Parquet, 
> we can even return rows. We should complain to users about the dir structure 
> because {{table1}} does not meet our expected format.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in 
> stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 
> (TID 3504, 10.0.195.227): java.lang.NullPointerException
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316)
>   at 
> org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10649) Streaming jobs unexpectedly inherits job group, job descriptions from context starting thread

2015-09-21 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-10649:
--
Target Version/s: 1.6.0, 1.5.1  (was: 1.6.0)

> Streaming jobs unexpectedly inherits job group, job descriptions from context 
> starting thread
> -
>
> Key: SPARK-10649
> URL: https://issues.apache.org/jira/browse/SPARK-10649
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 1.6.0
>
>
> The job group and job description information is passed through thread-local 
> properties and gets inherited by child threads. In the case of Spark 
> Streaming, the streaming jobs inherit these properties from the thread that 
> called streamingContext.start(). This may not make sense. 
> 1. Job group: This is mainly used for cancelling a group of jobs together. It 
> does not make sense to cancel streaming jobs like this, as the effect will be 
> unpredictable. It is not a valid use case anyway; to cancel a streaming 
> context, call streamingContext.stop().
> 2. Job description: This is used to pass on nice text descriptions for jobs 
> to show up in the UI. The job description of the thread that calls 
> streamingContext.start() is not useful for all the streaming jobs, as it does 
> not make sense for all of the streaming jobs to have the same description, 
> and the description may or may not be related to streaming.
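
A small, purely illustrative sketch of the behaviour being described (names and intervals are assumptions):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

def startStreaming(sc: SparkContext): StreamingContext = {
  // Thread-local properties set on the thread that starts the context...
  sc.setJobGroup("nightly-etl", "driver-side bootstrap work")
  sc.setJobDescription("bootstrap work before streaming starts")

  val ssc = new StreamingContext(sc, Seconds(10))
  // ... register input DStreams and output operations here ...
  ssc.start()   // ...used to be inherited by every streaming job it launches
  ssc
}
{code}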



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10649) Streaming jobs unexpectedly inherits job group, job descriptions from context starting thread

2015-09-21 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-10649.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

> Streaming jobs unexpectedly inherits job group, job descriptions from context 
> starting thread
> -
>
> Key: SPARK-10649
> URL: https://issues.apache.org/jira/browse/SPARK-10649
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 1.6.0
>
>
> The job group and job description information is passed through thread-local 
> properties and gets inherited by child threads. In the case of Spark 
> Streaming, the streaming jobs inherit these properties from the thread that 
> called streamingContext.start(). This may not make sense. 
> 1. Job group: This is mainly used for cancelling a group of jobs together. It 
> does not make sense to cancel streaming jobs like this, as the effect will be 
> unpredictable. It is not a valid use case anyway; to cancel a streaming 
> context, call streamingContext.stop().
> 2. Job description: This is used to pass on nice text descriptions for jobs 
> to show up in the UI. The job description of the thread that calls 
> streamingContext.start() is not useful for all the streaming jobs, as it does 
> not make sense for all of the streaming jobs to have the same description, 
> and the description may or may not be related to streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10743) keep the name of expression if possible when do cast

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10743:


Assignee: (was: Apache Spark)

> keep the name of expression if possible when do cast
> 
>
> Key: SPARK-10743
> URL: https://issues.apache.org/jira/browse/SPARK-10743
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10743) keep the name of expression if possible when do cast

2015-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901768#comment-14901768
 ] 

Apache Spark commented on SPARK-10743:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/8859

> keep the name of expression if possible when do cast
> 
>
> Key: SPARK-10743
> URL: https://issues.apache.org/jira/browse/SPARK-10743
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10743) keep the name of expression if possible when do cast

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10743:


Assignee: Apache Spark

> keep the name of expression if possible when do cast
> 
>
> Key: SPARK-10743
> URL: https://issues.apache.org/jira/browse/SPARK-10743
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid

2015-09-21 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10304:
-
Target Version/s: 1.6.0  (was: 1.6.0, 1.5.1)

> Partition discovery does not throw an exception if the dir structure is 
> invalid
> ---
>
> Key: SPARK-10304
> URL: https://issues.apache.org/jira/browse/SPARK-10304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Zhan Zhang
>
> I have a dir structure like {{/path/table1/partition_column=1/}}. When I try 
> to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if 
> it is stored as ORC, there will be the following NPE. But, if it is Parquet, 
> we can even return rows. We should complain to users about the dir structure 
> because {{table1}} does not meet our expected format.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in 
> stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 
> (TID 3504, 10.0.195.227): java.lang.NullPointerException
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316)
>   at 
> org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10739) Add attempt window for long running Spark application on Yarn

2015-09-21 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-10739:

Description: 
Currently Spark on YARN uses max attempts to control the failure count: if an 
application's failure count reaches the max attempts, the application will not be 
recovered by the RM. This is not very effective for long-running applications, 
since they can easily exceed the max number over a long period, and setting a very 
large max attempts will hide the real problem.

So here we introduce an attempt window to control the application attempt count; 
attempts that fall outside the window are ignored. This facility was introduced in 
Hadoop 2.6+ to support long-running applications, and it is quite useful for Spark 
Streaming, Spark shell and similar applications.


  was:
Currently Spark on Yarn uses max attempts to control the failure number, if 
application's failure number reaches to the max attempts, application will not 
be recovered by RM, it is not very for long running applications, since it will 
easily exceed the max number, also setting a very large max attempts will hide 
the real problem.

So here introduce an attempt window to control the application attempt times, 
this will ignore the out of window attempts, it is introduced in Hadoop 2.6+ to 
support long running application, it is quite useful for Spark Streaming, Spark 
shell like applications.



> Add attempt window for long running Spark application on Yarn
> -
>
> Key: SPARK-10739
> URL: https://issues.apache.org/jira/browse/SPARK-10739
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently Spark on YARN uses max attempts to control the failure count: if an 
> application's failure count reaches the max attempts, the application will not 
> be recovered by the RM. This is not very effective for long-running 
> applications, since they can easily exceed the max number over a long period, 
> and setting a very large max attempts will hide the real problem.
> So here we introduce an attempt window to control the application attempt count; 
> attempts that fall outside the window are ignored. This facility was introduced 
> in Hadoop 2.6+ to support long-running applications, and it is quite useful for 
> Spark Streaming, Spark shell and similar applications.
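
For context, the Hadoop 2.6+ hook referred to here is the attempt-failures validity interval on the application submission context. A rough sketch of how a YARN client could use it is below; the configuration name and wiring that the eventual Spark change uses are not decided by this sketch, so treat it purely as an assumption.

{code}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext

// Hypothetical wiring: only attempt failures within the given window count
// towards the maximum-attempts limit (API available since Hadoop 2.6).
def applyAttemptWindow(appContext: ApplicationSubmissionContext, windowMs: Long): Unit = {
  appContext.setAttemptFailuresValidityInterval(windowMs)
}

// e.g. applyAttemptWindow(appContext, 60L * 60 * 1000)   // one hour
{code}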



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10649) Streaming jobs unexpectedly inherits job group, job descriptions from context starting thread

2015-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901645#comment-14901645
 ] 

Apache Spark commented on SPARK-10649:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/8856

> Streaming jobs unexpectedly inherits job group, job descriptions from context 
> starting thread
> -
>
> Key: SPARK-10649
> URL: https://issues.apache.org/jira/browse/SPARK-10649
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 1.6.0
>
>
> The job group and job description information is passed through thread-local 
> properties and gets inherited by child threads. In the case of Spark 
> Streaming, the streaming jobs inherit these properties from the thread that 
> called streamingContext.start(). This may not make sense. 
> 1. Job group: This is mainly used for cancelling a group of jobs together. It 
> does not make sense to cancel streaming jobs like this, as the effect will be 
> unpredictable. It is not a valid use case anyway; to cancel a streaming 
> context, call streamingContext.stop().
> 2. Job description: This is used to pass on nice text descriptions for jobs 
> to show up in the UI. The job description of the thread that calls 
> streamingContext.start() is not useful for all the streaming jobs, as it does 
> not make sense for all of the streaming jobs to have the same description, 
> and the description may or may not be related to streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10738) Refactoring `Instance` out from LOR and LIR, and also cleaning up some code

2015-09-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10738:
--
Shepherd: Xiangrui Meng

> Refactoring `Instance` out from LOR and LIR, and also cleaning up some code
> ---
>
> Key: SPARK-10738
> URL: https://issues.apache.org/jira/browse/SPARK-10738
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: DB Tsai
>Assignee: DB Tsai
> Fix For: 1.6.0
>
>
> Refactoring `Instance` case class out from LOR and LIR, and also cleaning up 
> some code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10692) Failed batches are never reported through the StreamingListener interface

2015-09-21 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901695#comment-14901695
 ] 

Tathagata Das commented on SPARK-10692:
---

Yes, that is an independent problem; we need to think about that separately. 
For now, at the least, we need to expose them through StreamingListener so that apps 
that do not immediately exit on one failure can generate the streaming UI 
correctly.
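
For apps that keep running after a failed batch, a StreamingListener registration is the hook in question; a minimal sketch of what such a listener looks like today is below (what extra failure information it will eventually see is exactly what this ticket is about):

{code}
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Log what the listener currently observes for each batch; once failed batches
// are reported properly, this is where an application would notice them.
class BatchStatusLogger extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"batch ${info.batchTime}: ${info.numRecords} records, " +
      s"processing delay ${info.processingDelay.getOrElse(-1L)} ms")
  }
}

def register(ssc: StreamingContext): Unit = ssc.addStreamingListener(new BatchStatusLogger)
{code}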

> Failed batches are never reported through the StreamingListener interface
> -
>
> Key: SPARK-10692
> URL: https://issues.apache.org/jira/browse/SPARK-10692
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Tathagata Das
>Assignee: Shixiong Zhu
>Priority: Blocker
>
> If an output operation fails, then corresponding batch is never marked as 
> completed, as the data structure are not updated properly.
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L183



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

2015-09-21 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901694#comment-14901694
 ] 

Andrew Or commented on SPARK-10474:
---

Also, would you mind testing out this commit to see if it fixes your problem?
https://github.com/andrewor14/spark/commits/aggregate-test

> TungstenAggregation cannot acquire memory for pointer array after switching 
> to sort-based
> -
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-10723) Add RDD.reduceOption method

2015-09-21 Thread Tatsuya Atsumi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901723#comment-14901723
 ] 

Tatsuya Atsumi commented on SPARK-10723:


Thanks for the comment and advice.
RDD.fold seems to be suitable for my case.
Thank you.
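
For anyone else landing here, a small sketch of the two alternatives mentioned in this thread (the setup is illustrative only):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc  = new SparkContext(new SparkConf().setAppName("reduce-option-sketch").setMaster("local[*]"))
val rdd = sc.parallelize(1 to 5).filter(_ > 10)   // may well be empty at runtime

// Option 1: guard with isEmpty to get an Option back
val viaIsEmpty: Option[Int] = if (rdd.isEmpty()) None else Some(rdd.reduce(_ + _))

// Option 2: fold with a neutral element (returns the zero value for an empty RDD)
val viaFold: Int = rdd.fold(0)(_ + _)
{code}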

> Add RDD.reduceOption method
> ---
>
> Key: SPARK-10723
> URL: https://issues.apache.org/jira/browse/SPARK-10723
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Reporter: Tatsuya Atsumi
>Priority: Minor
>
> h2. Problem
> RDD.reduce throws exception if the RDD is empty.
> It is appropriate behavior if RDD is expected to be not empty, but if it is 
> not sure until runtime that the RDD is empty or not, it needs to wrap with 
> try-catch to call reduce safely. 
> Example Code
> {code}
> // This RDD may be empty or not
> val rdd: RDD[Int] = originalRdd.filter(_ > 10)
> val reduced: Option[Int] = try {
>   Some(rdd.reduce(_ + _))
> } catch {
>   // if rdd is empty return None.
>   case e:UnsupportedOperationException => None
> }
> {code}
> h2. Improvement idea
> Scala’s List has reduceOption method, which returns None if List is empty.
> If RDD has reduceOption API like Scala’s List, it will become easy to handle 
> above case.
> Example Code
> {code}
> val reduced: Option[Int] = originalRdd.filter(_ > 10).reduceOption(_ + _)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-21 Thread Ian (JIRA)
Ian created SPARK-10741:
---

 Summary: Hive Query Having/OrderBy against Parquet table is not 
working 
 Key: SPARK-10741
 URL: https://issues.apache.org/jira/browse/SPARK-10741
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Ian


Failed Query with Having Clause
  def testParquetHaving() {
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val failedHaving =
  """ SELECT c1, avg ( c2 ) as c_avg
| FROM test
| GROUP BY c1
| HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
TestHive.sql(ddl)
TestHive.sql(failedHaving).collect
  }

org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
bigint)) > cast(5 as double)) as boolean) AS 
havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
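
Until this is resolved, one possible workaround sketch (a suggestion, not from the report) is to push the aggregate into a subquery so that the outer HAVING/ORDER BY only references the alias:

{code}
// Hypothetical workaround for the analysis failure above, using the same `test` table.
val workaround =
  """ SELECT c1, c_avg
    |   FROM ( SELECT c1, avg(c2) AS c_avg FROM test GROUP BY c1 ) t
    |  WHERE c_avg > 5
    |  ORDER BY c1""".stripMargin
TestHive.sql(workaround).collect
{code}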





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid

2015-09-21 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10304:
-
Priority: Major  (was: Critical)

> Partition discovery does not throw an exception if the dir structure is 
> invalid
> ---
>
> Key: SPARK-10304
> URL: https://issues.apache.org/jira/browse/SPARK-10304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Zhan Zhang
>
> I have a dir structure like {{/path/table1/partition_column=1/}}. When I try 
> to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if 
> it is stored as ORC, there will be the following NPE. But, if it is Parquet, 
> we can even return rows. We should complain to users about the dir structure 
> because {{table1}} does not meet our expected format.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in 
> stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 
> (TID 3504, 10.0.195.227): java.lang.NullPointerException
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316)
>   at 
> org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10740) handle nondeterministic expressions correctly for set operations

2015-09-21 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-10740:
---

 Summary: handle nondeterministic expressions correctly for set 
operations
 Key: SPARK-10740
 URL: https://issues.apache.org/jira/browse/SPARK-10740
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10739) Add attempt window for long running Spark application on Yarn

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10739:


Assignee: Apache Spark

> Add attempt window for long running Spark application on Yarn
> -
>
> Key: SPARK-10739
> URL: https://issues.apache.org/jira/browse/SPARK-10739
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Saisai Shao
>Assignee: Apache Spark
>Priority: Minor
>
> Currently Spark on YARN uses max attempts to control the failure count: if an 
> application's failure count reaches the max attempts, the application will 
> not be recovered by the RM. This is not very effective for long-running 
> applications, since they can easily exceed the max number over a long period, 
> and setting a very large max attempts will hide the real problem.
> So here we introduce an attempt window to control the application attempt count; 
> attempts that fall outside the window are ignored. This facility was introduced 
> in Hadoop 2.6+ to support long-running applications, and it is quite useful for 
> Spark Streaming, Spark shell and similar applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10495) For json data source, date values are saved as int strings

2015-09-21 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-10495.

   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8806
[https://github.com/apache/spark/pull/8806]

> For json data source, date values are saved as int strings
> --
>
> Key: SPARK-10495
> URL: https://issues.apache.org/jira/browse/SPARK-10495
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> {code}
> val df = Seq((1, java.sql.Date.valueOf("1900-01-01"))).toDF("i", "j")
> df.write.format("json").save("/tmp/testJson")
> sc.textFile("/tmp/testJson").collect.foreach(println)
> {"i":1,"j":"-25567"}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10747) add support for window specification to include how NULLS are ordered

2015-09-21 Thread N Campbell (JIRA)
N Campbell created SPARK-10747:
--

 Summary: add support for window specification to include how NULLS 
are ordered
 Key: SPARK-10747
 URL: https://issues.apache.org/jira/browse/SPARK-10747
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.0
Reporter: N Campbell


You cannot express how NULLs are to be sorted in the window order specification 
and have to use a compensating expression to simulate it.

Error: org.apache.spark.sql.AnalysisException: line 1:76 missing ) at 'nulls' 
near 'nulls'
line 1:82 missing EOF at 'last' near 'nulls';
SQLState:  null

Same limitation as Hive, reported in Apache JIRA HIVE-9535.

This fails:
select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by c3 desc 
nulls last) from tolap

This is the compensating expression that has to be used instead:
select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by case when 
c3 is null then 1 else 0 end) from tolap



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-21 Thread Ian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian updated SPARK-10741:

Description: 
Failed Query with Having Clause
{code}
  def testParquetHaving() {
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val failedHaving =
  """ SELECT c1, avg ( c2 ) as c_avg
| FROM test
| GROUP BY c1
| HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
TestHive.sql(ddl)
TestHive.sql(failedHaving).collect
  }

org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
bigint)) > cast(5 as double)) as boolean) AS 
havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)

{code}

Failed Query with OrderBy
{code}
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val filedOrderBy =
  """ SELECT c1, avg ( c2 ) c_avg
| FROM test
| GROUP BY c1
| ORDER BY avg ( c2 )""".stripMargin
TestHive.sql(ddl)
TestHive.sql(filedOrderBy).collect

org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) AS 
aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
{code}

  was:
Failed Query with Having Clause
{code}
  def testParquetHaving() {
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val failedHaving =
  """ SELECT c1, avg ( c2 ) as c_avg
| FROM test
| GROUP BY c1
| HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
TestHive.sql(ddl)
TestHive.sql(failedHaving).collect
  }
{code}
org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
bigint)) > cast(5 as double)) as boolean) AS 
havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)




> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedOrderBy =
>   """ SELECT c1, avg ( c2 ) c_avg
> | FROM test
> | GROUP BY c1
> 

[jira] [Updated] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-21 Thread Ian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian updated SPARK-10741:

Description: 
Failed Query with Having Clause
{code}
  def testParquetHaving() {
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val failedHaving =
  """ SELECT c1, avg ( c2 ) as c_avg
| FROM test
| GROUP BY c1
| HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
TestHive.sql(ddl)
TestHive.sql(failedHaving).collect
  }
{code}
org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
bigint)) > cast(5 as double)) as boolean) AS 
havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)



  was:
Failed Query with Having Clause
  def testParquetHaving() {
val ddl =
  """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
PARQUET"""
val failedHaving =
  """ SELECT c1, avg ( c2 ) as c_avg
| FROM test
| GROUP BY c1
| HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
TestHive.sql(ddl)
TestHive.sql(failedHaving).collect
  }

org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
bigint)) > cast(5 as double)) as boolean) AS 
havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)




> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> {code}
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10742) Add the ability to embed HTML relative links in job descriptions

2015-09-21 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-10742:
-

 Summary: Add the ability to embed HTML relative links in job 
descriptions
 Key: SPARK-10742
 URL: https://issues.apache.org/jira/browse/SPARK-10742
 Project: Spark
  Issue Type: Improvement
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Minor


This is to allow embedding links to other Spark UI tabs within the job 
description. For example, streaming jobs could set descriptions with links 
pointing to the corresponding details page of the batch that the job belongs to.
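As a rough illustration of the intended use (a sketch built on the existing 
SparkContext.setJobDescription call; the relative URL below is a hypothetical example, 
not a guaranteed UI route):

{code}
// Sketch: a streaming job embeds a relative link back to its batch details
// page in the description it sets before submitting the batch's jobs.
sc.setJobDescription(
  "Streaming job for batch 1442861100000 ms " +
  "<a href=\"/streaming/batch/?id=1442861100000\">(view batch details)</a>")
{code}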



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10495) For json data source, date values are saved as int strings

2015-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901847#comment-14901847
 ] 

Apache Spark commented on SPARK-10495:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/8861

> For json data source, date values are saved as int strings
> --
>
> Key: SPARK-10495
> URL: https://issues.apache.org/jira/browse/SPARK-10495
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> {code}
> val df = Seq((1, java.sql.Date.valueOf("1900-01-01"))).toDF("i", "j")
> df.write.format("json").save("/tmp/testJson")
> sc.textFile("/tmp/testJson").collect.foreach(println)
> {"i":1,"j":"-25567"}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10740) handle nondeterministic expressions correctly for set operations

2015-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901676#comment-14901676
 ] 

Apache Spark commented on SPARK-10740:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/8858

> handle nondeterministic expressions correctly for set operations
> 
>
> Key: SPARK-10740
> URL: https://issues.apache.org/jira/browse/SPARK-10740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>
> We should only push down deterministic filter conditions to set operators.
> For Union, let's say we do a non-deterministic filter on 1...5 union 1...5, 
> and we may get 1,3 for the left side and 2,4 for the right side; then the 
> result should be 1,3,2,4. If we push down this filter, we get 1,3 for both 
> sides (we create a new random object with the same seed on each side) and the 
> result would be 1,3,1,3.
> For Intersect, let's say there is a non-deterministic condition with a 0.5 
> probability of accepting a row, and we have a row that is present on both 
> sides of an Intersect. Once we push down this condition, the probability of 
> accepting this row drops to 0.25.
> For Except, let's say there is a row that is present on both sides of an 
> Except. This row should not be in the final output. However, if we push down 
> a non-deterministic condition, it is possible that this row is rejected from 
> one side and then we output a row that should not be part of the result.
> We should only push down deterministic projections to Union.
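A minimal illustration of the Union hazard described above (my own sketch for clarity, 
not the optimizer change itself; it assumes a SQLContext named sqlContext):

{code}
import org.apache.spark.sql.functions.rand

// Two identical sides of a union, each holding the values 1..5.
val left  = sqlContext.range(1, 6)
val right = sqlContext.range(1, 6)

// Filtering AFTER the union evaluates rand(42) once per output row, so the
// surviving rows on the two sides are generally different.
val filteredAboveUnion = left.unionAll(right).filter(rand(42) > 0.5)

// If the filter were pushed below the union, each side would evaluate a fresh
// rand(42) from the same seed and could keep the same subset on both sides.
val filteredBelowUnion =
  left.filter(rand(42) > 0.5).unionAll(right.filter(rand(42) > 0.5))
{code}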



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10740) handle nondeterministic expressions correctly for set operations

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10740:


Assignee: (was: Apache Spark)

> handle nondeterministic expressions correctly for set operations
> 
>
> Key: SPARK-10740
> URL: https://issues.apache.org/jira/browse/SPARK-10740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>
> We should only push down deterministic filter conditions to set operators.
> For Union, let's say we do a non-deterministic filter on 1...5 union 1...5, 
> and we may get 1,3 for the left side and 2,4 for the right side; then the 
> result should be 1,3,2,4. If we push down this filter, we get 1,3 for both 
> sides (we create a new random object with the same seed on each side) and the 
> result would be 1,3,1,3.
> For Intersect, let's say there is a non-deterministic condition with a 0.5 
> probability of accepting a row, and we have a row that is present on both 
> sides of an Intersect. Once we push down this condition, the probability of 
> accepting this row drops to 0.25.
> For Except, let's say there is a row that is present on both sides of an 
> Except. This row should not be in the final output. However, if we push down 
> a non-deterministic condition, it is possible that this row is rejected from 
> one side and then we output a row that should not be part of the result.
> We should only push down deterministic projections to Union.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10740) handle nondeterministic expressions correctly for set operations

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10740:


Assignee: Apache Spark

> handle nondeterministic expressions correctly for set operations
> 
>
> Key: SPARK-10740
> URL: https://issues.apache.org/jira/browse/SPARK-10740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>
> We should only push down deterministic filter conditions to set operators.
> For Union, let's say we do a non-deterministic filter on 1...5 union 1...5, 
> and we may get 1,3 for the left side and 2,4 for the right side; then the 
> result should be 1,3,2,4. If we push down this filter, we get 1,3 for both 
> sides (we create a new random object with the same seed on each side) and the 
> result would be 1,3,1,3.
> For Intersect, let's say there is a non-deterministic condition with a 0.5 
> probability of accepting a row, and we have a row that is present on both 
> sides of an Intersect. Once we push down this condition, the probability of 
> accepting this row drops to 0.25.
> For Except, let's say there is a row that is present on both sides of an 
> Except. This row should not be in the final output. However, if we push down 
> a non-deterministic condition, it is possible that this row is rejected from 
> one side and then we output a row that should not be part of the result.
> We should only push down deterministic projections to Union.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10652) Set meaningful job descriptions for streaming related jobs

2015-09-21 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-10652:
--
Target Version/s: 1.6.0, 1.5.1  (was: 1.6.0)

> Set meaningful job descriptions for streaming related jobs
> --
>
> Key: SPARK-10652
> URL: https://issues.apache.org/jira/browse/SPARK-10652
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Job descriptions will help distinguish jobs of one batch from the other in 
> the Jobs and Stages pages in the Spark UI



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10652) Set meaningful job descriptions for streaming related jobs

2015-09-21 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-10652:
--
Summary: Set meaningful job descriptions for streaming related jobs  (was: 
Set good job descriptions for streaming related jobs)

> Set meaningful job descriptions for streaming related jobs
> --
>
> Key: SPARK-10652
> URL: https://issues.apache.org/jira/browse/SPARK-10652
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Job descriptions will help distinguish jobs of one batch from the other in 
> the Jobs and Stages pages in the Spark UI



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10742) Add the ability to embed HTML relative links in job descriptions

2015-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901763#comment-14901763
 ] 

Apache Spark commented on SPARK-10742:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/8791

> Add the ability to embed HTML relative links in job descriptions
> 
>
> Key: SPARK-10742
> URL: https://issues.apache.org/jira/browse/SPARK-10742
> Project: Spark
>  Issue Type: Improvement
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Minor
>
> This is to allow embedding links to other Spark UI tabs within the job 
> description. For example, streaming jobs could set descriptions with links 
> pointing to the corresponding details page of the batch that the job belongs 
> to.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10742) Add the ability to embed HTML relative links in job descriptions

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10742:


Assignee: Tathagata Das  (was: Apache Spark)

> Add the ability to embed HTML relative links in job descriptions
> 
>
> Key: SPARK-10742
> URL: https://issues.apache.org/jira/browse/SPARK-10742
> Project: Spark
>  Issue Type: Improvement
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Minor
>
> This is to allow embedding links to other Spark UI tabs within the job 
> description. For example, streaming jobs could set descriptions with links 
> pointing to the corresponding details page of the batch that the job belongs 
> to.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10737) When using UnsafeRows, SortMergeJoin may return wrong results

2015-09-21 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10737:
-
Summary: When using UnsafeRows, SortMergeJoin may return wrong results  
(was: When using UnsafeRow, SortMergeJoin may return wrong results)

> When using UnsafeRows, SortMergeJoin may return wrong results
> -
>
> Key: SPARK-10737
> URL: https://issues.apache.org/jira/browse/SPARK-10737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
>
> {code}
> val df1 = (1 to 10).map(i => (s"str_$i", i)).toDF("i", "j")
> val df2 =
>   df1
>   .join(df1.select(df1("i")), "i")
>   .select(df1("i"), df1("j"))
> val df3 = df2.withColumnRenamed("i", "i1").withColumnRenamed("j", "j1")
> val df4 =
>   df2
>   .join(df3, df2("i") === df3("i1"))
>   .withColumn("diff", $"j" - $"j1")
> df4.show(100, false)
> +------+---+------+---+----+
> |i     |j  |i1    |j1 |diff|
> +------+---+------+---+----+
> |str_2 |2  |str_2 |2  |0   |
> |str_7 |7  |str_2 |2  |5   |
> |str_10|10 |str_10|10 |0   |
> |str_3 |3  |str_3 |3  |0   |
> |str_8 |8  |str_3 |3  |5   |
> |str_4 |4  |str_4 |4  |0   |
> |str_9 |9  |str_4 |4  |5   |
> |str_5 |5  |str_5 |5  |0   |
> |str_1 |1  |str_1 |1  |0   |
> |str_6 |6  |str_1 |1  |5   |
> +------+---+------+---+----+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10737) When using UnsafeRows, SortMergeJoin may return wrong results

2015-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10737:


Assignee: Apache Spark  (was: Yin Huai)

> When using UnsafeRows, SortMergeJoin may return wrong results
> -
>
> Key: SPARK-10737
> URL: https://issues.apache.org/jira/browse/SPARK-10737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Apache Spark
>Priority: Blocker
>
> {code}
> val df1 = (1 to 10).map(i => (s"str_$i", i)).toDF("i", "j")
> val df2 =
>   df1
>   .join(df1.select(df1("i")), "i")
>   .select(df1("i"), df1("j"))
> val df3 = df2.withColumnRenamed("i", "i1").withColumnRenamed("j", "j1")
> val df4 =
>   df2
>   .join(df3, df2("i") === df3("i1"))
>   .withColumn("diff", $"j" - $"j1")
> df4.show(100, false)
> +------+---+------+---+----+
> |i     |j  |i1    |j1 |diff|
> +------+---+------+---+----+
> |str_2 |2  |str_2 |2  |0   |
> |str_7 |7  |str_2 |2  |5   |
> |str_10|10 |str_10|10 |0   |
> |str_3 |3  |str_3 |3  |0   |
> |str_8 |8  |str_3 |3  |5   |
> |str_4 |4  |str_4 |4  |0   |
> |str_9 |9  |str_4 |4  |5   |
> |str_5 |5  |str_5 |5  |0   |
> |str_1 |1  |str_1 |1  |0   |
> |str_6 |6  |str_1 |1  |5   |
> +------+---+------+---+----+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10737) When using UnsafeRows, SortMergeJoin may return wrong results

2015-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901499#comment-14901499
 ] 

Apache Spark commented on SPARK-10737:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/8854

> When using UnsafeRows, SortMergeJoin may return wrong results
> -
>
> Key: SPARK-10737
> URL: https://issues.apache.org/jira/browse/SPARK-10737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
>
> {code}
> val df1 = (1 to 10).map(i => (s"str_$i", i)).toDF("i", "j")
> val df2 =
>   df1
>   .join(df1.select(df1("i")), "i")
>   .select(df1("i"), df1("j"))
> val df3 = df2.withColumnRenamed("i", "i1").withColumnRenamed("j", "j1")
> val df4 =
>   df2
>   .join(df3, df2("i") === df3("i1"))
>   .withColumn("diff", $"j" - $"j1")
> df4.show(100, false)
> +------+---+------+---+----+
> |i     |j  |i1    |j1 |diff|
> +------+---+------+---+----+
> |str_2 |2  |str_2 |2  |0   |
> |str_7 |7  |str_2 |2  |5   |
> |str_10|10 |str_10|10 |0   |
> |str_3 |3  |str_3 |3  |0   |
> |str_8 |8  |str_3 |3  |5   |
> |str_4 |4  |str_4 |4  |0   |
> |str_9 |9  |str_4 |4  |5   |
> |str_5 |5  |str_5 |5  |0   |
> |str_1 |1  |str_1 |1  |0   |
> |str_6 |6  |str_1 |1  |5   |
> +------+---+------+---+----+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10718) Update License on conf files and corresponding excludes file

2015-09-21 Thread Rekha Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rekha Joshi updated SPARK-10718:

Summary: Update License on conf files and corresponding excludes file  
(was: Check License should not verify conf files for license)

> Update License on conf files and corresponding excludes file
> 
>
> Key: SPARK-10718
> URL: https://issues.apache.org/jira/browse/SPARK-10718
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Rekha Joshi
>Priority: Minor
>
> Check License should not verify conf files for license
> {code}
> Apache license header missing from multiple script and required files
> Could not find Apache license headers in the following files:
>  !? <>spark/conf/spark-defaults.conf
> [error] running <>spark/dev/check-license ; received return code 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10718) Update License on conf files and corresponding excludes file

2015-09-21 Thread Rekha Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rekha Joshi updated SPARK-10718:

Description: 
Update the license header on conf files and update the corresponding excludes file.
{code}
Apache license header missing from multiple script and required files
Could not find Apache license headers in the following files:
 !? <>spark/conf/spark-defaults.conf
[error] running <>spark/dev/check-license ; received return code 1
{code}

  was:
Check License should not verify conf files for license
{code}
Apache license header missing from multiple script and required files
Could not find Apache license headers in the following files:
 !? <>spark/conf/spark-defaults.conf
[error] running <>spark/dev/check-license ; received return code 1
{code}


> Update License on conf files and corresponding excludes file
> 
>
> Key: SPARK-10718
> URL: https://issues.apache.org/jira/browse/SPARK-10718
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Rekha Joshi
>Priority: Minor
>
> Update the license header on conf files and update the corresponding excludes file.
> {code}
> Apache license header missing from multiple script and required files
> Could not find Apache license headers in the following files:
>  !? <>spark/conf/spark-defaults.conf
> [error] running <>spark/dev/check-license ; received return code 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10739) Add attempt window for long running Spark application on Yarn

2015-09-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901866#comment-14901866
 ] 

Sean Owen commented on SPARK-10739:
---

I'm sure we've discussed this one before and that there's a JIRA for it ... but 
can't for the life of me find it. I feel like [~sandyr] or [~vanzin] commented 
on it. the question was how long back you looked when considering if "a lot" of 
failures had occurred, etc.

> Add attempt window for long running Spark application on Yarn
> -
>
> Key: SPARK-10739
> URL: https://issues.apache.org/jira/browse/SPARK-10739
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently Spark on YARN uses max attempts to control the failure count: if an 
> application's failure count reaches the max attempts, the application will not 
> be recovered by the RM. This is not very effective for long-running 
> applications, since they can easily exceed the max number over a long time 
> period, and setting a very large max attempts only hides the real problem.
> So this introduces an attempt window to bound which application attempts are 
> counted; attempts that fall outside the window are ignored. Such a window was 
> introduced in Hadoop 2.6+ to support long-running applications, and it is quite 
> useful for Spark Streaming, Spark shell and similar applications.
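For reference, a sketch of the Hadoop 2.6+ hook this would presumably build on (an 
assumption about the wiring, not the actual Spark change):

{code}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext

// Sketch: once a validity interval is set, only AM failures that happened
// inside the window count against the max-attempts limit.
def applyAttemptWindow(appContext: ApplicationSubmissionContext, windowMs: Long): Unit = {
  appContext.setAttemptFailuresValidityInterval(windowMs) // e.g. 3600000L for one hour
}
{code}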



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based

2015-09-21 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901951#comment-14901951
 ] 

Yi Zhou commented on SPARK-10733:
-

Key SQL Query:
{code:sql}
INSERT INTO TABLE test_table
SELECT
  ss.ss_customer_sk AS cid,
  count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
  count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
  count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
  count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
  count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
  count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
  count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
  count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
  count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
  count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
  count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
  count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
  count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
  count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
  count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
FROM store_sales ss
INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
WHERE i.i_category IN ('Books')
AND ss.ss_customer_sk IS NOT NULL
GROUP BY ss.ss_customer_sk
HAVING count(ss.ss_item_sk) > 5
{code}

Note:
the store_sales is a big fact table and item is a small dimension table.

> TungstenAggregation cannot acquire page after switching to sort-based
> -
>
> Key: SPARK-10733
> URL: https://issues.apache.org/jira/browse/SPARK-10733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> This is uncovered after fixing SPARK-10474. Stack trace:
> {code}
> 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 
> 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 
> bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10668) Use WeightedLeastSquares in LinearRegression with L2 regularization if the number of features is small

2015-09-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10668:
--
Assignee: Kai Sasaki

> Use WeightedLeastSquares in LinearRegression with L2 regularization if the 
> number of features is small
> --
>
> Key: SPARK-10668
> URL: https://issues.apache.org/jira/browse/SPARK-10668
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Kai Sasaki
>Priority: Critical
>
> If the number of features is small (<=4096) and the regularization is L2, we 
> should use WeightedLeastSquares to solve the problem rather than L-BFGS. The 
> former requires only one pass over the data.
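A sketch of the proposed dispatch condition (an assumption about the eventual check, 
not the actual LinearRegression code):

{code}
// Use the one-pass normal-equation solver (WeightedLeastSquares) only when the
// feature dimension is small and the penalty is pure L2 (no L1 component).
def useNormalEquationSolver(numFeatures: Int, elasticNetParam: Double): Boolean =
  numFeatures <= 4096 && elasticNetParam == 0.0
{code}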



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10691) Make LogisticRegressionModel's evaluate method public

2015-09-21 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900818#comment-14900818
 ] 

Xiangrui Meng commented on SPARK-10691:
---

Another option is `score`, following scikit-learn. I don't have strong 
preference between the two. But I'm thinking about how to make the output 
metrics configurable.

> Make LogisticRegressionModel's evaluate method public
> -
>
> Key: SPARK-10691
> URL: https://issues.apache.org/jira/browse/SPARK-10691
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Hao Ren
>
> The following method in {{LogisticRegressionModel}} is marked as {{private}}, 
> which prevents users from creating a summary on any given data set. Check 
> [here|https://github.com/feynmanliang/spark/blob/d219fa4c216e8f35b71a26921561104d15cd6055/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala#L272].
> {code}
> // TODO: decide on a good name before exposing to public API
> private[classification] def evaluate(dataset: DataFrame)
> : LogisticRegressionSummary = {
> new BinaryLogisticRegressionSummary(
> this.transform(dataset), 
> $(probabilityCol), 
> $(labelCol))
> }
> {code}
> This method is definitely necessary to test model performance.
> By the way, the name {{evaluate}} is already pretty good for me.
> [~mengxr] Could you check this ? Thx



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10728) Failed to set Jenkins Identity header on email.

2015-09-21 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-10728:
-

 Summary: Failed to set Jenkins Identity header on email.
 Key: SPARK-10728
 URL: https://issues.apache.org/jira/browse/SPARK-10728
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.6.0
Reporter: Xiangrui Meng
Assignee: Josh Rosen


Saw a couple of Jenkins build failures due to "Failed to set Jenkins Identity 
header on email", e.g.,

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3572/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/consoleFull

{code}
[error] running 
/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/build/sbt
 -Phadoop-1 -Dhadoop.version=2.0.0-mr1-cdh4.1.1 -Phive -Pkinesis-asl 
-Phive-thriftserver test ; received return code 143
Build step 'Execute shell' marked build as failure
Archiving artifacts
Recording test results
ERROR: Failed to set Jenkins Identity header on email.
java.lang.NullPointerException
at 
org.jenkinsci.main.modules.instance_identity.InstanceIdentity.get(InstanceIdentity.java:126)
at 
jenkins.plugins.mailer.tasks.MimeMessageBuilder.setJenkinsInstanceIdent(MimeMessageBuilder.java:188)
at 
jenkins.plugins.mailer.tasks.MimeMessageBuilder.buildMimeMessage(MimeMessageBuilder.java:166)
at hudson.tasks.MailSender.createEmptyMail(MailSender.java:391)
at hudson.tasks.MailSender.createFailureMail(MailSender.java:260)
at hudson.tasks.MailSender.createMail(MailSender.java:178)
at hudson.tasks.MailSender.run(MailSender.java:107)
at hudson.tasks.Mailer.perform(Mailer.java:141)
at 
hudson.tasks.BuildStepCompatibilityLayer.perform(BuildStepCompatibilityLayer.java:75)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:779)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:726)
at hudson.model.Build$BuildExecution.post2(Build.java:185)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:671)
at hudson.model.Run.execute(Run.java:1766)
at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
at hudson.model.ResourceController.execute(ResourceController.java:98)
at hudson.model.Executor.run(Executor.java:408)
Sending e-mails to: spark-bu...@databricks.com rosenvi...@gmail.com
Finished: FAILURE
{code}

The workaround documented on https://issues.jenkins-ci.org/browse/JENKINS-26740 
is to downgrade mailer to 1.12.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10668) Use WeightedLeastSquares in LinearRegression with L2 regularization if the number of features is small

2015-09-21 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900807#comment-14900807
 ] 

Xiangrui Meng commented on SPARK-10668:
---

[~lewuathe] This JIRA blocks several others. Could you try to submit a PR in 2 
days?

> Use WeightedLeastSquares in LinearRegression with L2 regularization if the 
> number of features is small
> --
>
> Key: SPARK-10668
> URL: https://issues.apache.org/jira/browse/SPARK-10668
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> If the number of features is small (<=4096) and the regularization is L2, we 
> should use WeightedLeastSquares to solve the problem rather than L-BFGS. The 
> former requires only one pass over the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10688) Python API for AFTSurvivalRegression

2015-09-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10688:
--
Assignee: (was: Yanbo Liang)

> Python API for AFTSurvivalRegression
> 
>
> Key: SPARK-10688
> URL: https://issues.apache.org/jira/browse/SPARK-10688
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Xiangrui Meng
>  Labels: starter
>
> After SPARK-10686, we should add Python API for AFTSurvivalRegression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9301) collect_set and collect_list aggregate functions

2015-09-21 Thread Nick Buroojy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900823#comment-14900823
 ] 

Nick Buroojy commented on SPARK-9301:
-

I sent a pull request to add these aggregates on the new api; however, I now 
see that this may be blocked by SPARK-9830 
(https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14728451=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14728451).

Let me know if the next step on this is to wait for the blocking change.


> collect_set and collect_list aggregate functions
> 
>
> Key: SPARK-9301
> URL: https://issues.apache.org/jira/browse/SPARK-9301
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> A short introduction on how to build aggregate functions based on our new 
> interface can be found at 
> https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10691) Make LogisticRegressionModel.evaluate() method public

2015-09-21 Thread Hao Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hao Ren updated SPARK-10691:

Summary: Make LogisticRegressionModel.evaluate() method public  (was: Make 
LogisticRegressionModel's evaluate method public)

> Make LogisticRegressionModel.evaluate() method public
> -
>
> Key: SPARK-10691
> URL: https://issues.apache.org/jira/browse/SPARK-10691
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Hao Ren
>
> The following method in {{LogisticRegressionModel}} is marked as {{private}}, 
> which prevents users from creating a summary on any given data set. Check 
> [here|https://github.com/feynmanliang/spark/blob/d219fa4c216e8f35b71a26921561104d15cd6055/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala#L272].
> {code}
> // TODO: decide on a good name before exposing to public API
> private[classification] def evaluate(dataset: DataFrame)
> : LogisticRegressionSummary = {
> new BinaryLogisticRegressionSummary(
> this.transform(dataset), 
> $(probabilityCol), 
> $(labelCol))
> }
> {code}
> This method is definitely necessary to test model performance.
> By the way, the name {{evaluate}} is already pretty good for me.
> [~mengxr] Could you check this ? Thx



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9836) Provide R-like summary statistics for ordinary least squares via normal equation solver

2015-09-21 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900802#comment-14900802
 ] 

Xiangrui Meng commented on SPARK-9836:
--

This JIRA is still blocked by SPARK-10668. If you are a first-time contributor, 
I would recommend taking a starter task: 
https://issues.apache.org/jira/issues/?filter=12333209.

> Provide R-like summary statistics for ordinary least squares via normal 
> equation solver
> ---
>
> Key: SPARK-9836
> URL: https://issues.apache.org/jira/browse/SPARK-9836
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>
> In R, model fitting comes with summary statistics. We can provide most of 
> those via normal equation solver (SPARK-9834). If some statistics requires 
> additional passes to the dataset, we can expose an option to let users select 
> desired statistics before model fitting. 
> {code}
> > summary(model)
> Call:
> glm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
> Deviance Residuals: 
>      Min       1Q   Median       3Q      Max  
> -1.30711  -0.25713  -0.05325   0.19542   1.41253  
> Coefficients:
>   Estimate Std. Error t value Pr(>|t|)
> (Intercept) 2.2514 0.3698   6.089 9.57e-09 ***
> Sepal.Width 0.8036 0.1063   7.557 4.19e-12 ***
> Speciesversicolor   1.4587 0.1121  13.012  < 2e-16 ***
> Speciesvirginica1.9468 0.1000  19.465  < 2e-16 ***
> ---
> Signif. codes:  
> 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> (Dispersion parameter for gaussian family taken to be 0.1918059)
> Null deviance: 102.168  on 149  degrees of freedom
> Residual deviance:  28.004  on 146  degrees of freedom
> AIC: 183.94
> Number of Fisher Scoring iterations: 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10688) Python API for AFTSurvivalRegression

2015-09-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10688:
--
Labels: starter  (was: )

> Python API for AFTSurvivalRegression
> 
>
> Key: SPARK-10688
> URL: https://issues.apache.org/jira/browse/SPARK-10688
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Xiangrui Meng
>  Labels: starter
>
> After SPARK-10686, we should add Python API for AFTSurvivalRegression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7989) Fix flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite

2015-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900811#comment-14900811
 ] 

Apache Spark commented on SPARK-7989:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/8813

> Fix flaky tests in ExternalShuffleServiceSuite and 
> SparkListenerWithClusterSuite
> 
>
> Key: SPARK-7989
> URL: https://issues.apache.org/jira/browse/SPARK-7989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Critical
>  Labels: flaky-test
> Fix For: 1.4.1, 1.5.0
>
>
> The flaky tests in ExternalShuffleServiceSuite and 
> SparkListenerWithClusterSuite will fail if there are not enough executors up 
> before running the jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8118) Turn off noisy log output produced by Parquet 1.7.0

2015-09-21 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900941#comment-14900941
 ] 

Cheng Lian commented on SPARK-8118:
---

I believe PARQUET-369 is the root cause of this issue, and unfortunately I 
couldn't figure out a way that works all the time to fix the Parquet log 
redirection issue (even through hacky reflection tricks). The best way to fix 
this issue is to stop shading slf4j in parquet-format.

> Turn off noisy log output produced by Parquet 1.7.0
> ---
>
> Key: SPARK-8118
> URL: https://issues.apache.org/jira/browse/SPARK-8118
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
> Fix For: 1.5.0
>
>
> Parquet 1.7.0 renames package name to "org.apache.parquet", need to adjust 
> {{ParquetRelation.enableLogForwarding}} accordingly to avoid noisy log output.
> A better approach than simply muting these log lines is to redirect Parquet 
> logs via SLF4J, so that we can handle them consistently. In general these 
> logs are very useful. Esp. when used to diagnosing Parquet memory issue and 
> filter push-down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats

2015-09-21 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900991#comment-14900991
 ] 

Seth Hendrickson commented on SPARK-10602:
--

My branch is here: 
[SPARK-10641|https://github.com/sethah/spark/tree/SPARK-10641], the 
implementation is mostly 
[here|https://github.com/sethah/spark/blob/SPARK-10641/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala]

Note that I wrote a few tests but they aren't passing currently (also 
scalastyle will fail) as this is a WIP. The update rules are based on 
descriptions 
[here|https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics].
 I also made a pass at implementing a numerically stable update for skewness 
and kurtosis following the [Kahan 
update|https://en.wikipedia.org/wiki/Kahan_summation_algorithm]. The 
implementation follows the algorithms described 
[here|http://researcher.watson.ibm.com/researcher/files/us-ytian/stability.pdf].
 I am not sure that they are working and I haven't been able to find a great 
way to test for numerical stability but I'll try to get that worked out soon.

Also note that the legacy way of implementing aggregates is still required, but 
I have simply issued placeholders for the {{AggregateFunction1}} 
implementations for now. 
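For context, a minimal sketch of the kind of single-pass update rule involved (a 
Welford-style mean/variance update of my own, not the code in the branch above):

{code}
// Single-pass update of count, mean and M2 (sum of squared deviations).
// Sample variance is m2 / (n - 1); the same pattern extends to the third and
// fourth central moments needed for skewness and kurtosis.
case class Moments(n: Long, mean: Double, m2: Double) {
  def add(x: Double): Moments = {
    val n1 = n + 1
    val delta = x - mean
    val newMean = mean + delta / n1
    Moments(n1, newMean, m2 + delta * (x - newMean))
  }
}

val stats = Seq(1.0, 2.0, 4.0, 7.0).foldLeft(Moments(0L, 0.0, 0.0))(_ add _)
val sampleVariance = stats.m2 / (stats.n - 1)
{code}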

> Univariate statistics as UDAFs: single-pass continuous stats
> 
>
> Key: SPARK-10602
> URL: https://issues.apache.org/jira/browse/SPARK-10602
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Joseph K. Bradley
>Assignee: Seth Hendrickson
>
> See parent JIRA for more details.  This subtask covers statistics for 
> continuous values requiring a single pass over the data, such as min and max.
> This JIRA is an umbrella.  For individual stats, please create and link a new 
> JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10729) word2vec model save for python

2015-09-21 Thread Joseph A Gartner III (JIRA)
Joseph A Gartner III created SPARK-10729:


 Summary: word2vec model save for python
 Key: SPARK-10729
 URL: https://issues.apache.org/jira/browse/SPARK-10729
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.5.0, 1.4.1
Reporter: Joseph A Gartner III


The ability to save a word2vec model has not been ported to python, and would 
be extremely useful to have given the long training period.
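For reference, a sketch of the Scala-side calls whose Python counterparts this issue 
asks for (the path is made up, and the corpus is assumed to be an already tokenized RDD):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.rdd.RDD

// Sketch: train, persist and reload a model from Scala; the Python API is
// missing the save/load half of this round trip.
def saveAndReload(sc: SparkContext, corpus: RDD[Seq[String]]): Word2VecModel = {
  val model = new Word2Vec().fit(corpus)
  model.save(sc, "/tmp/word2vec-model")          // hypothetical path
  Word2VecModel.load(sc, "/tmp/word2vec-model")
}
{code}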



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-21 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901043#comment-14901043
 ] 

Andrew Or commented on SPARK-10474:
---

I see, thanks for reporting. I think the fix was correct, but it uncovered new 
problems because previously we would have failed before hitting it. I'll 
investigate again.

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (SPARK-10723) Add RDD.reduceOption method

2015-09-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10723.
---
Resolution: Won't Fix

> Add RDD.reduceOption method
> ---
>
> Key: SPARK-10723
> URL: https://issues.apache.org/jira/browse/SPARK-10723
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Reporter: Tatsuya Atsumi
>Priority: Minor
>
> h2. Problem
> RDD.reduce throws exception if the RDD is empty.
> It is appropriate behavior if RDD is expected to be not empty, but if it is 
> not sure until runtime that the RDD is empty or not, it needs to wrap with 
> try-catch to call reduce safely. 
> Example Code
> {code}
> // This RDD may be empty or not
> val rdd: RDD[Int] = originalRdd.filter(_ > 10)
> val reduced: Option[Int] = try {
>   Some(rdd.reduce(_ + _))
> } catch {
>   // if rdd is empty return None.
>   case e:UnsupportedOperationException => None
> }
> {code}
> h2. Improvement idea
> Scala’s List has a reduceOption method, which returns None if the List is empty.
> If RDD had a reduceOption API like Scala’s List, it would be easy to handle the
> case above.
> Example Code
> {code}
> val reduced: Option[Int] = originalRdd.filter(_ > 10).reduceOption(_ + _)
> {code}
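
Since this was resolved as Won't Fix, the same behavior can be recovered on the
user side. A minimal sketch (not part of the Spark API; the object and class
names are illustrative) built on the existing {{RDD.isEmpty}}:
{code}
import org.apache.spark.rdd.RDD

// Illustrative user-side helper, not a Spark API: emulate reduceOption by
// checking isEmpty first. isEmpty() launches a small job (it takes at most
// one element), so this adds one extra, usually cheap, action.
object RDDReduceOption {
  implicit class RichRDD[T](rdd: RDD[T]) {
    def reduceOption(f: (T, T) => T): Option[T] =
      if (rdd.isEmpty()) None else Some(rdd.reduce(f))
  }
}

// Usage:
//   import RDDReduceOption._
//   val reduced: Option[Int] = originalRdd.filter(_ > 10).reduceOption(_ + _)
{code}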



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10626) Create a Java friendly method for randomRDD & RandomDataGenerator on RandomRDDs.

2015-09-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10626.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8782
[https://github.com/apache/spark/pull/8782]

> Create a Java friendly method for randomRDD & RandomDataGenerator on 
> RandomRDDs.
> 
>
> Key: SPARK-10626
> URL: https://issues.apache.org/jira/browse/SPARK-10626
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: holdenk
>Priority: Minor
> Fix For: 1.6.0
>
>
> SPARK-3136 added a large number of functions for creating Java RandomRDDs, 
> but for people who want to use custom RandomDataGenerators we should add a
> Java-friendly method.
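
For context, a custom generator already works with the existing Scala API; a
brief sketch (the generator class and its name are illustrative, and the exact
signature of the new Java-friendly method is not shown here):
{code}
import org.apache.spark.mllib.random.{RandomDataGenerator, RandomRDDs}

// Illustrative custom generator: fair coin flips as 0/1 Ints.
class CoinFlipGenerator extends RandomDataGenerator[Int] {
  private val rng = new java.util.Random()
  override def nextValue(): Int = if (rng.nextBoolean()) 1 else 0
  override def setSeed(seed: Long): Unit = rng.setSeed(seed)
  override def copy(): CoinFlipGenerator = new CoinFlipGenerator
}

// Scala usage (sc is a SparkContext):
//   val rdd = RandomRDDs.randomRDD(sc, new CoinFlipGenerator, 1000L, 4)
{code}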



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10626) Create a Java friendly method for randomRDD & RandomDataGenerator on RandomRDDs.

2015-09-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10626:
--
Assignee: holdenk

> Create a Java friendly method for randomRDD & RandomDataGenerator on 
> RandomRDDs.
> 
>
> Key: SPARK-10626
> URL: https://issues.apache.org/jira/browse/SPARK-10626
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: holdenk
>Assignee: holdenk
>Priority: Minor
> Fix For: 1.6.0
>
>
> SPARK-3136 added a large number of functions for creating Java RandomRDDs, 
> but for people who want to use custom RandomDataGenerators we should add a
> Java-friendly method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly

2015-09-21 Thread Jon Buffington (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901082#comment-14901082
 ] 

Jon Buffington commented on SPARK-5569:
---

Are there any workarounds for this limitation? We are unable to track offsets
using the Kafka direct stream and use checkpoints. Our thinking is that we need
to abandon checkpoints and manage recovery outside of Spark checkpoints while
this limitation exists.

For reference with 1.4.1, we get the following:
{{monospaced}}
...
15/09/21 13:50:53 WARN CheckpointReader: Error reading checkpoint from file 
file:/tmp/page_view_events_cp/checkpoint-144285750
java.io.IOException: java.lang.ClassNotFoundException: 
org.apache.spark.streaming.kafka.OffsetRange
...
{{monospaced}}
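
For reference, the direct stream exposes the Kafka offsets of each batch, so
they can be persisted outside of checkpoints. A minimal sketch against the
1.4/1.5 streaming-kafka API (the broker address, topic name, and the
saveOffsets stub are placeholders):
{code}
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

object ManualOffsetTracking {
  // Placeholder: in a real job, write the offsets to a durable store
  // (ZooKeeper, a database, ...) together with the batch output.
  def saveOffsets(ranges: Array[OffsetRange]): Unit =
    ranges.foreach(r => println(s"${r.topic}/${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("manual-offsets"), Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("page_view_events"))

    stream.foreachRDD { rdd =>
      // Each direct-stream RDD carries its Kafka offset ranges.
      val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process rdd ...
      saveOffsets(offsets)  // commit only after processing has succeeded
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}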

> Checkpoints cannot reference classes defined outside of Spark's assembly
> 
>
> Key: SPARK-5569
> URL: https://issues.apache.org/jira/browse/SPARK-5569
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Patrick Wendell
>
> Not sure if this is a bug or a feature, but it's not obvious, so wanted to 
> create a JIRA to make sure we document this behavior.
> First documented by Cody Koeninger:
> https://gist.github.com/koeninger/561a61482cd1b5b3600c
> {code}
> 15/01/12 16:07:07 INFO CheckpointReader: Attempting to load checkpoint from 
> file file:/var/tmp/cp/checkpoint-142110041.bk
> 15/01/12 16:07:07 WARN CheckpointReader: Error reading checkpoint from file 
> file:/var/tmp/cp/checkpoint-142110041.bk
> java.io.IOException: java.lang.ClassNotFoundException: 
> org.apache.spark.rdd.kafka.KafkaRDDPartition
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1043)
> at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData.readObject(DStreamCheckpointData.scala:146)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$readObject$1.apply$mcV$sp(DStreamGraph.scala:180)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1040)
> at 
> org.apache.spark.streaming.DStreamGraph.readObject(DStreamGraph.scala:176)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at 

[jira] [Comment Edited] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly

2015-09-21 Thread Jon Buffington (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901082#comment-14901082
 ] 

Jon Buffington edited comment on SPARK-5569 at 9/21/15 5:57 PM:


Are there any workarounds for this limitation? We are unable to track offsets
using the Kafka direct stream and use checkpoints. Our thinking is that we need
to abandon checkpoints and manage recovery outside of Spark checkpoints while
this limitation exists.

For reference with 1.4.1, we get the following:
{{
...
15/09/21 13:50:53 WARN CheckpointReader: Error reading checkpoint from file 
file:/tmp/page_view_events_cp/checkpoint-144285750
java.io.IOException: java.lang.ClassNotFoundException: 
org.apache.spark.streaming.kafka.OffsetRange
...
}}


was (Author: jon_fuseelements):
Are there any workarounds for this limitation? We are unable to track offsets
using the Kafka direct stream and use checkpoints. Our thinking is that we need
to abandon checkpoints and manage recovery outside of Spark checkpoints while
this limitation exists.

For reference with 1.4.1, we get the following:
{{monospaced}}
...
15/09/21 13:50:53 WARN CheckpointReader: Error reading checkpoint from file 
file:/tmp/page_view_events_cp/checkpoint-144285750
java.io.IOException: java.lang.ClassNotFoundException: 
org.apache.spark.streaming.kafka.OffsetRange
...
{{monospaced}}

> Checkpoints cannot reference classes defined outside of Spark's assembly
> 
>
> Key: SPARK-5569
> URL: https://issues.apache.org/jira/browse/SPARK-5569
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Patrick Wendell
>
> Not sure if this is a bug or a feature, but it's not obvious, so wanted to 
> create a JIRA to make sure we document this behavior.
> First documented by Cody Koeninger:
> https://gist.github.com/koeninger/561a61482cd1b5b3600c
> {code}
> 15/01/12 16:07:07 INFO CheckpointReader: Attempting to load checkpoint from 
> file file:/var/tmp/cp/checkpoint-142110041.bk
> 15/01/12 16:07:07 WARN CheckpointReader: Error reading checkpoint from file 
> file:/var/tmp/cp/checkpoint-142110041.bk
> java.io.IOException: java.lang.ClassNotFoundException: 
> org.apache.spark.rdd.kafka.KafkaRDDPartition
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1043)
> at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData.readObject(DStreamCheckpointData.scala:146)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$readObject$1.apply$mcV$sp(DStreamGraph.scala:180)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1040)
> at 
> org.apache.spark.streaming.DStreamGraph.readObject(DStreamGraph.scala:176)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> 

[jira] [Commented] (SPARK-10731) The head() implementation of dataframe is very slow

2015-09-21 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901092#comment-14901092
 ] 

Yin Huai commented on SPARK-10731:
--

Looks like the problem is that df.collect does not work well with limit. In 
Scala, {{df.limit(1).rdd.count()}} will also trigger the problem.

> The head() implementation of dataframe is very slow
> ---
>
> Key: SPARK-10731
> URL: https://issues.apache.org/jira/browse/SPARK-10731
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Jerry Lam
>  Labels: pyspark
>
> df=sqlContext.read.parquet("someparquetfiles")
> df.head()
> The above lines take over 15 minutes. It seems the DataFrame requires 3
> stages to return the first row. It reads all the data (about 1 billion
> rows) and runs Limit twice. The take(1) implementation on the RDD performs
> much better.
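
To illustrate the two paths described above, a small Scala sketch (the parquet
path is a placeholder and sqlContext is assumed to come from a spark-shell):
{code}
// df.head() goes through the limit-based plan reported as slow here, while
// take(1) on the underlying RDD[Row] scans partitions incrementally and
// stops as soon as a row is found.
val df = sqlContext.read.parquet("someparquetfiles")

val viaHead = df.head()       // slow path reported in this issue
val viaTake = df.rdd.take(1)  // RDD-level workaround from the description
{code}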



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

2015-09-21 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10474:
--
Summary: TungstenAggregation cannot acquire memory for pointer array after 
switching to sort-based  (was: TungstenAggregation cannot acquire memory for 
pointer array)

> TungstenAggregation cannot acquire memory for pointer array after switching 
> to sort-based
> -
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In an aggregation case, a Lost task occurred with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> store_sales is a big fact table and item is a small dimension table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based

2015-09-21 Thread Andrew Or (JIRA)
Andrew Or created SPARK-10733:
-

 Summary: TungstenAggregation cannot acquire page after switching 
to sort-based
 Key: SPARK-10733
 URL: https://issues.apache.org/jira/browse/SPARK-10733
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Blocker


This was uncovered after fixing SPARK-10474. Stack trace:

{code}
15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 152.0 
(TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 bytes of 
memory
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488)
at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}
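
Pending the fix, a generic mitigation sketch of the kind commonly tried for
"unable to acquire ... bytes of memory" failures on 1.5; the values are purely
illustrative and are not a fix taken from this issue:
{code}
// Illustrative mitigation only; values are placeholders, not taken from this
// issue. More partitions means less data per task before the sort spills.
sqlContext.setConf("spark.sql.shuffle.partitions", "400")   // default is 200
// Typically combined with more executor memory and, on 1.5, a larger
// spark.shuffle.memoryFraction (default 0.2), set at submit time.
{code}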



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8696) Streaming API for Online LDA

2015-09-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-8696:
-
Summary: Streaming API for Online LDA  (was: StreamingLDA)

> Streaming API for Online LDA
> 
>
> Key: SPARK-8696
> URL: https://issues.apache.org/jira/browse/SPARK-8696
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>
> Streaming LDA is a natural extension of online LDA.
> For now, though, we need to settle the implementation of LDA prediction in
> order to support the predictOn method in the streaming version.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-21 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reopened SPARK-10474:
---

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In an aggregation case, a Lost task occurred with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> store_sales is a big fact table and item is a small dimension table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


