[jira] [Commented] (SPARK-10644) Applications wait even if free executors are available
[ https://issues.apache.org/jira/browse/SPARK-10644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900277#comment-14900277 ] Madhusudanan Kandasamy commented on SPARK-10644: The standalone cluster manager launches executors during an application's registration. So when you say "Number of executors: 63", do you mean the total number of cores available across all the workers? Can you give the values of spark.executor.cores and spark.cores.max? > Applications wait even if free executors are available > -- > > Key: SPARK-10644 > URL: https://issues.apache.org/jira/browse/SPARK-10644 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.5.0 > Environment: RHEL 6.5 64 bit >Reporter: Balagopal Nair > > Number of workers: 21 > Number of executors: 63 > Steps to reproduce: > 1. Run 4 jobs each with max cores set to 10 > 2. The first 3 jobs run with 10 each. (30 executors consumed so far) > 3. The 4th job waits even though there are 33 idle executors. > The reason is that a job will not get executors unless > the total number of EXECUTORS in use < the number of WORKERS > If there are executors available, resources should be allocated to the > pending job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
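For reference, the two properties asked about above are the ones the standalone master uses when carving worker cores into executors. A hypothetical configuration, with values invented purely for illustration (they are not taken from the reporter's setup), might look like:

```
# spark-defaults.conf — example values for illustration only
spark.executor.cores   3     # cores assigned to each executor
spark.cores.max        10    # cap on total cores a single application may hold
```

With settings like these, the master launches executors of 3 cores each until the application's 10-core cap is reached, which is why the per-worker and per-application numbers matter for diagnosing the scheduling behavior described in the report.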
[jira] [Commented] (SPARK-10723) Add RDD.reduceOption method
[ https://issues.apache.org/jira/browse/SPARK-10723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900362#comment-14900362 ] Sean Owen commented on SPARK-10723: --- You can call {{RDD.isEmpty}} to check if it's empty. I don't think this is worth adding another API method for. > Add RDD.reduceOption method > --- > > Key: SPARK-10723 > URL: https://issues.apache.org/jira/browse/SPARK-10723 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Reporter: Tatsuya Atsumi >Priority: Minor > > h2. Problem > RDD.reduce throws an exception if the RDD is empty. > This is appropriate behavior if the RDD is expected to be non-empty, but if > it is not known until runtime whether the RDD is empty, the call needs to be > wrapped in try-catch to invoke reduce safely. > Example Code > {code} > // This RDD may be empty or not > val rdd: RDD[Int] = originalRdd.filter(_ > 10) > val reduced: Option[Int] = try { > Some(rdd.reduce(_ + _)) > } catch { > // if rdd is empty return None. > case e:UnsupportedOperationException => None > } > {code} > h2. Improvement idea > Scala's List has a reduceOption method, which returns None if the List is empty. > If RDD had a reduceOption API like Scala's List, it would become easy to handle > the above case. > Example Code > {code} > val reduced: Option[Int] = originalRdd.filter(_ > 10).reduceOption(_ + _) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
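Absent such an API, the two behaviors discussed above can be combined in user code. A minimal sketch, assuming Spark is on the classpath (the helper name is hypothetical, not part of the Spark API), built on the {{RDD.isEmpty}} check Sean suggests:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical extension method: emulate List.reduceOption by guarding
// reduce with isEmpty. Note that isEmpty launches its own small job, so
// this costs one extra (cheap) action before the reduce runs.
implicit class RddReduceOption[T](val rdd: RDD[T]) extends AnyVal {
  def reduceOption(f: (T, T) => T): Option[T] =
    if (rdd.isEmpty()) None else Some(rdd.reduce(f))
}

// usage: originalRdd.filter(_ > 10).reduceOption(_ + _)
```

Compared with the try-catch version in the report, this avoids using exceptions for control flow, at the price of a second pass to the cluster for the emptiness check.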
[jira] [Resolved] (SPARK-10711) Do not assume spark.submit.deployMode is always set
[ https://issues.apache.org/jira/browse/SPARK-10711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-10711. - Resolution: Fixed Assignee: Hossein Falaki Fix Version/s: 1.6.0 > Do not assume spark.submit.deployMode is always set > --- > > Key: SPARK-10711 > URL: https://issues.apache.org/jira/browse/SPARK-10711 > Project: Spark > Issue Type: Bug > Components: R >Affects Versions: 1.5.0 >Reporter: Hossein Falaki >Assignee: Hossein Falaki >Priority: Critical > Fix For: 1.6.0, 1.5.1 > > > in RRDD.createRProcess() we call RUtils.sparkRPackagePath(), which assumes > "... that Spark properties `spark.master` and `spark.submit.deployMode` are > set." > It is better to assume safe defaults if they are not set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10740) handle nondeterministic expressions correctly for set operations
[ https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-10740: Description: We should only push down deterministic filter conditions to set operators. For Union, say we apply a non-deterministic filter on 1...5 union 1...5: we may get 1,3 from the left side and 2,4 from the right side, so the result should be 1,3,2,4. If we push down this filter, we get 1,3 for both sides and the result would be 1,3,1,3. For Intersect, say there is a non-deterministic condition with a 0.5 probability of accepting a row, and a row that is present on both sides of the Intersect. Once we push down this condition, the probability of accepting that row drops to 0.25. For Except, say there is a row that is present on both sides of an Except. This row should not be in the final output. However, if we push down a non-deterministic condition, it is possible that the row is rejected on one side, and we then output a row that should not be part of the result. We should only push down deterministic projections to Union. > handle nondeterministic expressions correctly for set operations > > > Key: SPARK-10740 > URL: https://issues.apache.org/jira/browse/SPARK-10740 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > > We should only push down deterministic filter conditions to set operators. > For Union, say we apply a non-deterministic filter on 1...5 union 1...5: we > may get 1,3 from the left side and 2,4 from the right side, so the result > should be 1,3,2,4. If we push down this filter, we get 1,3 for both sides > and the result would be 1,3,1,3. > For Intersect, say there is a non-deterministic condition with a 0.5 > probability of accepting a row, and a row that is present on both sides of > the Intersect. Once we push down this condition, the probability of accepting > that row drops to 0.25. 
> For Except, say there is a row that is present on both sides of an Except. > This row should not be in the final output. However, if we push down a > non-deterministic condition, it is possible that the row is rejected on one > side, and we then output a row that should not be part of the result. > We should only push down deterministic projections to Union. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
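The Union case above can be mimicked with plain Scala collections to see why the evaluation point matters. This is an illustrative sketch only, with a coin-flip predicate standing in for a non-deterministic SQL expression; it is not Spark code:

```scala
import scala.util.Random

// A 50% "non-deterministic" filter. Applied once above the union, each of
// the ten rows gets a single independent draw. Pushed below the union, the
// two sides draw independently over the same values, so the duplicate
// structure of the output is distorted — the 1,3,2,4 vs 1,3,1,3 discrepancy
// described in the issue.
val cond = (_: Int) => Random.nextBoolean()
val left, right = Seq(1, 2, 3, 4, 5)

val filterAboveUnion = (left ++ right).filter(cond)            // intended semantics
val filterPushedDown = left.filter(cond) ++ right.filter(cond) // changed semantics
```

Running both repeatedly shows they sample different distributions over multisets, which is exactly why the optimizer must restrict this pushdown to deterministic conditions.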
[jira] [Updated] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working
[ https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian updated SPARK-10741: Description: Failed Query with Having Clause {code} def testParquetHaving() { val ddl = """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val failedHaving = """ SELECT c1, avg ( c2 ) as c_avg | FROM test | GROUP BY c1 | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin TestHive.sql(ddl) TestHive.sql(failedHaving).collect } org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as bigint)) > cast(5 as double)) as boolean) AS havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) {code} Failed Query with OrderBy {code} @Test def testParquetOrderBy() { val ddl = """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val failedOrderBy = """ SELECT c1, avg ( c2 ) c_avg | FROM test | GROUP BY c1 | ORDER BY avg ( c2 )""".stripMargin TestHive.sql(ddl) TestHive.sql(failedOrderBy).collect } org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) {code} was: Failed Query with Having Clause {code} def testParquetHaving() { val ddl = """CREATE TABLE IF 
NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val failedHaving = """ SELECT c1, avg ( c2 ) as c_avg | FROM test | GROUP BY c1 | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin TestHive.sql(ddl) TestHive.sql(failedHaving).collect } org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as bigint)) > cast(5 as double)) as boolean) AS havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) {code} Failed Query with OrderBy {code} val ddl = """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val filedOrderBy = """ SELECT c1, avg ( c2 ) c_avg | FROM test | GROUP BY c1 | ORDER BY avg ( c2 )""".stripMargin TestHive.sql(ddl) TestHive.sql(filedOrderBy).collect org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) {code} > Hive Query Having/OrderBy against Parquet table is not working > --- > > Key: SPARK-10741 > URL: https://issues.apache.org/jira/browse/SPARK-10741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Ian > > Failed Query with Having Clause > {code} > def testParquetHaving() { > val ddl = > """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) 
STORED AS > PARQUET""" > val failedHaving = > """ SELECT c1, avg ( c2 ) as c_avg > | FROM test > | GROUP BY c1 > | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin > TestHive.sql(ddl) > TestHive.sql(failedHaving).collect > } > org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing > from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as > bigint)) > cast(5
[jira] [Updated] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working
[ https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian updated SPARK-10741: Description: Failed Query with Having Clause {code} def testParquetHaving() { val ddl = """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val failedHaving = """ SELECT c1, avg ( c2 ) as c_avg | FROM test | GROUP BY c1 | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin TestHive.sql(ddl) TestHive.sql(failedHaving).collect } org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as bigint)) > cast(5 as double)) as boolean) AS havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) {code} Failed Query with OrderBy {code} def testParquetOrderBy() { val ddl = """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val failedOrderBy = """ SELECT c1, avg ( c2 ) c_avg | FROM test | GROUP BY c1 | ORDER BY avg ( c2 )""".stripMargin TestHive.sql(ddl) TestHive.sql(failedOrderBy).collect } org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) {code} was: Failed Query with Having Clause {code} def testParquetHaving() { val ddl = """CREATE TABLE IF NOT 
EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val failedHaving = """ SELECT c1, avg ( c2 ) as c_avg | FROM test | GROUP BY c1 | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin TestHive.sql(ddl) TestHive.sql(failedHaving).collect } org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as bigint)) > cast(5 as double)) as boolean) AS havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) {code} Failed Query with OrderBy {code} @Test def testParquetOrderBy() { val ddl = """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val failedOrderBy = """ SELECT c1, avg ( c2 ) c_avg | FROM test | GROUP BY c1 | ORDER BY avg ( c2 )""".stripMargin TestHive.sql(ddl) TestHive.sql(failedOrderBy).collect } org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) {code} > Hive Query Having/OrderBy against Parquet table is not working > --- > > Key: SPARK-10741 > URL: https://issues.apache.org/jira/browse/SPARK-10741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Ian > > Failed Query with Having Clause > {code} > def testParquetHaving() { > val ddl = > """CREATE TABLE IF NOT 
EXISTS test ( c1 string, c2 int ) STORED AS > PARQUET""" > val failedHaving = > """ SELECT c1, avg ( c2 ) as c_avg > | FROM test > | GROUP BY c1 > | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin > TestHive.sql(ddl) > TestHive.sql(failedHaving).collect > } > org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing > from c1#17,c2#18 in operator !Aggregate [c1#17],
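One rewrite that is sometimes tried for unresolved-attribute errors like the two above is to reference the select-list alias instead of repeating the aggregate expression. Whether it sidesteps this particular analyzer bug on the affected 1.5.0 build is not verified here, so treat this purely as a sketch in the style of the reported test cases:

```scala
// Hypothetical workaround sketch: order by the alias c_avg rather than
// re-evaluating avg(c2) inside the ORDER BY clause.
val rewritten =
  """ SELECT c1, avg ( c2 ) as c_avg
    | FROM test
    | GROUP BY c1
    | ORDER BY c_avg""".stripMargin
TestHive.sql(rewritten).collect
```

If the alias-based form also fails, that would point at the aggregate-resolution path in {{CheckAnalysis}} rather than at the duplicated aggregate expression specifically.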
[jira] [Assigned] (SPARK-10742) Add the ability to embed HTML relative links in job descriptions
[ https://issues.apache.org/jira/browse/SPARK-10742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10742: Assignee: Apache Spark (was: Tathagata Das) > Add the ability to embed HTML relative links in job descriptions > > > Key: SPARK-10742 > URL: https://issues.apache.org/jira/browse/SPARK-10742 > Project: Spark > Issue Type: Improvement >Reporter: Tathagata Das >Assignee: Apache Spark >Priority: Minor > > This is to allow embedding links to other Spark UI tabs within the job > description. For example, streaming jobs could set descriptions with links > pointing to the corresponding details page of the batch that the job belongs > to. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10743) keep the name of expression if possible when do cast
Wenchen Fan created SPARK-10743: --- Summary: keep the name of expression if possible when do cast Key: SPARK-10743 URL: https://issues.apache.org/jira/browse/SPARK-10743 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10739) Add attempt window for long running Spark application on Yarn
[ https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901901#comment-14901901 ] Sandy Ryza commented on SPARK-10739: I recall there was a JIRA similar to this that avoided killing the application when we reached a certain number of executor failures. However, IIUC, this is about something different: deciding whether to have YARN restart the application when it fails. > Add attempt window for long running Spark application on Yarn > - > > Key: SPARK-10739 > URL: https://issues.apache.org/jira/browse/SPARK-10739 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: Saisai Shao >Priority: Minor > > Currently Spark on Yarn uses max attempts to control the failure number, if > application's failure number reaches to the max attempts, application will > not be recovered by RM, it is not very effective for long running > applications, since it will easily exceed the max number at a long time > period, also setting a very large max attempts will hide the real problem. > So here introduce an attempt window to control the application attempt times, > this will ignore the out of window attempts, it is introduced in Hadoop 2.6+ > to support long running application, it is quite useful for Spark Streaming, > Spark shell like applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901959#comment-14901959 ] Yi Zhou commented on SPARK-10733: - 1) I have never set `spark.task.cpus` and only use by default. 2) scale factor=1000 (1TB data set) 3) Spark conf like below spark.shuffle.manager=sort spark.sql.hive.metastore.version=1.1.0 spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/* spark.executor.extraClassPath=/etc/hive/conf spark.driver.extraClassPath=/etc/hive/conf spark.serializer=org.apache.spark.serializer.KryoSerializer > TungstenAggregation cannot acquire page after switching to sort-based > - > > Key: SPARK-10733 > URL: https://issues.apache.org/jira/browse/SPARK-10733 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Blocker > > This is uncovered after fixing SPARK-10474. Stack trace: > {code} > 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage > 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 > bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) 
> at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901959#comment-14901959 ] Yi Zhou edited comment on SPARK-10733 at 9/22/15 5:18 AM: -- 1) I have never set `spark.task.cpus` and only use by default. 2) scale factor=1000 (1TB data set) 3) Spark conf like below spark.shuffle.manager=sort spark.sql.hive.metastore.version=1.1.0 spark.sql.hive.metastore.jars=/usr/lib/hive/lib/\*:/usr/lib/hadoop/client/\* spark.executor.extraClassPath=/etc/hive/conf spark.driver.extraClassPath=/etc/hive/conf spark.serializer=org.apache.spark.serializer.KryoSerializer was (Author: jameszhouyi): 1) I have never set `spark.task.cpus` and only use by default. 2) scale factor=1000 (1TB data set) 3) Spark conf like below spark.shuffle.manager=sort spark.sql.hive.metastore.version=1.1.0 spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/* spark.executor.extraClassPath=/etc/hive/conf spark.driver.extraClassPath=/etc/hive/conf spark.serializer=org.apache.spark.serializer.KryoSerializer > TungstenAggregation cannot acquire page after switching to sort-based > - > > Key: SPARK-10733 > URL: https://issues.apache.org/jira/browse/SPARK-10733 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Blocker > > This is uncovered after fixing SPARK-10474. 
Stack trace: > {code} > 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage > 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 > bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10731) The head() implementation of dataframe is very slow
[ https://issues.apache.org/jira/browse/SPARK-10731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10731: Assignee: Apache Spark > The head() implementation of dataframe is very slow > --- > > Key: SPARK-10731 > URL: https://issues.apache.org/jira/browse/SPARK-10731 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.1, 1.5.0 >Reporter: Jerry Lam >Assignee: Apache Spark > Labels: pyspark > > {code} > df=sqlContext.read.parquet("someparquetfiles") > df.head() > {code} > The above lines take over 15 minutes. It seems the dataframe requires 3 > stages to return the first row. It reads all data (which is about 1 billion > rows) and run Limit twice. The take(1) implementation in the RDD performs > much better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10731) The head() implementation of dataframe is very slow
[ https://issues.apache.org/jira/browse/SPARK-10731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901983#comment-14901983 ] Apache Spark commented on SPARK-10731: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/8852 > The head() implementation of dataframe is very slow > --- > > Key: SPARK-10731 > URL: https://issues.apache.org/jira/browse/SPARK-10731 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.1, 1.5.0 >Reporter: Jerry Lam > Labels: pyspark > > {code} > df=sqlContext.read.parquet("someparquetfiles") > df.head() > {code} > The above lines take over 15 minutes. It seems the dataframe requires 3 > stages to return the first row. It reads all data (which is about 1 billion > rows) and run Limit twice. The take(1) implementation in the RDD performs > much better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
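Until a fix lands, the take(1)-based path the reporter compares against can be reached explicitly by dropping to the underlying RDD. A sketch in Scala (the Parquet path is the placeholder from the report; everything else is an assumption about the reporter's setup):

```scala
// Workaround sketch: take(1) on the RDD stops after the first non-empty
// partition, instead of executing the multi-stage Limit plan that makes
// DataFrame.head() scan far more data than needed here.
val df = sqlContext.read.parquet("someparquetfiles")
val firstRow: Option[org.apache.spark.sql.Row] = df.rdd.take(1).headOption
```

This trades the DataFrame API's optimized plan for the RDD's incremental partition evaluation, which is the behavior the issue says performs much better for this workload.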
[jira] [Assigned] (SPARK-10731) The head() implementation of dataframe is very slow
[ https://issues.apache.org/jira/browse/SPARK-10731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10731: Assignee: (was: Apache Spark) > The head() implementation of dataframe is very slow > --- > > Key: SPARK-10731 > URL: https://issues.apache.org/jira/browse/SPARK-10731 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.1, 1.5.0 >Reporter: Jerry Lam > Labels: pyspark > > {code} > df=sqlContext.read.parquet("someparquetfiles") > df.head() > {code} > The above lines take over 15 minutes. It seems the dataframe requires 3 > stages to return the first row. It reads all data (which is about 1 billion > rows) and run Limit twice. The take(1) implementation in the RDD performs > much better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10739) Add attempt window for long running Spark application on Yarn
[ https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901944#comment-14901944 ] Saisai Shao commented on SPARK-10739: - [~srowen] I'm not sure the previous JIRA you mentioned is exactly the same thing as what I describe here. This JIRA is about letting YARN know that this is a long-running service so that it ignores out-of-window failures (YARN-611), which is probably different from what you were referring to. > Add attempt window for long running Spark application on Yarn > - > > Key: SPARK-10739 > URL: https://issues.apache.org/jira/browse/SPARK-10739 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: Saisai Shao >Priority: Minor > > Currently Spark on Yarn uses max attempts to control the failure number, if > application's failure number reaches to the max attempts, application will > not be recovered by RM, it is not very effective for long running > applications, since it will easily exceed the max number at a long time > period, also setting a very large max attempts will hide the real problem. > So here introduce an attempt window to control the application attempt times, > this will ignore the out of window attempts, it is introduced in Hadoop 2.6+ > to support long running application, it is quite useful for Spark Streaming, > Spark shell like applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9595) Adding API to SparkConf for kryo serializers registration
[ https://issues.apache.org/jira/browse/SPARK-9595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901979#comment-14901979 ] John Chen commented on SPARK-9595: -- Yes, thanks a lot. > Adding API to SparkConf for kryo serializers registration > - > > Key: SPARK-9595 > URL: https://issues.apache.org/jira/browse/SPARK-9595 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.1, 1.4.1 >Reporter: John Chen >Priority: Minor > Original Estimate: 168h > Remaining Estimate: 168h > > Currently SparkConf has a registerKryoClasses API for kryo registration. > However, this only works when you register classes. If you want to register > customized kryo serializers, you'll have to extend the KryoSerializer class > and write some code. > This is not only very inconvenient, but also requires the registration to be > done at compile time, which is not always possible. Thus, I suggest adding > another API to SparkConf for registering customized kryo serializers. It > could be like this: > def registerKryoSerializers(serializers: Map[Class[_], Serializer]): SparkConf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
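For context, the compile-time route the description refers to goes through a custom KryoRegistrator wired up via configuration. A sketch of that existing workaround, with MyType and MySerializer as placeholder names (they are not real classes):

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Today's route: attach a serializer instance to a class in code, which
// fixes the registration at compile time — the limitation the issue is
// complaining about. MyType and MySerializer are hypothetical.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[MyType], new MySerializer)
  }
}

// Wired up through configuration:
// conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
//     .set("spark.kryo.registrator", "com.example.MyRegistrator")
```

The proposed registerKryoSerializers API would let the same mapping be supplied at runtime as a plain Map, without defining a registrator class.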
[jira] [Created] (SPARK-10739) Add attempt window for long running Spark application on Yarn
Saisai Shao created SPARK-10739: --- Summary: Add attempt window for long running Spark application on Yarn Key: SPARK-10739 URL: https://issues.apache.org/jira/browse/SPARK-10739 Project: Spark Issue Type: Improvement Components: YARN Reporter: Saisai Shao Priority: Minor Currently Spark on Yarn uses max attempts to control the failure count: if an application's failure count reaches the max attempts, the application will not be recovered by the RM. This is not very effective for long running applications, since they can easily exceed the max number over a long time period, and setting a very large max attempts will hide the real problem. So this introduces an attempt window to control application attempts: out-of-window attempts are ignored. The window was introduced in Hadoop 2.6+ to support long running applications and is quite useful for Spark Streaming, Spark shell and similar applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
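For reference, a sketch of how the proposed attempt window could be used from spark-submit. The exact property name below is an assumption based on how this feature later surfaced in Spark's YARN documentation, not a quote from this ticket; it requires Hadoop 2.6+:

```
# Hedged sketch: allow up to 4 AM attempts, but only count failures
# that occurred within the last hour. The validity-interval key is an
# assumed name; com.example.StreamingApp and app.jar are placeholders.
spark-submit \
  --master yarn \
  --conf spark.yarn.maxAppAttempts=4 \
  --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
  --class com.example.StreamingApp app.jar
```

With such a window, a streaming job that fails a few times per month never accumulates enough recent failures to be declared failed by the RM, while a crash loop still exhausts the attempts quickly.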
[jira] [Commented] (SPARK-10739) Add attempt window for long running Spark application on Yarn
[ https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901656#comment-14901656 ] Apache Spark commented on SPARK-10739: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/8857 > Add attempt window for long running Spark application on Yarn > - > > Key: SPARK-10739 > URL: https://issues.apache.org/jira/browse/SPARK-10739 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: Saisai Shao >Priority: Minor > > Currently Spark on Yarn uses max attempts to control the failure number, if > application's failure number reaches to the max attempts, application will > not be recovered by RM, it is not very effective for long running > applications, since it will easily exceed the max number at a long time > period, also setting a very large max attempts will hide the real problem. > So here introduce an attempt window to control the application attempt times, > this will ignore the out of window attempts, it is introduced in Hadoop 2.6+ > to support long running application, it is quite useful for Spark Streaming, > Spark shell like applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10739) Add attempt window for long running Spark application on Yarn
[ https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10739: Assignee: (was: Apache Spark) > Add attempt window for long running Spark application on Yarn > - > > Key: SPARK-10739 > URL: https://issues.apache.org/jira/browse/SPARK-10739 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: Saisai Shao >Priority: Minor > > Currently Spark on Yarn uses max attempts to control the failure number, if > application's failure number reaches to the max attempts, application will > not be recovered by RM, it is not very effective for long running > applications, since it will easily exceed the max number at a long time > period, also setting a very large max attempts will hide the real problem. > So here introduce an attempt window to control the application attempt times, > this will ignore the out of window attempts, it is introduced in Hadoop 2.6+ > to support long running application, it is quite useful for Spark Streaming, > Spark shell like applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10740) handle nondeterministic expressions correctly for set operations
[ https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-10740: Description: We should only push down deterministic filter conditions to set operators. For Union, say we apply a non-deterministic filter to 1...5 union 1...5: we may get 1,3 from the left side and 2,4 from the right side, so the result should be 1,3,2,4. If we push down this filter, we get 1,3 from both sides (we create a new random object with the same seed on each side) and the result would be 1,3,1,3. For Intersect, say there is a non-deterministic condition with a 0.5 probability of accepting a row, and a row that is present on both sides of an Intersect. Once we push down this condition, the probability of accepting this row drops to 0.25. For Except, say there is a row that is present on both sides of an Except. This row should not be in the final output. However, if we push down a non-deterministic condition, it is possible that this row is rejected on one side, and then we output a row that should not be part of the result. We should only push down deterministic projections to Union. was: We should only push down deterministic filter conditions to set operators. For Union, say we apply a non-deterministic filter to 1...5 union 1...5: we may get 1,3 from the left side and 2,4 from the right side, so the result should be 1,3,2,4. If we push down this filter, we get 1,3 from both sides and the result would be 1,3,1,3. For Intersect, say there is a non-deterministic condition with a 0.5 probability of accepting a row, and a row that is present on both sides of an Intersect. Once we push down this condition, the probability of accepting this row drops to 0.25. For Except, say there is a row that is present on both sides of an Except. This row should not be in the final output.
However, if we push down a non-deterministic condition, it is possible that this row is rejected on one side, and then we output a row that should not be part of the result. We should only push down deterministic projections to Union. > handle nondeterministic expressions correctly for set operations > > > Key: SPARK-10740 > URL: https://issues.apache.org/jira/browse/SPARK-10740 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > > We should only push down deterministic filter conditions to set operators. > For Union, say we apply a non-deterministic filter to 1...5 union 1...5: > we may get 1,3 from the left side and 2,4 from the right side, so the > result should be 1,3,2,4. If we push down this filter, we get 1,3 from both > sides (we create a new random object with the same seed on each side) and the > result would be 1,3,1,3. > For Intersect, say there is a non-deterministic condition with a 0.5 > probability of accepting a row, and a row that is present on both sides of > an Intersect. Once we push down this condition, the probability of accepting > this row drops to 0.25. > For Except, say there is a row that is present on both sides of an > Except. This row should not be in the final output. However, if we push down a > non-deterministic condition, it is possible that this row is rejected on > one side, and then we output a row that should not be part of the result. > We should only push down deterministic projections to Union. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
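The Union case above can be simulated outside Spark. This is an illustrative plain-Python sketch (not Spark code) of why re-creating a random generator with the same seed on each side of a union duplicates rows:

```python
import random

left = [1, 2, 3, 4, 5]
right = [1, 2, 3, 4, 5]

def nondeterministic_filter(rows, seed):
    # Mimics a non-deterministic predicate: each row is kept with
    # probability 0.5, driven by a seeded random generator.
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < 0.5]

# Correct plan: one filter over the unioned rows (one generator).
after_union = nondeterministic_filter(left + right, seed=42)

# Pushed-down plan: each side builds a fresh generator with the same
# seed, so both sides make identical accept/reject decisions.
pushed_left = nondeterministic_filter(left, seed=42)
pushed_right = nondeterministic_filter(right, seed=42)

print(pushed_left == pushed_right)  # True: the duplication problem
print(after_union)
print(pushed_left + pushed_right)
```

The pushed-down plan always emits each accepted left-side value twice, which is exactly the `1,3,1,3` behaviour described in the ticket.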
[jira] [Updated] (SPARK-10310) [Spark SQL] All result records will be populated into ONE line during the script transform due to missing the correct line/field delimiter
[ https://issues.apache.org/jira/browse/SPARK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-10310: --- Assignee: zhichao-li > [Spark SQL] All result records will be populated into ONE line during the > script transform due to missing the correct line/field delimiter > -- > > Key: SPARK-10310 > URL: https://issues.apache.org/jira/browse/SPARK-10310 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yi Zhou >Assignee: zhichao-li >Priority: Critical > > There is a real case using a Python stream script in a Spark SQL query. We found > that all result records were written in ONE line as input from the "select" > pipeline to the Python script, so the script cannot identify each > record. Also, the field separator in Spark SQL is '^A' or '\001', which is > inconsistent/incompatible with the '\t' used in the Hive implementation. > Key query: > {code:sql} > CREATE VIEW temp1 AS > SELECT * > FROM > ( > FROM > ( > SELECT > c.wcs_user_sk, > w.wp_type, > (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec > FROM web_clickstreams c, web_page w > WHERE c.wcs_web_page_sk = w.wp_web_page_sk > AND c.wcs_web_page_sk IS NOT NULL > AND c.wcs_user_sk IS NOT NULL > AND c.wcs_sales_sk IS NULL --abandoned implies: no sale > DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec > ) clicksAnWebPageType > REDUCE > wcs_user_sk, > tstamp_inSec, > wp_type > USING 'python sessionize.py 3600' > AS ( > wp_type STRING, > tstamp BIGINT, > sessionid STRING) > ) sessionized > {code} > Key Python script: > {noformat} > for line in sys.stdin: > user_sk, tstamp_str, value = line.strip().split("\t") > {noformat} > Sample SELECT result: > {noformat} > ^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview > {noformat} > Expected result: > {noformat} > 31 3237764860 feedback > 31 3237769106 dynamic > 31 3237779027 review > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail:
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
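The delimiter mismatch above is easy to demonstrate. A minimal plain-Python sketch, mirroring the split in the ticket's sessionize.py, shows that splitting on '\t' fails on a '\001'-separated record while splitting on the actual separator recovers the fields:

```python
# A record as Spark SQL's script transform emits it ('\001' field
# separator), versus the '\t' the Hive-style script expects.
record = "31\x013237764860\x01feedback"

# Splitting on '\t' (as sessionize.py does) yields one unparsed blob...
assert record.strip().split("\t") == ["31\x013237764860\x01feedback"]

# ...while splitting on the actual separator recovers the three fields.
user_sk, tstamp_str, value = record.strip().split("\x01")
print(user_sk, tstamp_str, value)  # 31 3237764860 feedback
```

This is why the script never sees distinct records: with no matching line/field delimiters, the whole output arrives as one undecodable string.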
[jira] [Created] (SPARK-10727) Dataframe count is zero after 'except' operation
Satish Kolli created SPARK-10727: Summary: Dataframe count is zero after 'except' operation Key: SPARK-10727 URL: https://issues.apache.org/jira/browse/SPARK-10727 Project: Spark Issue Type: Bug Reporter: Satish Kolli Data frame count after the except operation is always returning zero even when there is data in the resulting data frame. {code} scala> val df1 = sc.parallelize(1 to 5).toDF("V1") df1: org.apache.spark.sql.DataFrame = [V1: int] scala> val df2 = sc.parallelize(2 to 5).toDF("V2") df2: org.apache.spark.sql.DataFrame = [V2: int] scala> df1.except(df2).show() +---+ | V1| +---+ | 1| +---+ scala> df1.except(df2).count() res4: Long = 0 scala> df1.except(df2).rdd.count() res5: Long = 1 scala> {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10474) Aggregation failed with unable to acquire memory
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900562#comment-14900562 ] Yi Zhou commented on SPARK-10474: - Hi [~andrewor14] I found a new error like the one below after applying this PR. Not sure if someone has already reported it or if I missed another PR related to the issue below. 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 bytes of memory at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488) at org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at
org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) > Aggregation failed with unable to acquire memory > > > Key: SPARK-10474 > URL: https://issues.apache.org/jira/browse/SPARK-10474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.6.0, 1.5.1 > > > In aggregation case, a Lost task happened with below error. 
> {code} > java.io.IOException: Could not acquire 65536 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at >
[jira] [Commented] (SPARK-10726) Using dynamic executor allocation, when jobs are submitted in parallel, executors will be removed before tasks finish
[ https://issues.apache.org/jira/browse/SPARK-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900583#comment-14900583 ] Apache Spark commented on SPARK-10726: -- User 'KaiXinXiaoLei' has created a pull request for this issue: https://github.com/apache/spark/pull/8848 > Using dynamic executor allocation, when jobs are submitted in parallel, > executors will be removed before tasks finish > --- > > Key: SPARK-10726 > URL: https://issues.apache.org/jira/browse/SPARK-10726 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: KaiXinXIaoLei > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9883) Distance to each cluster given a point
[ https://issues.apache.org/jira/browse/SPARK-9883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900648#comment-14900648 ] Apache Spark commented on SPARK-9883: - User 'BertrandDechoux' has created a pull request for this issue: https://github.com/apache/spark/pull/8849 > Distance to each cluster given a point > -- > > Key: SPARK-9883 > URL: https://issues.apache.org/jira/browse/SPARK-9883 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Bertrand Dechoux >Priority: Minor > > Right now KMeansModel provides only a 'predict' method which returns the > index of the closest cluster. > https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/KMeansModel.html#predict(org.apache.spark.mllib.linalg.Vector) > It would be nice to have a method giving the distance to all clusters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
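What the requested method would compute can be sketched without Spark. This is a hypothetical plain-Python helper (the name `distances_to_centers` is made up for illustration, not the PR's API) returning the Euclidean distance from one point to every cluster center, where `predict` only returns the argmin index:

```python
import math

def distances_to_centers(point, centers):
    # Euclidean distance from `point` to each cluster center.
    return [math.sqrt(sum((p - c) ** 2 for p, c in zip(point, center)))
            for center in centers]

centers = [(0.0, 0.0), (3.0, 4.0)]
d = distances_to_centers((0.0, 0.0), centers)
print(d)  # [0.0, 5.0]

# predict()-style behaviour is just the index of the smallest distance.
closest = min(range(len(d)), key=d.__getitem__)
print(closest)  # 0
```

Exposing the full distance vector lets callers implement soft assignment or outlier scoring on top of the existing model, instead of only the hard argmin that `predict` gives.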
[jira] [Assigned] (SPARK-9883) Distance to each cluster given a point
[ https://issues.apache.org/jira/browse/SPARK-9883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9883: --- Assignee: (was: Apache Spark) > Distance to each cluster given a point > -- > > Key: SPARK-9883 > URL: https://issues.apache.org/jira/browse/SPARK-9883 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Bertrand Dechoux >Priority: Minor > > Right now KMeansModel provides only a 'predict 'method which returns the > index of the closest cluster. > https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/KMeansModel.html#predict(org.apache.spark.mllib.linalg.Vector) > It would be nice to have a method giving the distance to all clusters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10727) Dataframe count is zero after 'except' operation
[ https://issues.apache.org/jira/browse/SPARK-10727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Satish Kolli updated SPARK-10727: - Affects Version/s: 1.5.0 > Dataframe count is zero after 'except' operation > > > Key: SPARK-10727 > URL: https://issues.apache.org/jira/browse/SPARK-10727 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.0 >Reporter: Satish Kolli > > Data frame count after the except operation is always returning zero even > when there is data in the resulting data frame. > {code} > scala> val df1 = sc.parallelize(1 to 5).toDF("V1") > df1: org.apache.spark.sql.DataFrame = [V1: int] > scala> val df2 = sc.parallelize(2 to 5).toDF("V2") > df2: org.apache.spark.sql.DataFrame = [V2: int] > scala> df1.except(df2).show() > +---+ > | V1| > +---+ > | 1| > +---+ > scala> df1.except(df2).count() > res4: Long = 0 > scala> df1.except(df2).rdd.count() > res5: Long = 1 > scala> > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10726) Using dynamic-executor-allocation,When jobs are submitted parallelly, executors will be removed before tasks finish
KaiXinXIaoLei created SPARK-10726: - Summary: Using dynamic-executor-allocation,When jobs are submitted parallelly, executors will be removed before tasks finish Key: SPARK-10726 URL: https://issues.apache.org/jira/browse/SPARK-10726 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.0 Reporter: KaiXinXIaoLei Fix For: 1.6.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9835) Iteratively reweighted least squares solver for GLMs
[ https://issues.apache.org/jira/browse/SPARK-9835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900694#comment-14900694 ] Mohamed Baddar commented on SPARK-9835: --- Can I work on this issue? Thanks > Iteratively reweighted least squares solver for GLMs > > > Key: SPARK-9835 > URL: https://issues.apache.org/jira/browse/SPARK-9835 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > After SPARK-9834, we can implement iteratively reweighted least squares > (IRLS) solver for GLMs with other families and link functions. It could > provide R-like summary statistics after training, but the number of features > cannot be very large, e.g. more than 4096. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
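The IRLS idea referenced above can be illustrated without Spark. Below is a toy plain-Python sketch (an assumption-laden illustration, not the proposed Spark ML code) of the IRLS/Newton loop for one GLM family: Bernoulli with the canonical logit link, a single coefficient, and no intercept:

```python
import math

def irls_logistic(xs, ys, iters=25):
    # IRLS for logistic regression with canonical link coincides with
    # Newton's method: each step is a weighted least-squares update with
    # working weights W_ii = p_i * (1 - p_i).
    beta = 0.0
    for _ in range(iters):
        ps = [1.0 / (1.0 + math.exp(-beta * x)) for x in xs]
        grad = sum(x * (y - p) for x, y, p in zip(xs, ys, ps))
        hess = sum(x * x * p * (1.0 - p) for x, p in zip(xs, ps))
        beta += grad / hess  # the weighted least-squares step
    return beta

# Toy data: larger x tends to mean y = 1, so the fitted slope is positive.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0, 0, 1, 1]
beta = irls_logistic(xs, ys)
```

Generalizing to other families and links, as the ticket proposes, means swapping in that family's variance function and link derivative when forming the working weights and working response; the per-iteration solve stays a (weighted) least-squares problem, which is why the feature count must stay modest.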
[jira] [Created] (SPARK-10725) HiveSparkSubmitSuite fails with NoClassDefFoundError
Josiah Samuel Sathiadass created SPARK-10725: Summary: HiveSparkSubmitSuite fails with NoClassDefFoundError Key: SPARK-10725 URL: https://issues.apache.org/jira/browse/SPARK-10725 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Environment: IBM Power Servers running Ubuntu 14.04 LE with IBM J9 jdk Reporter: Josiah Samuel Sathiadass Priority: Minor HiveSparkSubmitSuite fails with NoClassDefFoundError as follows: Exception in thread "main" java.lang.NoClassDefFoundError: org.apache.spark.sql.hive.test.TestHiveContext The TestHiveContext class is part of the sql_hive jar. Since this jar is not part of the list of dependencies, we encounter this issue. So by adding the respective jar to the dependency list, we should be able to successfully run all the tests under HiveSparkSubmitSuite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10725) HiveSparkSubmitSuite fails with NoClassDefFoundError
[ https://issues.apache.org/jira/browse/SPARK-10725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10725: Assignee: (was: Apache Spark) > HiveSparkSubmitSuite fails with NoClassDefFoundError > > > Key: SPARK-10725 > URL: https://issues.apache.org/jira/browse/SPARK-10725 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: IBM Power Servers running Ubuntu 14.04 LE with IBM J9 jdk >Reporter: Josiah Samuel Sathiadass >Priority: Minor > > HiveSparkSubmitSuite fails with NoClassDefFoundError as follows, > Exception in thread "main" java.lang.NoClassDefFoundError: > org.apache.spark.sql.hive.test.TestHiveContext > TestHiveContext class is part of sql_hive jar. Since this jar is not part of > the list of dependencies we encounter this issue. > So by adding the respective jar to the dependency list we should be able to > successfully run all the tests under HiveSparkSubmitSuite . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10725) HiveSparkSubmitSuite fails with NoClassDefFoundError
[ https://issues.apache.org/jira/browse/SPARK-10725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900497#comment-14900497 ] Apache Spark commented on SPARK-10725: -- User 'josiahsams' has created a pull request for this issue: https://github.com/apache/spark/pull/8847 > HiveSparkSubmitSuite fails with NoClassDefFoundError > > > Key: SPARK-10725 > URL: https://issues.apache.org/jira/browse/SPARK-10725 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: IBM Power Servers running Ubuntu 14.04 LE with IBM J9 jdk >Reporter: Josiah Samuel Sathiadass >Priority: Minor > > HiveSparkSubmitSuite fails with NoClassDefFoundError as follows, > Exception in thread "main" java.lang.NoClassDefFoundError: > org.apache.spark.sql.hive.test.TestHiveContext > TestHiveContext class is part of sql_hive jar. Since this jar is not part of > the list of dependencies we encounter this issue. > So by adding the respective jar to the dependency list we should be able to > successfully run all the tests under HiveSparkSubmitSuite . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10725) HiveSparkSubmitSuite fails with NoClassDefFoundError
[ https://issues.apache.org/jira/browse/SPARK-10725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10725: Assignee: Apache Spark > HiveSparkSubmitSuite fails with NoClassDefFoundError > > > Key: SPARK-10725 > URL: https://issues.apache.org/jira/browse/SPARK-10725 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: IBM Power Servers running Ubuntu 14.04 LE with IBM J9 jdk >Reporter: Josiah Samuel Sathiadass >Assignee: Apache Spark >Priority: Minor > > HiveSparkSubmitSuite fails with NoClassDefFoundError as follows, > Exception in thread "main" java.lang.NoClassDefFoundError: > org.apache.spark.sql.hive.test.TestHiveContext > TestHiveContext class is part of sql_hive jar. Since this jar is not part of > the list of dependencies we encounter this issue. > So by adding the respective jar to the dependency list we should be able to > successfully run all the tests under HiveSparkSubmitSuite . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10726) Using dynamic executor allocation, when jobs are submitted in parallel, executors will be removed before tasks finish
[ https://issues.apache.org/jira/browse/SPARK-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10726: Assignee: Apache Spark > Using dynamic executor allocation, when jobs are submitted in parallel, > executors will be removed before tasks finish > --- > > Key: SPARK-10726 > URL: https://issues.apache.org/jira/browse/SPARK-10726 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: KaiXinXIaoLei >Assignee: Apache Spark > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10726) Using dynamic executor allocation, when jobs are submitted in parallel, executors will be removed before tasks finish
[ https://issues.apache.org/jira/browse/SPARK-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10726: Assignee: (was: Apache Spark) > Using dynamic executor allocation, when jobs are submitted in parallel, > executors will be removed before tasks finish > --- > > Key: SPARK-10726 > URL: https://issues.apache.org/jira/browse/SPARK-10726 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: KaiXinXIaoLei > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly
[ https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901609#comment-14901609 ] Jon Buffington commented on SPARK-5569: --- Cody, We were able to work around the issue by destructuring the OffsetRange type into its parts (i.e., string, long, ints). Below is our flow for reference. Our only concern is whether the transform operation runs at the driver, as we use a singleton on the driver to persist the last captured offsets. Can you confirm that the stream transform runs at the driver? {noformat} // Recover or create the stream. val ssc = StreamingContext.getOrCreate(checkpointPath, () => { createStreamingContext(checkpointPath) }) ... def createStreamingContext(checkpointPath: String): StreamingContext = { val ssc = new StreamingContext(SparkConf, Minutes(1)) ssc.checkpoint(checkpointPath) createStream(ssc) ssc } ... def createStream(ssc: StreamingContext): Unit = { ... KafkaUtils.createDirectStream[K, V, KD, VD, R](ssc, kafkaParams, fromOffsets, messageHandler) // "Note that the typecast to HasOffsetRanges will only succeed if it is // done in the first method called on the directKafkaStream, not later down // a chain of methods. You can use transform() instead of foreachRDD() as // your first method call in order to access offsets, then call further // Spark methods. However, be aware that the one-to-one mapping between RDD // partition and Kafka partition does not remain after any methods that // shuffle or repartition, e.g. reduceByKey() or window()." .transform { rdd => ... val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges Ranger.captureOffsetRanges(offsetRanges) // De-structure the OffsetRange type into its parts and save in a singleton. ... rdd } ...
} {noformat} > Checkpoints cannot reference classes defined outside of Spark's assembly > > > Key: SPARK-5569 > URL: https://issues.apache.org/jira/browse/SPARK-5569 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Patrick Wendell > > Not sure if this is a bug or a feature, but it's not obvious, so wanted to > create a JIRA to make sure we document this behavior. > First documented by Cody Koeninger: > https://gist.github.com/koeninger/561a61482cd1b5b3600c > {code} > 15/01/12 16:07:07 INFO CheckpointReader: Attempting to load checkpoint from > file file:/var/tmp/cp/checkpoint-142110041.bk > 15/01/12 16:07:07 WARN CheckpointReader: Error reading checkpoint from file > file:/var/tmp/cp/checkpoint-142110041.bk > java.io.IOException: java.lang.ClassNotFoundException: > org.apache.spark.rdd.kafka.KafkaRDDPartition > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1043) > at > org.apache.spark.streaming.dstream.DStreamCheckpointData.readObject(DStreamCheckpointData.scala:146) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at 
java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$readObject$1.apply$mcV$sp(DStreamGraph.scala:180) > at
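The destructuring workaround described in the comment above can be sketched in plain Python (not Spark code; the class and field names below mirror Kafka's OffsetRange and are assumptions for illustration). Keeping only tuples of builtins out of the application-jar class means nothing persisted needs that class on the deserialization classpath:

```python
# Stand-in for the application-jar class that broke checkpoint recovery.
class OffsetRange:
    def __init__(self, topic, partition, from_offset, until_offset):
        self.topic = topic
        self.partition = partition
        self.from_offset = from_offset
        self.until_offset = until_offset

def destructure(ranges):
    # Reduce each range to a tuple of builtins (str, int, int, int);
    # these survive (de)serialization without the defining class.
    return [(r.topic, r.partition, r.from_offset, r.until_offset)
            for r in ranges]

captured = destructure([OffsetRange("events", 0, 100, 250)])
print(captured)  # [('events', 0, 100, 250)]
```

On recovery, the saved primitives can be fed back as the `fromOffsets` of a new direct stream, so the custom class never needs to round-trip through the checkpoint.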
[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901632#comment-14901632 ] Yin Huai commented on SPARK-10474: -- [~jameszhouyi] [~xjrk] What are your workloads? How large is the data (if the data is generated, can you share the script for data generation)? What is your query? And did you set any Spark configurations? > TungstenAggregation cannot acquire memory for pointer array after switching > to sort-based > - > > Key: SPARK-10474 > URL: https://issues.apache.org/jira/browse/SPARK-10474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.6.0, 1.5.1 > > > In aggregation case, a Lost task happened with below error. > {code} > java.io.IOException: Could not acquire 65536 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Key SQL Query > {code:sql} > INSERT INTO TABLE test_table > SELECT > ss.ss_customer_sk AS cid, > count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1, > count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3, > count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5, > count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7, > count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9, > count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11, > count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13, > count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15, > count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2, > count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4, > count(CASE WHEN i.i_class_id=6 THEN 1 ELSE 
NULL END) AS id6, > count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8, > count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10, > count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14, > count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16 > FROM store_sales ss > INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk > WHERE i.i_category IN ('Books') > AND ss.ss_customer_sk IS NOT NULL > GROUP BY ss.ss_customer_sk > HAVING count(ss.ss_item_sk) > 5 > {code} > Note: > the store_sales is a big fact table and item is a small dimension table.
[jira] [Commented] (SPARK-10310) [Spark SQL] All result records will be populated into ONE line during the script transform due to missing the correct line/field delimiter
[ https://issues.apache.org/jira/browse/SPARK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901774#comment-14901774 ] Apache Spark commented on SPARK-10310: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/8860 > [Spark SQL] All result records will be populated into ONE line during the > script transform due to missing the correct line/field delimiter > -- > > Key: SPARK-10310 > URL: https://issues.apache.org/jira/browse/SPARK-10310 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yi Zhou >Assignee: zhichao-li >Priority: Critical > > There is a real case using a Python stream script in a Spark SQL query. We found > that all result records were written in ONE line as input from the "select" > pipeline for the Python script, so the script will not identify each > record. Also, the field separator in Spark SQL will be '^A' or '\001', which is > inconsistent/incompatible with the '\t' in the Hive implementation. > Key query: > {code:sql} > CREATE VIEW temp1 AS > SELECT * > FROM > ( > FROM > ( > SELECT > c.wcs_user_sk, > w.wp_type, > (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec > FROM web_clickstreams c, web_page w > WHERE c.wcs_web_page_sk = w.wp_web_page_sk > AND c.wcs_web_page_sk IS NOT NULL > AND c.wcs_user_sk IS NOT NULL > AND c.wcs_sales_sk IS NULL --abandoned implies: no sale > DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec > ) clicksAnWebPageType > REDUCE > wcs_user_sk, > tstamp_inSec, > wp_type > USING 'python sessionize.py 3600' > AS ( > wp_type STRING, > tstamp BIGINT, > sessionid STRING) > ) sessionized > {code} > Key Python script: > {noformat} > for line in sys.stdin: > user_sk, tstamp_str, value = line.strip().split("\t") > {noformat} > Sample SELECT result: > {noformat} > ^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview > {noformat} > Expected result: > {noformat} > 31 3237764860 feedback > 31 3237769106 dynamic > 
31 3237779027 review > {noformat}
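The delimiter mismatch described above can be reproduced outside Spark. The following is a minimal sketch (plain Python, not Spark code; the helper name is made up for illustration): the reducer script expects one tab-separated record per line on stdin, but it received all records concatenated on a single line with `'\x01'` (`^A`) as the field separator, so `split("\t")` cannot recover the fields.

```python
# Sketch (not Spark code): the sessionize.py reducer above reads stdin
# line by line and unpacks tab-separated fields.  With Spark SQL's
# '\x01' ("^A") field delimiter, split("\t") returns a single element
# and the unpacking fails.

def parse_expected(line):
    """Hypothetical helper: what the script assumes, 'user\\ttstamp\\tvalue'."""
    user_sk, tstamp_str, value = line.strip().split("\t")
    return user_sk, tstamp_str, value

# A record in the format Hive's script transform would emit:
good = "31\t3237764860\tfeedback\n"
assert parse_expected(good) == ("31", "3237764860", "feedback")

# The same record with Spark SQL's '\x01' field delimiter instead:
bad = "31\x013237764860\x01feedback\n"
try:
    parse_expected(bad)
except ValueError:
    # split("\t") yields one field, so unpacking three names fails
    pass
```

This is why the pull request referenced above makes the transform's line/field delimiters match what the child script expects.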
[jira] [Created] (SPARK-10744) parser error (constant * column is null interpreted as constant * boolean)
N Campbell created SPARK-10744: -- Summary: parser error (constant * column is null interpreted as constant * boolean) Key: SPARK-10744 URL: https://issues.apache.org/jira/browse/SPARK-10744 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: N Campbell Priority: Minor Spark SQL inherits the same defect as Hive, where this statement will not parse/execute. See HIVE-9530 select c1 from t1 where 1 * cnnull is null -vs- select c1 from t1 where (1 * cnnull) is null
[jira] [Commented] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid
[ https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901548#comment-14901548 ] Yin Huai commented on SPARK-10304: -- I dropped 1.5.1 as the target version since this issue does not prevent users from creating a df from a dataset with a valid dir structure (the issue here is that users can create a df from a dir with an invalid structure). > Partition discovery does not throw an exception if the dir structure is > invalid > --- > > Key: SPARK-10304 > URL: https://issues.apache.org/jira/browse/SPARK-10304 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Zhan Zhang > > I have a dir structure like {{/path/table1/partition_column=1/}}. When I try > to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if > it is stored as ORC, there will be the following NPE. But, if it is Parquet, > we even can return rows. We should complain to users about the dir struct > because {{table1}} does not meet our format. 
> {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in > stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 > (TID 3504, 10.0.195.227): java.lang.NullPointerException > at > org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466) > at > org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316) > at > org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > {code}
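The convention the report relies on is that every directory level between the base path and the data files looks like `column=value`. A small sketch of that rule (pure Python; the helper name and the exact validation are illustrative assumptions, not Spark's actual partition-discovery implementation) shows why `load("/path/")` should be rejected: the first level, `table1`, is not a key=value pair.

```python
# Sketch of the directory-layout rule described above: partition
# discovery expects each path component under the base path to be a
# "column=value" directory.  (Hypothetical validator, for illustration.)
import re

_PARTITION_DIR = re.compile(r"^[^=/]+=[^/]*$")

def valid_partition_layout(base, leaf_dir):
    """Return True if every component between base and leaf_dir
    follows the key=value convention."""
    rel = leaf_dir[len(base):].strip("/")
    if not rel:
        return True
    return all(_PARTITION_DIR.match(part) for part in rel.split("/"))

# load("/path/table1/") would see a valid layout:
assert valid_partition_layout("/path/table1", "/path/table1/partition_column=1")

# load("/path/") sees "table1" as a partition level, which is invalid:
assert not valid_partition_layout("/path", "/path/table1/partition_column=1")
```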
[jira] [Updated] (SPARK-10649) Streaming jobs unexpectedly inherits job group, job descriptions from context starting thread
[ https://issues.apache.org/jira/browse/SPARK-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10649: -- Target Version/s: 1.6.0, 1.5.1 (was: 1.6.0) > Streaming jobs unexpectedly inherits job group, job descriptions from context > starting thread > - > > Key: SPARK-10649 > URL: https://issues.apache.org/jira/browse/SPARK-10649 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.3.1, 1.4.1, 1.5.0 >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 1.6.0 > > > The job group and job description information is passed through thread > local properties and gets inherited by child threads. In the case of Spark > Streaming, the streaming jobs inherit these properties from the thread that > called streamingContext.start(). This may not make sense. > 1. Job group: This is mainly used for cancelling a group of jobs together. It > does not make sense to cancel streaming jobs like this, as the effect will be > unpredictable. And it's not a valid use case anyway; to cancel a streaming > context, call streamingContext.stop(). > 2. Job description: This is used to pass on nice text descriptions for jobs > to show up in the UI. The job description of the thread that calls > streamingContext.start() is not useful for all the streaming jobs, as it does > not make sense for all of the streaming jobs to have the same description, > and the description may or may not be related to streaming.
[jira] [Resolved] (SPARK-10649) Streaming jobs unexpectedly inherits job group, job descriptions from context starting thread
[ https://issues.apache.org/jira/browse/SPARK-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-10649. --- Resolution: Fixed Fix Version/s: 1.6.0 > Streaming jobs unexpectedly inherits job group, job descriptions from context > starting thread > - > > Key: SPARK-10649 > URL: https://issues.apache.org/jira/browse/SPARK-10649 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.3.1, 1.4.1, 1.5.0 >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 1.6.0 > > > The job group and job description information is passed through thread > local properties and gets inherited by child threads. In the case of Spark > Streaming, the streaming jobs inherit these properties from the thread that > called streamingContext.start(). This may not make sense. > 1. Job group: This is mainly used for cancelling a group of jobs together. It > does not make sense to cancel streaming jobs like this, as the effect will be > unpredictable. And it's not a valid use case anyway; to cancel a streaming > context, call streamingContext.stop(). > 2. Job description: This is used to pass on nice text descriptions for jobs > to show up in the UI. The job description of the thread that calls > streamingContext.start() is not useful for all the streaming jobs, as it does > not make sense for all of the streaming jobs to have the same description, > and the description may or may not be related to streaming.
[jira] [Assigned] (SPARK-10743) keep the name of expression if possible when do cast
[ https://issues.apache.org/jira/browse/SPARK-10743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10743: Assignee: (was: Apache Spark) > keep the name of expression if possible when do cast > > > Key: SPARK-10743 > URL: https://issues.apache.org/jira/browse/SPARK-10743 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >
[jira] [Commented] (SPARK-10743) keep the name of expression if possible when do cast
[ https://issues.apache.org/jira/browse/SPARK-10743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901768#comment-14901768 ] Apache Spark commented on SPARK-10743: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/8859 > keep the name of expression if possible when do cast > > > Key: SPARK-10743 > URL: https://issues.apache.org/jira/browse/SPARK-10743 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >
[jira] [Assigned] (SPARK-10743) keep the name of expression if possible when do cast
[ https://issues.apache.org/jira/browse/SPARK-10743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10743: Assignee: Apache Spark > keep the name of expression if possible when do cast > > > Key: SPARK-10743 > URL: https://issues.apache.org/jira/browse/SPARK-10743 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark >
[jira] [Updated] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid
[ https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10304: - Target Version/s: 1.6.0 (was: 1.6.0, 1.5.1) > Partition discovery does not throw an exception if the dir structure is > invalid > --- > > Key: SPARK-10304 > URL: https://issues.apache.org/jira/browse/SPARK-10304 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Zhan Zhang > > I have a dir structure like {{/path/table1/partition_column=1/}}. When I try > to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if > it is stored as ORC, there will be the following NPE. But, if it is Parquet, > we even can return rows. We should complain to users about the dir struct > because {{table1}} does not meet our format. > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in > stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 > (TID 3504, 10.0.195.227): java.lang.NullPointerException > at > org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466) > at > org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > 
org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316) > at > org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > {code}
[jira] [Updated] (SPARK-10739) Add attempt window for long running Spark application on Yarn
[ https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-10739: Description: Currently Spark on YARN uses max attempts to control the failure count: if an application's failure count reaches the max attempts, the application will not be recovered by the RM. This is not very effective for long-running applications, which can easily exceed the limit over a long period, while setting a very large max attempts hides real problems. So here we introduce an attempt window to control the application attempts: failures outside the window are ignored. The window was introduced in Hadoop 2.6+ to support long-running applications and is quite useful for Spark Streaming and Spark shell-like applications. was: Currently Spark on Yarn uses max attempts to control the failure number, if application's failure number reaches to the max attempts, application will not be recovered by RM, it is not very for long running applications, since it will easily exceed the max number, also setting a very large max attempts will hide the real problem. So here introduce an attempt window to control the application attempt times, this will ignore the out of window attempts, it is introduced in Hadoop 2.6+ to support long running application, it is quite useful for Spark Streaming, Spark shell like applications. 
> Add attempt window for long running Spark application on Yarn > - > > Key: SPARK-10739 > URL: https://issues.apache.org/jira/browse/SPARK-10739 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: Saisai Shao >Priority: Minor > > Currently Spark on YARN uses max attempts to control the failure count: if > an application's failure count reaches the max attempts, the application will > not be recovered by the RM. This is not very effective for long-running > applications, which can easily exceed the limit over a long period, while > setting a very large max attempts hides real problems. > So here we introduce an attempt window to control the application attempts: > failures outside the window are ignored. The window was introduced in > Hadoop 2.6+ to support long-running applications and is quite useful for > Spark Streaming and Spark shell-like applications.
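The windowing idea proposed above can be illustrated with a small sketch (pure Python; the function name, arguments, and thresholds are made up for illustration and are not the YARN implementation): only failures whose timestamps fall inside the trailing validity window count against the max-attempts limit, so occasional failures spread across a long run no longer accumulate toward the cap.

```python
# Illustration of the attempt-window idea described above (hypothetical
# helper, not the YARN code): count only failures inside the trailing
# window against the max-attempts limit.

def should_give_up(failure_times, now, max_attempts, window):
    """True if recent failures alone reach the limit.
    window=None reproduces the old behavior: every failure ever counts."""
    if window is None:
        recent = failure_times
    else:
        recent = [t for t in failure_times if now - t <= window]
    return len(recent) >= max_attempts

failures = [100, 5000, 9000, 9500]   # seconds since app start

# Old behavior: 4 lifetime failures reach a max_attempts of 4 -> give up.
assert should_give_up(failures, now=9600, max_attempts=4, window=None)

# Windowed: only the failures in the last 1000s (9000, 9500) count.
assert not should_give_up(failures, now=9600, max_attempts=4, window=1000)
```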
[jira] [Commented] (SPARK-10649) Streaming jobs unexpectedly inherits job group, job descriptions from context starting thread
[ https://issues.apache.org/jira/browse/SPARK-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901645#comment-14901645 ] Apache Spark commented on SPARK-10649: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/8856 > Streaming jobs unexpectedly inherits job group, job descriptions from context > starting thread > - > > Key: SPARK-10649 > URL: https://issues.apache.org/jira/browse/SPARK-10649 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.3.1, 1.4.1, 1.5.0 >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 1.6.0 > > > The job group and job description information is passed through thread > local properties and gets inherited by child threads. In the case of Spark > Streaming, the streaming jobs inherit these properties from the thread that > called streamingContext.start(). This may not make sense. > 1. Job group: This is mainly used for cancelling a group of jobs together. It > does not make sense to cancel streaming jobs like this, as the effect will be > unpredictable. And it's not a valid use case anyway; to cancel a streaming > context, call streamingContext.stop(). > 2. Job description: This is used to pass on nice text descriptions for jobs > to show up in the UI. The job description of the thread that calls > streamingContext.start() is not useful for all the streaming jobs, as it does > not make sense for all of the streaming jobs to have the same description, > and the description may or may not be related to streaming.
[jira] [Updated] (SPARK-10738) Refactoring `Instance` out from LOR and LIR, and also cleaning up some code
[ https://issues.apache.org/jira/browse/SPARK-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10738: -- Shepherd: Xiangrui Meng > Refactoring `Instance` out from LOR and LIR, and also cleaning up some code > --- > > Key: SPARK-10738 > URL: https://issues.apache.org/jira/browse/SPARK-10738 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.5.0 >Reporter: DB Tsai >Assignee: DB Tsai > Fix For: 1.6.0 > > > Refactoring `Instance` case class out from LOR and LIR, and also cleaning up > some code.
[jira] [Commented] (SPARK-10692) Failed batches are never reported through the StreamingListener interface
[ https://issues.apache.org/jira/browse/SPARK-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901695#comment-14901695 ] Tathagata Das commented on SPARK-10692: --- Yes, that is an independent problem; we need to think about that separately. At the least, we now need to expose them through StreamingListener so that apps that do not immediately exit on one failure can generate the streaming UI correctly. > Failed batches are never reported through the StreamingListener interface > - > > Key: SPARK-10692 > URL: https://issues.apache.org/jira/browse/SPARK-10692 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.3.1, 1.4.1, 1.5.0 >Reporter: Tathagata Das >Assignee: Shixiong Zhu >Priority: Blocker > > If an output operation fails, then the corresponding batch is never marked as > completed, as the data structures are not updated properly. > https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L183
[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901694#comment-14901694 ] Andrew Or commented on SPARK-10474: --- Also, would you mind testing out this commit to see if it fixes your problem? https://github.com/andrewor14/spark/commits/aggregate-test > TungstenAggregation cannot acquire memory for pointer array after switching > to sort-based > - > > Key: SPARK-10474 > URL: https://issues.apache.org/jira/browse/SPARK-10474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.6.0, 1.5.1 > > > In aggregation case, a Lost task happened with below error. > {code} > java.io.IOException: Could not acquire 65536 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Key SQL Query > {code:sql} > INSERT INTO TABLE test_table > SELECT > ss.ss_customer_sk AS cid, > count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1, > count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3, > count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5, > count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7, > count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9, > count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11, > count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13, > count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15, > count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2, > count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4, > count(CASE WHEN i.i_class_id=6 THEN 1 ELSE 
NULL END) AS id6, > count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8, > count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10, > count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14, > count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16 > FROM store_sales ss > INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk > WHERE i.i_category IN ('Books') > AND ss.ss_customer_sk IS NOT NULL > GROUP BY ss.ss_customer_sk > HAVING count(ss.ss_item_sk) > 5 > {code} > Note: > the store_sales is a big fact table and item is a small dimension table.
[jira] [Commented] (SPARK-10723) Add RDD.reduceOption method
[ https://issues.apache.org/jira/browse/SPARK-10723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901723#comment-14901723 ] Tatsuya Atsumi commented on SPARK-10723: Thanks for the comment and advice. RDD.fold seems to be suitable for my case. Thank you. > Add RDD.reduceOption method > --- > > Key: SPARK-10723 > URL: https://issues.apache.org/jira/browse/SPARK-10723 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Reporter: Tatsuya Atsumi >Priority: Minor > > h2. Problem > RDD.reduce throws exception if the RDD is empty. > It is appropriate behavior if RDD is expected to be not empty, but if it is > not sure until runtime that the RDD is empty or not, it needs to wrap with > try-catch to call reduce safely. > Example Code > {code} > // This RDD may be empty or not > val rdd: RDD[Int] = originalRdd.filter(_ > 10) > val reduced: Option[Int] = try { > Some(rdd.reduce(_ + _)) > } catch { > // if rdd is empty return None. > case e:UnsupportedOperationException => None > } > {code} > h2. Improvement idea > Scala’s List has reduceOption method, which returns None if List is empty. > If RDD has reduceOption API like Scala’s List, it will become easy to handle > above case. > Example Code > {code} > val reduced: Option[Int] = originalRdd.filter(_ > 10).reduceOption(_ + _) > {code}
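For reference, the reduceOption semantics requested in this issue are easy to sketch outside Spark. A minimal plain-Python analogue (the helper name is hypothetical, not a Spark or PySpark API; Scala's List.reduceOption behaves the same way) returns None for an empty collection instead of raising:

```python
# Sketch of the reduceOption semantics discussed above, in plain Python:
# empty input yields None instead of an exception, mirroring Scala's
# List.reduceOption.  (Hypothetical helper, not a Spark API.)
from functools import reduce

def reduce_option(items, f):
    it = iter(items)
    try:
        acc = next(it)           # first element becomes the accumulator
    except StopIteration:
        return None              # empty input -> None, no exception
    return reduce(f, it, acc)

# Non-empty after filtering: reduces normally.
assert reduce_option([x for x in [5, 20, 30] if x > 10], lambda a, b: a + b) == 50

# Empty after filtering: None instead of an exception.
assert reduce_option([x for x in [1, 2] if x > 10], lambda a, b: a + b) is None
```

As noted in the comment, `RDD.fold` with a neutral zero value covers many of the same cases without a new API method.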
[jira] [Created] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working
Ian created SPARK-10741: --- Summary: Hive Query Having/OrderBy against Parquet table is not working Key: SPARK-10741 URL: https://issues.apache.org/jira/browse/SPARK-10741 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Ian Failed Query with Having Clause def testParquetHaving() { val ddl = """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val failedHaving = """ SELECT c1, avg ( c2 ) as c_avg | FROM test | GROUP BY c1 | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin TestHive.sql(ddl) TestHive.sql(failedHaving).collect } org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as bigint)) > cast(5 as double)) as boolean) AS havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
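A possible workaround until this is fixed (a sketch only, not verified against 1.5.0) is to compute the aggregate in a subquery and filter/order on its alias, so the analyzer does not need to re-resolve {{avg(c2)}} inside the HAVING/ORDER BY clause:

```scala
// Hypothetical workaround sketch: move the aggregate into a subquery and
// reference its alias instead of repeating avg(c2) in HAVING/ORDER BY.
// Assumes the same TestHive session and "test" table as the report above.
val workaround =
  """ SELECT c1, c_avg
    |   FROM (SELECT c1, avg(c2) AS c_avg
    |           FROM test
    |          GROUP BY c1) t
    |  WHERE c_avg > 5
    |  ORDER BY c1""".stripMargin
TestHive.sql(workaround).collect
```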
[jira] [Updated] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid
[ https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10304: - Priority: Major (was: Critical) > Partition discovery does not throw an exception if the dir structure is > invalid > --- > > Key: SPARK-10304 > URL: https://issues.apache.org/jira/browse/SPARK-10304 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Zhan Zhang > > I have a dir structure like {{/path/table1/partition_column=1/}}. When I try > to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if > it is stored as ORC, there will be the following NPE. But, if it is Parquet, > we even can return rows. We should complain to users about the dir struct > because {{table1}} does not meet our format. > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in > stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 > (TID 3504, 10.0.195.227): java.lang.NullPointerException > at > org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466) > at > org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > 
org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316) > at > org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10740) handle nondeterministic expressions correctly for set operations
Wenchen Fan created SPARK-10740: --- Summary: handle nondeterministic expressions correctly for set operations Key: SPARK-10740 URL: https://issues.apache.org/jira/browse/SPARK-10740 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10739) Add attempt window for long running Spark application on Yarn
[ https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10739: Assignee: Apache Spark > Add attempt window for long running Spark application on Yarn > - > > Key: SPARK-10739 > URL: https://issues.apache.org/jira/browse/SPARK-10739 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: Saisai Shao >Assignee: Apache Spark >Priority: Minor > > Currently Spark on YARN uses a maximum attempt count to bound failures: once an application's failure count reaches the maximum, the application will no longer be recovered by the RM. This is not very effective for long running applications, which can easily exceed the maximum over a long enough period, while setting a very large maximum hides real problems. > This introduces an attempt window to control how attempts are counted, ignoring attempts that fall outside the window. The window was introduced in Hadoop 2.6+ to support long running applications, and it is quite useful for Spark Streaming, Spark shell and similar applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
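Usage of such an attempt window might look like the following spark-submit sketch. The interval property name is the one proposed in this issue's patch and should be verified against the merged version; the application class and jar are hypothetical:

```shell
# Sketch: only AM failures within the validity interval count toward
# YARN's max attempts, so an occasional failure over a long run does
# not permanently exhaust the attempt budget.
spark-submit \
  --master yarn-cluster \
  --conf spark.yarn.maxAppAttempts=4 \
  --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
  --class com.example.StreamingApp \
  app.jar
```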
[jira] [Resolved] (SPARK-10495) For json data source, date values are saved as int strings
[ https://issues.apache.org/jira/browse/SPARK-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-10495. Resolution: Fixed Fix Version/s: 1.5.1 1.6.0 Issue resolved by pull request 8806 [https://github.com/apache/spark/pull/8806] > For json data source, date values are saved as int strings > -- > > Key: SPARK-10495 > URL: https://issues.apache.org/jira/browse/SPARK-10495 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > Fix For: 1.6.0, 1.5.1 > > > {code} > val df = Seq((1, java.sql.Date.valueOf("1900-01-01"))).toDF("i", "j") > df.write.format("json").save("/tmp/testJson") > sc.textFile("/tmp/testJson").collect.foreach(println) > {"i":1,"j":"-25567"} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
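For builds without the fix, one possible mitigation (a sketch, assuming the standard {{Column.cast}} API) is to cast DateType columns to strings explicitly before writing JSON, so the textual date form is preserved instead of the internal day count:

```scala
// Workaround sketch: materialize the date as a string column before
// the JSON write, avoiding the buggy serialization of DateType.
val df = Seq((1, java.sql.Date.valueOf("1900-01-01"))).toDF("i", "j")
df.withColumn("j", df("j").cast("string"))
  .write.format("json").save("/tmp/testJsonFixed")
// Expected line: {"i":1,"j":"1900-01-01"}
```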
[jira] [Created] (SPARK-10747) add support for window specification to include how NULLS are ordered
N Campbell created SPARK-10747: -- Summary: add support for window specification to include how NULLS are ordered Key: SPARK-10747 URL: https://issues.apache.org/jira/browse/SPARK-10747 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.0 Reporter: N Campbell You cannot express how NULLs are to be sorted in the window order specification, so a compensating expression has to be used to simulate it. Error: org.apache.spark.sql.AnalysisException: line 1:76 missing ) at 'nulls' near 'nulls' line 1:82 missing EOF at 'last' near 'nulls'; SQLState: null (Same limitation as Hive, reported in Apache JIRA HIVE-9535.) This fails: select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by c3 desc nulls last) from tolap The compensating workaround: select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by case when c3 is null then 1 else 0 end) from tolap -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
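The compensating expression above orders only on the null flag and drops the descending order on {{c3}} itself. To simulate {{ORDER BY c3 DESC NULLS LAST}} more faithfully, the flag can be combined with the original sort key (a sketch against the {{tolap}} table from this report):

```sql
-- Sketch: sort on a null flag first (non-null rows before null rows),
-- then on c3 descending, approximating "c3 DESC NULLS LAST".
SELECT rnum, c1, c2, c3,
       dense_rank() OVER (PARTITION BY c1
                          ORDER BY CASE WHEN c3 IS NULL THEN 1 ELSE 0 END,
                                   c3 DESC)
FROM tolap
```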
[jira] [Updated] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working
[ https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian updated SPARK-10741: Description: Failed Query with Having Clause {code} def testParquetHaving() { val ddl = """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val failedHaving = """ SELECT c1, avg ( c2 ) as c_avg | FROM test | GROUP BY c1 | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin TestHive.sql(ddl) TestHive.sql(failedHaving).collect } org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as bigint)) > cast(5 as double)) as boolean) AS havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) {code} Failed Query with OrderBy {code} val ddl = """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val filedOrderBy = """ SELECT c1, avg ( c2 ) c_avg | FROM test | GROUP BY c1 | ORDER BY avg ( c2 )""".stripMargin TestHive.sql(ddl) TestHive.sql(filedOrderBy).collect org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) {code} was: Failed Query with Having Clause {code} def testParquetHaving() { val ddl = """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int 
) STORED AS PARQUET""" val failedHaving = """ SELECT c1, avg ( c2 ) as c_avg | FROM test | GROUP BY c1 | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin TestHive.sql(ddl) TestHive.sql(failedHaving).collect } {code} org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as bigint)) > cast(5 as double)) as boolean) AS havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > Hive Query Having/OrderBy against Parquet table is not working > --- > > Key: SPARK-10741 > URL: https://issues.apache.org/jira/browse/SPARK-10741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Ian > > Failed Query with Having Clause > {code} > def testParquetHaving() { > val ddl = > """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS > PARQUET""" > val failedHaving = > """ SELECT c1, avg ( c2 ) as c_avg > | FROM test > | GROUP BY c1 > | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin > TestHive.sql(ddl) > TestHive.sql(failedHaving).collect > } > org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing > from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as > bigint)) > cast(5 as double)) as boolean) AS > havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > {code} > Failed Query with OrderBy > {code} > val ddl = > """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS > PARQUET""" > val filedOrderBy = > """ SELECT c1, avg ( c2 ) c_avg > | FROM test > | GROUP BY c1 >
[jira] [Updated] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working
[ https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian updated SPARK-10741: Description: Failed Query with Having Clause {code} def testParquetHaving() { val ddl = """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val failedHaving = """ SELECT c1, avg ( c2 ) as c_avg | FROM test | GROUP BY c1 | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin TestHive.sql(ddl) TestHive.sql(failedHaving).collect } {code} org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as bigint)) > cast(5 as double)) as boolean) AS havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) was: Failed Query with Having Clause def testParquetHaving() { val ddl = """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET""" val failedHaving = """ SELECT c1, avg ( c2 ) as c_avg | FROM test | GROUP BY c1 | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin TestHive.sql(ddl) TestHive.sql(failedHaving).collect } org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as bigint)) > cast(5 as double)) as boolean) AS havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > Hive Query Having/OrderBy against Parquet table is not working > --- > > Key: SPARK-10741 > URL: https://issues.apache.org/jira/browse/SPARK-10741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Ian > > Failed Query with Having Clause > {code} > def testParquetHaving() { > val ddl = > """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS > PARQUET""" > val failedHaving = > """ SELECT c1, avg ( c2 ) as c_avg > | FROM test > | GROUP BY c1 > | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin > TestHive.sql(ddl) > TestHive.sql(failedHaving).collect > } > {code} > org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing > from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as > bigint)) > cast(5 as double)) as boolean) AS > havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10742) Add the ability to embed HTML relative links in job descriptions
Tathagata Das created SPARK-10742: - Summary: Add the ability to embed HTML relative links in job descriptions Key: SPARK-10742 URL: https://issues.apache.org/jira/browse/SPARK-10742 Project: Spark Issue Type: Improvement Reporter: Tathagata Das Assignee: Tathagata Das Priority: Minor This is to allow embedding links to other Spark UI tabs within the job description. For example, streaming jobs could set descriptions with links pointing to the corresponding details page of the batch that the job belongs to. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10495) For json data source, date values are saved as int strings
[ https://issues.apache.org/jira/browse/SPARK-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901847#comment-14901847 ] Apache Spark commented on SPARK-10495: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/8861 > For json data source, date values are saved as int strings > -- > > Key: SPARK-10495 > URL: https://issues.apache.org/jira/browse/SPARK-10495 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > Fix For: 1.6.0, 1.5.1 > > > {code} > val df = Seq((1, java.sql.Date.valueOf("1900-01-01"))).toDF("i", "j") > df.write.format("json").save("/tmp/testJson") > sc.textFile("/tmp/testJson").collect.foreach(println) > {"i":1,"j":"-25567"} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10740) handle nondeterministic expressions correctly for set operations
[ https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901676#comment-14901676 ] Apache Spark commented on SPARK-10740: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/8858 > handle nondeterministic expressions correctly for set operations > > > Key: SPARK-10740 > URL: https://issues.apache.org/jira/browse/SPARK-10740 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > > We should only push deterministic filter conditions down to set operators. > For Union, say we apply a non-deterministic filter to 1...5 union 1...5 and get 1,3 from the left side and 2,4 from the right; the result should be 1,3,2,4. If we push this filter down, we get 1,3 from both sides (a new random object with the same seed is created on each side) and the result would be 1,3,1,3. > For Intersect, say a non-deterministic condition accepts each row with probability 0.5 and a row is present on both sides of the Intersect. Once we push this condition down, the probability of accepting that row drops to 0.25. > For Except, say a row is present on both sides of the Except; it should not be in the final output. However, if we push down a non-deterministic condition, it is possible that the row is rejected on one side only, and we would output a row that should not be part of the result. > Likewise, we should only push deterministic projections down to Union. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
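The Union case can be illustrated with plain Scala: re-creating a random generator with the same seed on each side of the union reproduces the same accept/reject pattern on both sides, which is exactly what a pushed-down non-deterministic filter would do:

```scala
import scala.util.Random

// Evaluate a non-deterministic filter over a list, re-seeding the
// generator each time -- mimicking how a pushed-down filter would
// create a new random object with the same seed on each branch.
def ndFilter(xs: List[Int], seed: Long): List[Int] = {
  val rng = new Random(seed)
  xs.filter(_ => rng.nextBoolean())
}

val left  = (1 to 5).toList
val right = (1 to 5).toList

// Correct semantics: one generator over the whole unioned input,
// so the two halves see independent accept/reject decisions.
val correct = ndFilter(left ++ right, 42L)

// Pushed-down semantics: a fresh same-seeded generator per side,
// so both sides keep elements at identical positions (duplicated
// pattern, e.g. "1,3,1,3" instead of "1,3,2,4").
val pushed = ndFilter(left, 42L) ++ ndFilter(right, 42L)
```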
[jira] [Assigned] (SPARK-10740) handle nondeterministic expressions correctly for set operations
[ https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10740: Assignee: (was: Apache Spark) > handle nondeterministic expressions correctly for set operations > > > Key: SPARK-10740 > URL: https://issues.apache.org/jira/browse/SPARK-10740 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > > We should only push deterministic filter conditions down to set operators. > For Union, say we apply a non-deterministic filter to 1...5 union 1...5 and get 1,3 from the left side and 2,4 from the right; the result should be 1,3,2,4. If we push this filter down, we get 1,3 from both sides (a new random object with the same seed is created on each side) and the result would be 1,3,1,3. > For Intersect, say a non-deterministic condition accepts each row with probability 0.5 and a row is present on both sides of the Intersect. Once we push this condition down, the probability of accepting that row drops to 0.25. > For Except, say a row is present on both sides of the Except; it should not be in the final output. However, if we push down a non-deterministic condition, it is possible that the row is rejected on one side only, and we would output a row that should not be part of the result. > Likewise, we should only push deterministic projections down to Union. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10740) handle nondeterministic expressions correctly for set operations
[ https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10740: Assignee: Apache Spark > handle nondeterministic expressions correctly for set operations > > > Key: SPARK-10740 > URL: https://issues.apache.org/jira/browse/SPARK-10740 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > > We should only push deterministic filter conditions down to set operators. > For Union, say we apply a non-deterministic filter to 1...5 union 1...5 and get 1,3 from the left side and 2,4 from the right; the result should be 1,3,2,4. If we push this filter down, we get 1,3 from both sides (a new random object with the same seed is created on each side) and the result would be 1,3,1,3. > For Intersect, say a non-deterministic condition accepts each row with probability 0.5 and a row is present on both sides of the Intersect. Once we push this condition down, the probability of accepting that row drops to 0.25. > For Except, say a row is present on both sides of the Except; it should not be in the final output. However, if we push down a non-deterministic condition, it is possible that the row is rejected on one side only, and we would output a row that should not be part of the result. > Likewise, we should only push deterministic projections down to Union. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10652) Set meaningful job descriptions for streaming related jobs
[ https://issues.apache.org/jira/browse/SPARK-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10652: -- Target Version/s: 1.6.0, 1.5.1 (was: 1.6.0) > Set meaningful job descriptions for streaming related jobs > -- > > Key: SPARK-10652 > URL: https://issues.apache.org/jira/browse/SPARK-10652 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.4.1, 1.5.0 >Reporter: Tathagata Das >Assignee: Tathagata Das > > Job descriptions will help distinguish jobs of one batch from the other in > the Jobs and Stages pages in the Spark UI -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10652) Set meaningful job descriptions for streaming related jobs
[ https://issues.apache.org/jira/browse/SPARK-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10652: -- Summary: Set meaningful job descriptions for streaming related jobs (was: Set good job descriptions for streaming related jobs) > Set meaningful job descriptions for streaming related jobs > -- > > Key: SPARK-10652 > URL: https://issues.apache.org/jira/browse/SPARK-10652 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.4.1, 1.5.0 >Reporter: Tathagata Das >Assignee: Tathagata Das > > Job descriptions will help distinguish jobs of one batch from the other in > the Jobs and Stages pages in the Spark UI -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10742) Add the ability to embed HTML relative links in job descriptions
[ https://issues.apache.org/jira/browse/SPARK-10742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901763#comment-14901763 ] Apache Spark commented on SPARK-10742: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/8791 > Add the ability to embed HTML relative links in job descriptions > > > Key: SPARK-10742 > URL: https://issues.apache.org/jira/browse/SPARK-10742 > Project: Spark > Issue Type: Improvement >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Minor > > This is to allow embedding links to other Spark UI tabs within the job > description. For example, streaming jobs could set descriptions with links > pointing to the corresponding details page of the batch that the job belongs > to. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10742) Add the ability to embed HTML relative links in job descriptions
[ https://issues.apache.org/jira/browse/SPARK-10742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10742: Assignee: Tathagata Das (was: Apache Spark) > Add the ability to embed HTML relative links in job descriptions > > > Key: SPARK-10742 > URL: https://issues.apache.org/jira/browse/SPARK-10742 > Project: Spark > Issue Type: Improvement >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Minor > > This is to allow embedding links to other Spark UI tabs within the job > description. For example, streaming jobs could set descriptions with links > pointing to the corresponding details page of the batch that the job belongs > to. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10737) When using UnsafeRows, SortMergeJoin may return wrong results
[ https://issues.apache.org/jira/browse/SPARK-10737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10737: - Summary: When using UnsafeRows, SortMergeJoin may return wrong results (was: When using UnsafeRow, SortMergeJoin may return wrong results) > When using UnsafeRows, SortMergeJoin may return wrong results > - > > Key: SPARK-10737 > URL: https://issues.apache.org/jira/browse/SPARK-10737 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > > {code} > val df1 = (1 to 10).map(i => (s"str_$i", i)).toDF("i", "j") > val df2 = > df1 > .join(df1.select(df1("i")), "i") > .select(df1("i"), df1("j")) > val df3 = df2.withColumnRenamed("i", "i1").withColumnRenamed("j", "j1") > val df4 = > df2 > .join(df3, df2("i") === df3("i1")) > .withColumn("diff", $"j" - $"j1") > df4.show(100, false) > +--+---+--+---++ > |i |j |i1|j1 |diff| > +--+---+--+---++ > |str_2 |2 |str_2 |2 |0 | > |str_7 |7 |str_2 |2 |5 | > |str_10|10 |str_10|10 |0 | > |str_3 |3 |str_3 |3 |0 | > |str_8 |8 |str_3 |3 |5 | > |str_4 |4 |str_4 |4 |0 | > |str_9 |9 |str_4 |4 |5 | > |str_5 |5 |str_5 |5 |0 | > |str_1 |1 |str_1 |1 |0 | > |str_6 |6 |str_1 |1 |5 | > +--+---+--+---++ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
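Until a fix is released, a commonly suggested mitigation for UnsafeRow-related correctness bugs in 1.5.0 is to fall back to the non-Tungsten code path. This is a sketch only: it trades performance for correctness, and the configuration key is assumed to apply to this particular bug (verify against your build):

```scala
// Mitigation sketch: disable Tungsten/UnsafeRow execution so joins use
// the older, non-Unsafe row representation.
sqlContext.setConf("spark.sql.tungsten.enabled", "false")
```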
[jira] [Assigned] (SPARK-10737) When using UnsafeRows, SortMergeJoin may return wrong results
[ https://issues.apache.org/jira/browse/SPARK-10737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10737: Assignee: Apache Spark (was: Yin Huai) > When using UnsafeRows, SortMergeJoin may return wrong results > - > > Key: SPARK-10737 > URL: https://issues.apache.org/jira/browse/SPARK-10737 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Blocker > > {code} > val df1 = (1 to 10).map(i => (s"str_$i", i)).toDF("i", "j") > val df2 = > df1 > .join(df1.select(df1("i")), "i") > .select(df1("i"), df1("j")) > val df3 = df2.withColumnRenamed("i", "i1").withColumnRenamed("j", "j1") > val df4 = > df2 > .join(df3, df2("i") === df3("i1")) > .withColumn("diff", $"j" - $"j1") > df4.show(100, false) > +--+---+--+---++ > |i |j |i1|j1 |diff| > +--+---+--+---++ > |str_2 |2 |str_2 |2 |0 | > |str_7 |7 |str_2 |2 |5 | > |str_10|10 |str_10|10 |0 | > |str_3 |3 |str_3 |3 |0 | > |str_8 |8 |str_3 |3 |5 | > |str_4 |4 |str_4 |4 |0 | > |str_9 |9 |str_4 |4 |5 | > |str_5 |5 |str_5 |5 |0 | > |str_1 |1 |str_1 |1 |0 | > |str_6 |6 |str_1 |1 |5 | > +--+---+--+---++ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10737) When using UnsafeRows, SortMergeJoin may return wrong results
[ https://issues.apache.org/jira/browse/SPARK-10737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901499#comment-14901499 ] Apache Spark commented on SPARK-10737: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/8854 > When using UnsafeRows, SortMergeJoin may return wrong results > - > > Key: SPARK-10737 > URL: https://issues.apache.org/jira/browse/SPARK-10737 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > > {code} > val df1 = (1 to 10).map(i => (s"str_$i", i)).toDF("i", "j") > val df2 = > df1 > .join(df1.select(df1("i")), "i") > .select(df1("i"), df1("j")) > val df3 = df2.withColumnRenamed("i", "i1").withColumnRenamed("j", "j1") > val df4 = > df2 > .join(df3, df2("i") === df3("i1")) > .withColumn("diff", $"j" - $"j1") > df4.show(100, false) > +--+---+--+---++ > |i |j |i1|j1 |diff| > +--+---+--+---++ > |str_2 |2 |str_2 |2 |0 | > |str_7 |7 |str_2 |2 |5 | > |str_10|10 |str_10|10 |0 | > |str_3 |3 |str_3 |3 |0 | > |str_8 |8 |str_3 |3 |5 | > |str_4 |4 |str_4 |4 |0 | > |str_9 |9 |str_4 |4 |5 | > |str_5 |5 |str_5 |5 |0 | > |str_1 |1 |str_1 |1 |0 | > |str_6 |6 |str_1 |1 |5 | > +--+---+--+---++ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10718) Update License on conf files and corresponding excludes file
[ https://issues.apache.org/jira/browse/SPARK-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rekha Joshi updated SPARK-10718: Summary: Update License on conf files and corresponding excludes file (was: Check License should not verify conf files for license) > Update License on conf files and corresponding excludes file > > > Key: SPARK-10718 > URL: https://issues.apache.org/jira/browse/SPARK-10718 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.0 >Reporter: Rekha Joshi >Priority: Minor > > Check License should not verify conf files for license > {code} > Apache license header missing from multiple script and required files > Could not find Apache license headers in the following files: > !? <>spark/conf/spark-defaults.conf > [error] running <>spark/dev/check-license ; received return code 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10718) Update License on conf files and corresponding excludes file
[ https://issues.apache.org/jira/browse/SPARK-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rekha Joshi updated SPARK-10718: Description: Update License on conf files and corresponding excludes file update. {code} Apache license header missing from multiple script and required files Could not find Apache license headers in the following files: !? <>spark/conf/spark-defaults.conf [error] running <>spark/dev/check-license ; received return code 1 {code} was: Check License should not verify conf files for license {code} Apache license header missing from multiple script and required files Could not find Apache license headers in the following files: !? <>spark/conf/spark-defaults.conf [error] running <>spark/dev/check-license ; received return code 1 {code} > Update License on conf files and corresponding excludes file > > > Key: SPARK-10718 > URL: https://issues.apache.org/jira/browse/SPARK-10718 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.0 >Reporter: Rekha Joshi >Priority: Minor > > Update License on conf files and corresponding excludes file update. > {code} > Apache license header missing from multiple script and required files > Could not find Apache license headers in the following files: > !? <>spark/conf/spark-defaults.conf > [error] running <>spark/dev/check-license ; received return code 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10739) Add attempt window for long running Spark application on Yarn
[ https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901866#comment-14901866 ] Sean Owen commented on SPARK-10739: --- I'm sure we've discussed this one before and that there's a JIRA for it ... but can't for the life of me find it. I feel like [~sandyr] or [~vanzin] commented on it. The question was how far back you look when considering whether "a lot" of failures have occurred, etc. > Add attempt window for long running Spark application on Yarn > - > > Key: SPARK-10739 > URL: https://issues.apache.org/jira/browse/SPARK-10739 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: Saisai Shao >Priority: Minor > > Currently Spark on YARN uses a maximum attempt count to control failures: once an > application's failure count reaches the maximum, the RM will no longer recover the > application. This is not very effective for long-running applications, which can > easily exceed the maximum over a long period, while setting a very large maximum > hides real problems. So here we introduce an attempt window: attempts that fall > outside the window are ignored when counting failures. This mechanism was > introduced in Hadoop 2.6+ to support long-running applications and is quite > useful for Spark Streaming, Spark shell, and similar long-running applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
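The windowing idea the ticket describes is straightforward to sketch: only failures whose timestamps fall inside a validity window count toward the maximum attempts. A plain-Python illustration (the function and parameter names here are hypothetical, not the YARN or Spark API):

```python
import time

def should_fail_app(failure_times, max_attempts, window_seconds, now=None):
    """Sketch of the attempt-window idea: only failures whose timestamps fall
    inside the validity window count toward max_attempts. All names here are
    hypothetical; this is not the YARN or Spark API."""
    now = time.time() if now is None else now
    recent = [t for t in failure_times if now - t <= window_seconds]
    return len(recent) >= max_attempts

# Three failures overall, but only one within the last hour, so a
# long-running app with max_attempts=2 keeps running.
failures = [0.0, 10.0, 7200.0]
assert should_fail_app(failures, max_attempts=2, window_seconds=3600, now=7201.0) is False

# A second failure inside the window tips it over the limit.
assert should_fail_app(failures + [7300.0], max_attempts=2, window_seconds=3600, now=7301.0) is True
```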
[jira] [Commented] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901951#comment-14901951 ] Yi Zhou commented on SPARK-10733: - Key SQL Query:
{code:sql}
INSERT INTO TABLE test_table
SELECT
  ss.ss_customer_sk AS cid,
  count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1,
  count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3,
  count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5,
  count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7,
  count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9,
  count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
  count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
  count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
  count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2,
  count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4,
  count(CASE WHEN i.i_class_id=6 THEN 1 ELSE NULL END) AS id6,
  count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8,
  count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
  count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
  count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
FROM store_sales ss
INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
WHERE i.i_category IN ('Books')
AND ss.ss_customer_sk IS NOT NULL
GROUP BY ss.ss_customer_sk
HAVING count(ss.ss_item_sk) > 5
{code}
Note: store_sales is a large fact table and item is a small dimension table. > TungstenAggregation cannot acquire page after switching to sort-based > - > > Key: SPARK-10733 > URL: https://issues.apache.org/jira/browse/SPARK-10733 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Blocker > > This is uncovered after fixing SPARK-10474. 
Stack trace: > {code} > 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage > 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 > bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10668) Use WeightedLeastSquares in LinearRegression with L2 regularization if the number of features is small
[ https://issues.apache.org/jira/browse/SPARK-10668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10668: -- Assignee: Kai Sasaki > Use WeightedLeastSquares in LinearRegression with L2 regularization if the > number of features is small > -- > > Key: SPARK-10668 > URL: https://issues.apache.org/jira/browse/SPARK-10668 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Kai Sasaki >Priority: Critical > > If the number of features is small (<=4096) and the regularization is L2, we > should use WeightedLeastSquares to solve the problem rather than L-BFGS. The > former requires only one pass to the data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10691) Make LogisticRegressionModel's evaluate method public
[ https://issues.apache.org/jira/browse/SPARK-10691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900818#comment-14900818 ] Xiangrui Meng commented on SPARK-10691: --- Another option is `score`, following scikit-learn. I don't have strong preference between the two. But I'm thinking about how to make the output metrics configurable. > Make LogisticRegressionModel's evaluate method public > - > > Key: SPARK-10691 > URL: https://issues.apache.org/jira/browse/SPARK-10691 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.5.0 >Reporter: Hao Ren > > The following method in {{LogisticRegressionModel}} is marked as {{private}}, > which prevents users from creating a summary on any given data set. Check > [here|https://github.com/feynmanliang/spark/blob/d219fa4c216e8f35b71a26921561104d15cd6055/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala#L272]. > {code} > // TODO: decide on a good name before exposing to public API > private[classification] def evaluate(dataset: DataFrame) > : LogisticRegressionSummary = { > new BinaryLogisticRegressionSummary( > this.transform(dataset), > $(probabilityCol), > $(labelCol)) > } > {code} > This method is definitely necessary to test model performance. > By the way, the name {{evaluate}} is already pretty good for me. > [~mengxr] Could you check this ? Thx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10728) Failed to set Jenkins Identity header on email.
Xiangrui Meng created SPARK-10728: - Summary: Failed to set Jenkins Identity header on email. Key: SPARK-10728 URL: https://issues.apache.org/jira/browse/SPARK-10728 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.6.0 Reporter: Xiangrui Meng Assignee: Josh Rosen Saw couple Jenkins build failures due to "Failed to set Jenkins Identity header on email", e.g., https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3572/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/consoleFull {code} [error] running /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/build/sbt -Phadoop-1 -Dhadoop.version=2.0.0-mr1-cdh4.1.1 -Phive -Pkinesis-asl -Phive-thriftserver test ; received return code 143 Build step 'Execute shell' marked build as failure Archiving artifacts Recording test results ERROR: Failed to set Jenkins Identity header on email. java.lang.NullPointerException at org.jenkinsci.main.modules.instance_identity.InstanceIdentity.get(InstanceIdentity.java:126) at jenkins.plugins.mailer.tasks.MimeMessageBuilder.setJenkinsInstanceIdent(MimeMessageBuilder.java:188) at jenkins.plugins.mailer.tasks.MimeMessageBuilder.buildMimeMessage(MimeMessageBuilder.java:166) at hudson.tasks.MailSender.createEmptyMail(MailSender.java:391) at hudson.tasks.MailSender.createFailureMail(MailSender.java:260) at hudson.tasks.MailSender.createMail(MailSender.java:178) at hudson.tasks.MailSender.run(MailSender.java:107) at hudson.tasks.Mailer.perform(Mailer.java:141) at hudson.tasks.BuildStepCompatibilityLayer.perform(BuildStepCompatibilityLayer.java:75) at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20) at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:779) at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:726) at hudson.model.Build$BuildExecution.post2(Build.java:185) at 
hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:671) at hudson.model.Run.execute(Run.java:1766) at hudson.matrix.MatrixRun.run(MatrixRun.java:146) at hudson.model.ResourceController.execute(ResourceController.java:98) at hudson.model.Executor.run(Executor.java:408) Sending e-mails to: spark-bu...@databricks.com rosenvi...@gmail.com Finished: FAILURE {code} The workaround documented on https://issues.jenkins-ci.org/browse/JENKINS-26740 is to downgrade mailer to 1.12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10668) Use WeightedLeastSquares in LinearRegression with L2 regularization if the number of features is small
[ https://issues.apache.org/jira/browse/SPARK-10668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900807#comment-14900807 ] Xiangrui Meng commented on SPARK-10668: --- [~lewuathe] This JIRA blocks several others. Could you try to submit a PR in 2 days? > Use WeightedLeastSquares in LinearRegression with L2 regularization if the > number of features is small > -- > > Key: SPARK-10668 > URL: https://issues.apache.org/jira/browse/SPARK-10668 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Priority: Critical > > If the number of features is small (<=4096) and the regularization is L2, we > should use WeightedLeastSquares to solve the problem rather than L-BFGS. The > former requires only one pass to the data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10688) Python API for AFTSurvivalRegression
[ https://issues.apache.org/jira/browse/SPARK-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10688: -- Assignee: (was: Yanbo Liang) > Python API for AFTSurvivalRegression > > > Key: SPARK-10688 > URL: https://issues.apache.org/jira/browse/SPARK-10688 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Xiangrui Meng > Labels: starter > > After SPARK-10686, we should add Python API for AFTSurvivalRegression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9301) collect_set and collect_list aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900823#comment-14900823 ] Nick Buroojy commented on SPARK-9301: - I sent a pull request to add these aggregates on the new API; however, I now see that this may be blocked by SPARK-9830 (https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14728451=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14728451). Let me know if the next step on this is to wait for the blocking change. > collect_set and collect_list aggregate functions > > > Key: SPARK-9301 > URL: https://issues.apache.org/jira/browse/SPARK-9301 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > A short introduction on how to build aggregate functions based on our new > interface can be found at > https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
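For readers unfamiliar with the proposed aggregates: {{collect_list}} gathers all values per group keeping duplicates, while {{collect_set}} de-duplicates them. A plain-Python analog of those semantics (illustrative only; the ticket is about exposing them through the new aggregate interface):

```python
from collections import defaultdict

# Plain-Python analog of the collect_list / collect_set semantics this
# ticket proposes for the new aggregate interface (illustrative only).
rows = [("a", 1), ("a", 2), ("a", 2), ("b", 3)]

grouped = defaultdict(list)
for key, value in rows:
    grouped[key].append(value)

collect_list = dict(grouped)                                   # keeps duplicates
collect_set = {k: sorted(set(v)) for k, v in grouped.items()}  # drops duplicates

assert collect_list == {"a": [1, 2, 2], "b": [3]}
assert collect_set == {"a": [1, 2], "b": [3]}
```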
[jira] [Updated] (SPARK-10691) Make LogisticRegressionModel.evaluate() method public
[ https://issues.apache.org/jira/browse/SPARK-10691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hao Ren updated SPARK-10691: Summary: Make LogisticRegressionModel.evaluate() method public (was: Make LogisticRegressionModel's evaluate method public) > Make LogisticRegressionModel.evaluate() method public > - > > Key: SPARK-10691 > URL: https://issues.apache.org/jira/browse/SPARK-10691 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.5.0 >Reporter: Hao Ren > > The following method in {{LogisticRegressionModel}} is marked as {{private}}, > which prevents users from creating a summary on any given data set. Check > [here|https://github.com/feynmanliang/spark/blob/d219fa4c216e8f35b71a26921561104d15cd6055/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala#L272]. > {code} > // TODO: decide on a good name before exposing to public API > private[classification] def evaluate(dataset: DataFrame) > : LogisticRegressionSummary = { > new BinaryLogisticRegressionSummary( > this.transform(dataset), > $(probabilityCol), > $(labelCol)) > } > {code} > This method is definitely necessary to test model performance. > By the way, the name {{evaluate}} is already pretty good for me. > [~mengxr] Could you check this ? Thx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9836) Provide R-like summary statistics for ordinary least squares via normal equation solver
[ https://issues.apache.org/jira/browse/SPARK-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900802#comment-14900802 ] Xiangrui Meng commented on SPARK-9836: -- This JIRA is still blocked by SPARK-10668. If you are a first-time contributor, I would recommend taking a starter task: https://issues.apache.org/jira/issues/?filter=12333209. > Provide R-like summary statistics for ordinary least squares via normal > equation solver > --- > > Key: SPARK-9836 > URL: https://issues.apache.org/jira/browse/SPARK-9836 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xiangrui Meng > > In R, model fitting comes with summary statistics. We can provide most of > those via normal equation solver (SPARK-9834). If some statistics requires > additional passes to the dataset, we can expose an option to let users select > desired statistics before model fitting. > {code} > > summary(model) > Call: > glm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris) > Deviance Residuals: > Min1QMedian3Q Max > -1.30711 -0.25713 -0.05325 0.19542 1.41253 > Coefficients: > Estimate Std. Error t value Pr(>|t|) > (Intercept) 2.2514 0.3698 6.089 9.57e-09 *** > Sepal.Width 0.8036 0.1063 7.557 4.19e-12 *** > Speciesversicolor 1.4587 0.1121 13.012 < 2e-16 *** > Speciesvirginica1.9468 0.1000 19.465 < 2e-16 *** > --- > Signif. codes: > 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > (Dispersion parameter for gaussian family taken to be 0.1918059) > Null deviance: 102.168 on 149 degrees of freedom > Residual deviance: 28.004 on 146 degrees of freedom > AIC: 183.94 > Number of Fisher Scoring iterations: 2 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
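As a small sanity check on the R summary above, the {{t value}} column is simply each coefficient estimate divided by its standard error. Recomputing it from the displayed values (which match only to display precision):

```python
# Recompute R's "t value" column from the Estimate and Std. Error columns
# shown in the summary above (values copied at display precision).
coefficients = {
    "(Intercept)":       (2.2514, 0.3698, 6.089),
    "Sepal.Width":       (0.8036, 0.1063, 7.557),
    "Speciesversicolor": (1.4587, 0.1121, 13.012),
    "Speciesvirginica":  (1.9468, 0.1000, 19.465),
}
for name, (estimate, std_error, t_value) in coefficients.items():
    assert abs(estimate / std_error - t_value) < 0.01, name
```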
[jira] [Updated] (SPARK-10688) Python API for AFTSurvivalRegression
[ https://issues.apache.org/jira/browse/SPARK-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10688: -- Labels: starter (was: ) > Python API for AFTSurvivalRegression > > > Key: SPARK-10688 > URL: https://issues.apache.org/jira/browse/SPARK-10688 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Xiangrui Meng > Labels: starter > > After SPARK-10686, we should add Python API for AFTSurvivalRegression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7989) Fix flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite
[ https://issues.apache.org/jira/browse/SPARK-7989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900811#comment-14900811 ] Apache Spark commented on SPARK-7989: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/8813 > Fix flaky tests in ExternalShuffleServiceSuite and > SparkListenerWithClusterSuite > > > Key: SPARK-7989 > URL: https://issues.apache.org/jira/browse/SPARK-7989 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Critical > Labels: flaky-test > Fix For: 1.4.1, 1.5.0 > > > The flaky tests in ExternalShuffleServiceSuite and > SparkListenerWithClusterSuite will fail if there are not enough executors up > before running the jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8118) Turn off noisy log output produced by Parquet 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900941#comment-14900941 ] Cheng Lian commented on SPARK-8118: --- I believe PARQUET-369 is the root cause of this issue, and unfortunately I couldn't figure out a way that works all the time to fix the Parquet log redirection issue (even through hacky reflection tricks). The best way to fix this issue is to stop shading slf4j in parquet-format. > Turn off noisy log output produced by Parquet 1.7.0 > --- > > Key: SPARK-8118 > URL: https://issues.apache.org/jira/browse/SPARK-8118 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.4.1, 1.5.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > Fix For: 1.5.0 > > > Parquet 1.7.0 renames its package to "org.apache.parquet", so we need to adjust > {{ParquetRelation.enableLogForwarding}} accordingly to avoid noisy log output. > A better approach than simply muting these log lines is to redirect Parquet > logs via SLF4J, so that we can handle them consistently. In general these > logs are very useful, especially when diagnosing Parquet memory issues and > filter push-down. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats
[ https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900991#comment-14900991 ] Seth Hendrickson commented on SPARK-10602: -- My branch is here: [SPARK-10641|https://github.com/sethah/spark/tree/SPARK-10641], the implementation is mostly [here|https://github.com/sethah/spark/blob/SPARK-10641/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala] Note that I wrote a few tests but they aren't passing currently (also scalastyle will fail) as this is a WIP. The update rules are based on descriptions [here|https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics]. I also made a pass at implementing a numerically stable update for skewness and kurtosis following the [Kahan update|https://en.wikipedia.org/wiki/Kahan_summation_algorithm]. The implementation follows the algorithms described [here|http://researcher.watson.ibm.com/researcher/files/us-ytian/stability.pdf]. I am not sure that they are working and I haven't been able to find a great way to test for numerical stability but I'll try to get that worked out soon. Also note that the legacy way of implementing aggregates is still required, but I have simply issued placeholders for the {{AggregateFunction1}} implementations for now. > Univariate statistics as UDAFs: single-pass continuous stats > > > Key: SPARK-10602 > URL: https://issues.apache.org/jira/browse/SPARK-10602 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Joseph K. Bradley >Assignee: Seth Hendrickson > > See parent JIRA for more details. This subtask covers statistics for > continuous values requiring a single pass over the data, such as min and max. > This JIRA is an umbrella. For individual stats, please create and link a new > JIRA. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
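For readers following along, the single-pass update rules in the linked Wikipedia article boil down to a Welford-style recurrence. Here is a minimal mean/variance version in plain Python (illustrative only, not the Catalyst implementation, which also covers skewness, kurtosis, and the Kahan-compensated variants):

```python
def streaming_mean_variance(xs):
    """Single-pass (Welford-style) mean/variance recurrence of the kind
    described in the linked article; illustrative only, not the Catalyst
    implementation."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # second factor uses the updated mean
    # Population variance; use m2 / (n - 1) for the sample variance.
    return mean, (m2 / n if n else 0.0)

mean, var = streaming_mean_variance([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
assert abs(mean - 5.0) < 1e-9
assert abs(var - 4.0) < 1e-9
```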
[jira] [Created] (SPARK-10729) word2vec model save for python
Joseph A Gartner III created SPARK-10729: Summary: word2vec model save for python Key: SPARK-10729 URL: https://issues.apache.org/jira/browse/SPARK-10729 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.0, 1.4.1 Reporter: Joseph A Gartner III The ability to save a word2vec model has not been ported to python, and would be extremely useful to have given the long training period. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10474) Aggregation failed with unable to acquire memory
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901043#comment-14901043 ] Andrew Or commented on SPARK-10474: --- I see, thanks for reporting. I think the fix was correct, but it uncovered new problems because previously we would have failed before hitting it. I'll investigate again. > Aggregation failed with unable to acquire memory > > > Key: SPARK-10474 > URL: https://issues.apache.org/jira/browse/SPARK-10474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.6.0, 1.5.1 > > > In aggregation case, a Lost task happened with below error. > {code} > java.io.IOException: Could not acquire 65536 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Key SQL Query > {code:sql} > INSERT INTO TABLE test_table > SELECT > ss.ss_customer_sk AS cid, > count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1, > count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3, > count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5, > count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7, > count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9, > count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11, > count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13, > count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15, > count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2, > count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4, > count(CASE WHEN i.i_class_id=6 THEN 1 ELSE NULL END) AS id6, > count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8, > count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL 
END) AS id10, > count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14, > count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16 > FROM store_sales ss > INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk > WHERE i.i_category IN ('Books') > AND ss.ss_customer_sk IS NOT NULL > GROUP BY ss.ss_customer_sk > HAVING count(ss.ss_item_sk) > 5 > {code} > Note: > the store_sales is a big fact table and item is a small dimension table. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (SPARK-10723) Add RDD.reduceOption method
[ https://issues.apache.org/jira/browse/SPARK-10723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10723. --- Resolution: Won't Fix > Add RDD.reduceOption method > --- > > Key: SPARK-10723 > URL: https://issues.apache.org/jira/browse/SPARK-10723 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Reporter: Tatsuya Atsumi >Priority: Minor > > h2. Problem > RDD.reduce throws an exception if the RDD is empty. > This is appropriate behavior when the RDD is expected to be non-empty, but when it is > not known until runtime whether the RDD is empty, the call must be wrapped in > try-catch to invoke reduce safely. > Example Code > {code} > // This RDD may be empty or not > val rdd: RDD[Int] = originalRdd.filter(_ > 10) > val reduced: Option[Int] = try { > Some(rdd.reduce(_ + _)) > } catch { > // if rdd is empty return None. > case e: UnsupportedOperationException => None > } > {code} > h2. Improvement idea > Scala’s List has a reduceOption method, which returns None if the List is empty. > If RDD had a reduceOption API like Scala’s List, the above case would become easy > to handle. > Example Code > {code} > val reduced: Option[Int] = originalRdd.filter(_ > 10).reduceOption(_ + _) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
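The requested semantics, and why an emptiness check covers them, can be sketched in plain Python: test for emptiness and return None instead of raising (illustrative only; the proposal itself was resolved Won't Fix in favor of calling {{RDD.isEmpty}} first):

```python
from functools import reduce

def reduce_option(xs, f):
    """Plain-Python analog of the proposed reduceOption: return None for an
    empty input instead of raising (mirrors Scala's List.reduceOption)."""
    xs = list(xs)  # analogous to checking RDD.isEmpty before reducing
    return reduce(f, xs) if xs else None

# This collection may or may not be empty after filtering.
values = [x for x in range(20) if x > 10]
assert reduce_option(values, lambda a, b: a + b) == 135  # 11 + 12 + ... + 19
assert reduce_option([], lambda a, b: a + b) is None
```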
[jira] [Resolved] (SPARK-10626) Create a Java friendly method for randomRDD & RandomDataGenerator on RandomRDDs.
[ https://issues.apache.org/jira/browse/SPARK-10626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10626. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8782 [https://github.com/apache/spark/pull/8782] > Create a Java friendly method for randomRDD & RandomDataGenerator on > RandomRDDs. > > > Key: SPARK-10626 > URL: https://issues.apache.org/jira/browse/SPARK-10626 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: holdenk >Priority: Minor > Fix For: 1.6.0 > > > SPARK-3136 added a large number of functions for creating Java RandomRDDs, > but for people that want to use custom RandomDataGenerators we should make a > Java friendly method.
[jira] [Updated] (SPARK-10626) Create a Java friendly method for randomRDD & RandomDataGenerator on RandomRDDs.
[ https://issues.apache.org/jira/browse/SPARK-10626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10626: -- Assignee: holdenk > Create a Java friendly method for randomRDD & RandomDataGenerator on > RandomRDDs. > > > Key: SPARK-10626 > URL: https://issues.apache.org/jira/browse/SPARK-10626 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: holdenk >Assignee: holdenk >Priority: Minor > Fix For: 1.6.0 > > > SPARK-3136 added a large number of functions for creating Java RandomRDDs, > but for people that want to use custom RandomDataGenerators we should make a > Java friendly method.
[jira] [Commented] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly
[ https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901082#comment-14901082 ] Jon Buffington commented on SPARK-5569: --- Are there any workarounds for this limitation? We are unable to track offsets using Kafka Direct stream and use checkpoints. Our thinking is we need to abandon checkpoints and manage recovery outside of Spark checkpoints while this limitation exists. For reference with 1.4.1, we get the following: {{monospaced}} ... 15/09/21 13:50:53 WARN CheckpointReader: Error reading checkpoint from file file:/tmp/page_view_events_cp/checkpoint-144285750 java.io.IOException: java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka.OffsetRange ... {{monospaced}} > Checkpoints cannot reference classes defined outside of Spark's assembly > > > Key: SPARK-5569 > URL: https://issues.apache.org/jira/browse/SPARK-5569 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Patrick Wendell > > Not sure if this is a bug or a feature, but it's not obvious, so wanted to > create a JIRA to make sure we document this behavior. 
> First documented by Cody Koeninger: > https://gist.github.com/koeninger/561a61482cd1b5b3600c > {code} > 15/01/12 16:07:07 INFO CheckpointReader: Attempting to load checkpoint from > file file:/var/tmp/cp/checkpoint-142110041.bk > 15/01/12 16:07:07 WARN CheckpointReader: Error reading checkpoint from file > file:/var/tmp/cp/checkpoint-142110041.bk > java.io.IOException: java.lang.ClassNotFoundException: > org.apache.spark.rdd.kafka.KafkaRDDPartition > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1043) > at > org.apache.spark.streaming.dstream.DStreamCheckpointData.readObject(DStreamCheckpointData.scala:146) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) 
> at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$readObject$1.apply$mcV$sp(DStreamGraph.scala:180) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1040) > at > org.apache.spark.streaming.DStreamGraph.readObject(DStreamGraph.scala:176) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at
[jira] [Comment Edited] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly
[ https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901082#comment-14901082 ] Jon Buffington edited comment on SPARK-5569 at 9/21/15 5:57 PM: Are there any workarounds for this limitation? We are unable to track offsets using Kafka Direct stream and use checkpoints. Our thinking is we need to abandon checkpoints and manage recovery outside of Spark checkpoints while this limitation exists. For reference with 1.4.1, we get the following: {{ ... 15/09/21 13:50:53 WARN CheckpointReader: Error reading checkpoint from file file:/tmp/page_view_events_cp/checkpoint-144285750 java.io.IOException: java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka.OffsetRange ... }} was (Author: jon_fuseelements): Are there any workarounds for this limitation? We are unable to track offsets using Kafka Direct stream and use checkpoints. Our thinking is we need to abandon checkpoints and manage recovery outside of Spark checkpoints while this limitation exists. For reference with 1.4.1, we get the following: {{monospaced}} ... 15/09/21 13:50:53 WARN CheckpointReader: Error reading checkpoint from file file:/tmp/page_view_events_cp/checkpoint-144285750 java.io.IOException: java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka.OffsetRange ... {{monospaced}} > Checkpoints cannot reference classes defined outside of Spark's assembly > > > Key: SPARK-5569 > URL: https://issues.apache.org/jira/browse/SPARK-5569 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Patrick Wendell > > Not sure if this is a bug or a feature, but it's not obvious, so wanted to > create a JIRA to make sure we document this behavior. 
> First documented by Cody Koeninger: > https://gist.github.com/koeninger/561a61482cd1b5b3600c > {code} > 15/01/12 16:07:07 INFO CheckpointReader: Attempting to load checkpoint from > file file:/var/tmp/cp/checkpoint-142110041.bk > 15/01/12 16:07:07 WARN CheckpointReader: Error reading checkpoint from file > file:/var/tmp/cp/checkpoint-142110041.bk > java.io.IOException: java.lang.ClassNotFoundException: > org.apache.spark.rdd.kafka.KafkaRDDPartition > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1043) > at > org.apache.spark.streaming.dstream.DStreamCheckpointData.readObject(DStreamCheckpointData.scala:146) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) 
> at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$readObject$1.apply$mcV$sp(DStreamGraph.scala:180) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1040) > at > org.apache.spark.streaming.DStreamGraph.readObject(DStreamGraph.scala:176) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at >
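The plan floated in the comment above (abandon checkpoints and manage recovery yourself) usually comes down to committing offsets atomically once a batch has been fully processed, then resuming from the committed offsets on restart. A generic sketch in plain Python, under loud assumptions: a file-based store, illustrative names, and nothing here is a Spark API:

```python
import json
import os

OFFSET_FILE = "offsets.json"  # illustrative location for committed offsets

def load_offsets(default=None):
    """Return the last committed offsets, or a default on first run."""
    if not os.path.exists(OFFSET_FILE):
        return default if default is not None else {}
    with open(OFFSET_FILE) as f:
        return json.load(f)

def commit_offsets(offsets):
    """Persist offsets atomically so a crash never leaves a torn file."""
    tmp = OFFSET_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(offsets, f)
    os.replace(tmp, OFFSET_FILE)  # atomic rename on POSIX

# After a batch completes, commit; on restart, resume from the last commit.
commit_offsets({"topic-0": 1042, "topic-1": 998})
print(load_offsets())
```

In practice the store would be something transactional (a database, ZooKeeper) rather than a local file, and the commit would ideally happen in the same transaction as the batch's output.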
[jira] [Commented] (SPARK-10731) The head() implementation of dataframe is very slow
[ https://issues.apache.org/jira/browse/SPARK-10731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901092#comment-14901092 ] Yin Huai commented on SPARK-10731: -- Looks like the problem is df.collect does not work well with limit. In Scala, {{df.limit(1).rdd.count()}} will also trigger the problem. > The head() implementation of dataframe is very slow > --- > > Key: SPARK-10731 > URL: https://issues.apache.org/jira/browse/SPARK-10731 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.1, 1.5.0 >Reporter: Jerry Lam > Labels: pyspark > > df=sqlContext.read.parquet("someparquetfiles") > df.head() > The above lines take over 15 minutes. It seems the dataframe requires 3 > stages to return the first row. It reads all data (which is about 1 billion > rows) and runs Limit twice. The take(1) implementation in the RDD performs > much better.
[jira] [Updated] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10474: -- Summary: TungstenAggregation cannot acquire memory for pointer array after switching to sort-based (was: TungstenAggregation cannot acquire memory for pointer array) > TungstenAggregation cannot acquire memory for pointer array after switching > to sort-based > - > > Key: SPARK-10474 > URL: https://issues.apache.org/jira/browse/SPARK-10474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.6.0, 1.5.1 > > > In aggregation case, a Lost task happened with below error. > {code} > java.io.IOException: Could not acquire 65536 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > 
at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Key SQL Query > {code:sql} > INSERT INTO TABLE test_table > SELECT > ss.ss_customer_sk AS cid, > count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1, > count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3, > count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5, > count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7, > count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9, > count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11, > count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13, > count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15, > count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2, > count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4, > count(CASE WHEN i.i_class_id=6 THEN 1 ELSE NULL END) AS id6, > count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8, > count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL 
END) AS id10, > count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14, > count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16 > FROM store_sales ss > INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk > WHERE i.i_category IN ('Books') > AND ss.ss_customer_sk IS NOT NULL > GROUP BY ss.ss_customer_sk > HAVING count(ss.ss_item_sk) > 5 > {code} > Note: > the store_sales is a big fact table and item is a small dimension table.
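The query's count-over-CASE pivot pattern (one conditional count per i_class_id) is easy to reproduce on toy data. A sketch using Python's built-in sqlite3; the table contents are made up and trimmed to two class ids:

```python
import sqlite3

# Tiny stand-ins for store_sales and item, just to show the
# count(CASE WHEN ... THEN 1 ELSE NULL END) conditional-count pattern.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE item (i_item_sk INT, i_class_id INT, i_category TEXT);
CREATE TABLE store_sales (ss_item_sk INT, ss_customer_sk INT);
INSERT INTO item VALUES (1, 1, 'Books'), (2, 3, 'Books');
INSERT INTO store_sales VALUES (1, 42), (1, 42), (2, 42);
""")
rows = con.execute("""
SELECT ss.ss_customer_sk AS cid,
       count(CASE WHEN i.i_class_id = 1 THEN 1 ELSE NULL END) AS id1,
       count(CASE WHEN i.i_class_id = 3 THEN 1 ELSE NULL END) AS id3
FROM store_sales ss
INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
WHERE i.i_category IN ('Books')
GROUP BY ss.ss_customer_sk
""").fetchall()
print(rows)  # [(42, 2, 1)]
```

count() skips NULLs, so each CASE turns one conditional count into a column; the real query simply repeats this for sixteen class ids, which is what makes the aggregation hash entries wide.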
[jira] [Created] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based
Andrew Or created SPARK-10733: - Summary: TungstenAggregation cannot acquire page after switching to sort-based Key: SPARK-10733 URL: https://issues.apache.org/jira/browse/SPARK-10733 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker This is uncovered after fixing SPARK-10474. Stack trace: {code} 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 bytes of memory at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488) at org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code}
[jira] [Updated] (SPARK-8696) Streaming API for Online LDA
[ https://issues.apache.org/jira/browse/SPARK-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-8696: - Summary: Streaming API for Online LDA (was: StreamingLDA) > Streaming API for Online LDA > > > Key: SPARK-8696 > URL: https://issues.apache.org/jira/browse/SPARK-8696 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: yuhao yang > > Streaming LDA can be a natural extension from online LDA. > Yet for now we need to settle down the implementation for LDA prediction, to > support the predictOn method in the streaming version.
[jira] [Reopened] (SPARK-10474) Aggregation failed with unable to acquire memory
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reopened SPARK-10474: --- > Aggregation failed with unable to acquire memory > > > Key: SPARK-10474 > URL: https://issues.apache.org/jira/browse/SPARK-10474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.6.0, 1.5.1 > > > In aggregation case, a Lost task happened with below error. > {code} > java.io.IOException: Could not acquire 65536 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Key SQL Query > {code:sql} > INSERT INTO TABLE test_table > SELECT > ss.ss_customer_sk AS cid, > count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1, > count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3, > count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5, > count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7, > count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9, > count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11, > count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13, > count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15, > count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2, > count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4, > count(CASE WHEN i.i_class_id=6 THEN 1 ELSE NULL END) AS id6, > count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8, > count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10, > count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14, > count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS 
id16 > FROM store_sales ss > INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk > WHERE i.i_category IN ('Books') > AND ss.ss_customer_sk IS NOT NULL > GROUP BY ss.ss_customer_sk > HAVING count(ss.ss_item_sk) > 5 > {code} > Note: > the store_sales is a big fact table and item is a small dimension table.