[jira] [Created] (SPARK-15863) Update SQL programming guide for Spark 2.0
Cheng Lian created SPARK-15863: -- Summary: Update SQL programming guide for Spark 2.0 Key: SPARK-15863 URL: https://issues.apache.org/jira/browse/SPARK-15863 Project: Spark Issue Type: Documentation Components: Documentation, SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15696) Improve `crosstab` to have a consistent column order
[ https://issues.apache.org/jira/browse/SPARK-15696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15696. - Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 2.0.0 > Improve `crosstab` to have a consistent column order > - > > Key: SPARK-15696 > URL: https://issues.apache.org/jira/browse/SPARK-15696 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun > Fix For: 2.0.0 > > > Currently, `crosstab` has **random-order** columns obtained by just > `distinct`. Also, the documentation of `crosstab` shows the result in a > sorted order, which differs from the implementation. > {code} > scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, > 2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show() > +---------+---+---+---+ > |key_value| 3| 2| 1| > +---------+---+---+---+ > |2| 1| 0| 2| > |1| 0| 1| 1| > |3| 1| 1| 0| > +---------+---+---+---+ > scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, > "c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", > "value").show() > +---------+---+---+---+ > |key_value| c| a| b| > +---------+---+---+---+ > |2| 1| 2| 0| > |1| 0| 1| 1| > |3| 1| 0| 1| > +---------+---+---+---+ > {code} > This issue explicitly constructs the columns in sorted order to > improve the user experience. This implementation also gives the same result as > the documentation.
> {code} > scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, > 2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show() > +---------+---+---+---+ > |key_value| 1| 2| 3| > +---------+---+---+---+ > |2| 2| 0| 1| > |1| 1| 1| 0| > |3| 0| 1| 1| > +---------+---+---+---+ > scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, > "c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", > "value").show() > +---------+---+---+---+ > |key_value| a| b| c| > +---------+---+---+---+ > |2| 2| 0| 1| > |1| 1| 1| 0| > |3| 0| 1| 1| > +---------+---+---+---+ > {code}
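The sorted-column behavior described in SPARK-15696 can be sketched without Spark. The following is a toy Python helper (not Spark's actual implementation) that builds a contingency table with deterministically sorted columns instead of the arbitrary order a `distinct` pass would return:

```python
from collections import Counter

def crosstab(pairs):
    """Toy contingency table with deterministically sorted columns,
    mirroring the fix above: sort the distinct values rather than using
    whatever order they are first seen in. Not Spark's implementation."""
    counts = Counter(pairs)
    keys = sorted({k for k, _ in pairs}, key=str)
    values = sorted({v for _, v in pairs}, key=str)  # sorted column order
    header = ["key_value"] + [str(v) for v in values]
    rows = [[str(k)] + [str(counts[(k, v)]) for v in values] for k in keys]
    return [header] + rows

# Same data as the second example above; columns come out as a, b, c.
table = crosstab([(1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, "c"),
                  (3, "b"), (3, "c")])
```

This reproduces the column order shown in the "fixed" output above (`a`, `b`, `c`), independent of input order.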
[jira] [Resolved] (SPARK-15791) NPE in ScalarSubquery
[ https://issues.apache.org/jira/browse/SPARK-15791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15791. - Resolution: Fixed Fix Version/s: 2.0.0 > NPE in ScalarSubquery > - > > Key: SPARK-15791 > URL: https://issues.apache.org/jira/browse/SPARK-15791 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Eric Liang > Fix For: 2.0.0 > > > {code} > Job aborted due to stage failure: Task 0 in stage 146.0 failed 4 times, most > recent failure: Lost task 0.3 in stage 146.0 (TID 48828, 10.0.206.208): > java.lang.NullPointerException > at > org.apache.spark.sql.execution.ScalarSubquery.dataType(subquery.scala:45) > at > org.apache.spark.sql.catalyst.expressions.CaseWhenBase.dataType(conditionalExpressions.scala:103) > at > org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:165) > at > org.apache.spark.sql.execution.ProjectExec$$anonfun$output$1.apply(basicPhysicalOperators.scala:33) > at > org.apache.spark.sql.execution.ProjectExec$$anonfun$output$1.apply(basicPhysicalOperators.scala:33) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.execution.ProjectExec.output(basicPhysicalOperators.scala:33) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.output(WholeStageCodegenExec.scala:291) > at > org.apache.spark.sql.execution.DeserializeToObjectExec$$anonfun$2.apply(objects.scala:85) > at > org.apache.spark.sql.execution.DeserializeToObjectExec$$anonfun$2.apply(objects.scala:84) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:775) > at > 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:775) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code}
[jira] [Closed] (SPARK-15842) Add support for socket stream.
[ https://issues.apache.org/jira/browse/SPARK-15842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma closed SPARK-15842. --- Resolution: Not A Problem > Add support for socket stream. > -- > > Key: SPARK-15842 > URL: https://issues.apache.org/jira/browse/SPARK-15842 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Prashant Sharma >Assignee: Prashant Sharma > > Streaming so far has offset based sources with all the available sources like > file-source and memory-source that do not need additional capabilities to > implement offset for any given range. > Socket stream at OS level has a very tiny buffer. Many message queues have > the ability to keep the message lingering until it is read by the receiver > end. ZeroMQ is one such example. However in the case of socket stream, this > is not supported. > The challenge here would be to implement a way to buffer for a configurable > amount of time and discuss strategies for overflow and underflow. > This JIRA will form the basis for implementing sources which do not have > native support for lingering a message for any amount of time until it is > read. It deals with design doc if necessary and supporting code to implement > such sources.
[jira] [Commented] (SPARK-15842) Add support for socket stream.
[ https://issues.apache.org/jira/browse/SPARK-15842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323837#comment-15323837 ] Prashant Sharma commented on SPARK-15842: - Thank you for making it clear. The actual question I had was, "What if we could give exactly-once guarantees only for a configurable amount of time?" In some sense, even a socket stream can have the concept of a per-record offset, by introducing some kind of control bit. But certainly, it does not support the features (like replaying an arbitrary sequence of past data, and so on) that most message queues come with built in. Also, having this would require our own mechanism to support end-to-end exactly-once guarantees, and that is actually non-trivial, as one would need the receiver as a long-running thread and then have to worry about its failover and challenges like scaling. This certainly puts it at odds with the current design of structured streaming. Also, anyone who would like to use a socket stream can always deploy Kafka or a similar message queue as middleware and have all the guarantees that streaming intends to provide. > Add support for socket stream. > -- > > Key: SPARK-15842 > URL: https://issues.apache.org/jira/browse/SPARK-15842 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Prashant Sharma >Assignee: Prashant Sharma > > Streaming so far has offset based sources with all the available sources like > file-source and memory-source that do not need additional capabilities to > implement offset for any given range. > Socket stream at OS level has a very tiny buffer. Many message queues have > the ability to keep the message lingering until it is read by the receiver > end. ZeroMQ is one such example. However in the case of socket stream, this > is not supported. > The challenge here would be to implement a way to buffer for a configurable > amount of time and discuss strategies for overflow and underflow. 
> This JIRA will form the basis for implementing sources which do not have > native support for lingering a message for any amount of time until it is > read. It deals with design doc if necessary and supporting code to implement > such sources.
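The "buffer for a configurable amount of time" idea discussed in this thread can be sketched in plain Python. The class below (a hypothetical sketch with invented names, not any Spark API) keeps records for a retention window so a reader can re-fetch an offset range, and drops records older than the window as its overflow strategy:

```python
import time
from collections import deque

class LingerBuffer:
    """Sketch of a time-bounded linger buffer: records stay re-readable by
    offset for `retention_secs`, then are evicted. Hypothetical helper,
    not part of Spark; `clock` is injectable for testing."""

    def __init__(self, retention_secs, clock=time.monotonic):
        self.retention_secs = retention_secs
        self.clock = clock
        self._buf = deque()   # entries of (offset, timestamp, record)
        self._next_offset = 0

    def append(self, record):
        self._buf.append((self._next_offset, self.clock(), record))
        self._next_offset += 1
        self._evict()

    def get_range(self, start, end):
        """Return still-retained records with start <= offset < end."""
        self._evict()
        return [rec for off, _, rec in self._buf if start <= off < end]

    def _evict(self):
        # Overflow strategy: drop everything older than the retention window.
        cutoff = self.clock() - self.retention_secs
        while self._buf and self._buf[0][1] < cutoff:
            self._buf.popleft()
```

A real source would still need the failover and scaling story discussed above; this only illustrates the retention/eviction mechanics.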
[jira] [Closed] (SPARK-15838) CACHE TABLE AS SELECT should not replace the existing Temp Table
[ https://issues.apache.org/jira/browse/SPARK-15838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li closed SPARK-15838. --- Resolution: Won't Fix > CACHE TABLE AS SELECT should not replace the existing Temp Table > > > Key: SPARK-15838 > URL: https://issues.apache.org/jira/browse/SPARK-15838 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Currently, {{CACHE TABLE AS SELECT}} replaces the existing Temp Table, if > existed. This behavior is different from `CREATE TABLE` or `CREATE VIEW`. It > looks risky. > Better Error Message When Having Database Name in CACHE TABLE AS SELECT
[jira] [Commented] (SPARK-15862) Better Error Message When Having Database Name in CACHE TABLE AS SELECT
[ https://issues.apache.org/jira/browse/SPARK-15862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323828#comment-15323828 ] Apache Spark commented on SPARK-15862: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/13572 > Better Error Message When Having Database Name in CACHE TABLE AS SELECT > --- > > Key: SPARK-15862 > URL: https://issues.apache.org/jira/browse/SPARK-15862 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Priority: Minor > > The table name in CACHE TABLE AS SELECT should NOT contain database prefix > like "database.table". Thus, this PR captures this in Parser and outputs a > better error message, instead of reporting the view already exists. > In addition, in this JIRA, we have a few issues that need to be addressed: 1) > refactor the Parser to generate table identifiers instead of returning the > table name string; 2) add test case for caching and uncaching qualified table > names; 3) fix a few test cases that do not drop temp table at the end;
[jira] [Assigned] (SPARK-15862) Better Error Message When Having Database Name in CACHE TABLE AS SELECT
[ https://issues.apache.org/jira/browse/SPARK-15862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15862: Assignee: Apache Spark > Better Error Message When Having Database Name in CACHE TABLE AS SELECT > --- > > Key: SPARK-15862 > URL: https://issues.apache.org/jira/browse/SPARK-15862 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Minor > > The table name in CACHE TABLE AS SELECT should NOT contain database prefix > like "database.table". Thus, this PR captures this in Parser and outputs a > better error message, instead of reporting the view already exists. > In addition, in this JIRA, we have a few issues that need to be addressed: 1) > refactor the Parser to generate table identifiers instead of returning the > table name string; 2) add test case for caching and uncaching qualified table > names; 3) fix a few test cases that do not drop temp table at the end;
[jira] [Assigned] (SPARK-15862) Better Error Message When Having Database Name in CACHE TABLE AS SELECT
[ https://issues.apache.org/jira/browse/SPARK-15862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15862: Assignee: (was: Apache Spark) > Better Error Message When Having Database Name in CACHE TABLE AS SELECT > --- > > Key: SPARK-15862 > URL: https://issues.apache.org/jira/browse/SPARK-15862 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Priority: Minor > > The table name in CACHE TABLE AS SELECT should NOT contain database prefix > like "database.table". Thus, this PR captures this in Parser and outputs a > better error message, instead of reporting the view already exists. > In addition, in this JIRA, we have a few issues that need to be addressed: 1) > refactor the Parser to generate table identifiers instead of returning the > table name string; 2) add test case for caching and uncaching qualified table > names; 3) fix a few test cases that do not drop temp table at the end;
[jira] [Created] (SPARK-15862) Better Error Message When Having Database Name in CACHE TABLE AS SELECT
Xiao Li created SPARK-15862: --- Summary: Better Error Message When Having Database Name in CACHE TABLE AS SELECT Key: SPARK-15862 URL: https://issues.apache.org/jira/browse/SPARK-15862 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li Priority: Minor The table name in CACHE TABLE AS SELECT should NOT contain database prefix like "database.table". Thus, this PR captures this in Parser and outputs a better error message, instead of reporting the view already exists. In addition, in this JIRA, we have a few issues that need to be addressed: 1) refactor the Parser to generate table identifiers instead of returning the table name string; 2) add test case for caching and uncaching qualified table names; 3) fix a few test cases that do not drop temp table at the end;
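The parser-side check proposed in SPARK-15862 amounts to rejecting a database-qualified name up front with a clear message, instead of letting it surface later as a misleading "view already exists" error. A minimal Python sketch of that validation (hypothetical helper name, not Spark's actual parser code):

```python
def check_cache_table_name(name):
    """Reject a database-qualified table name for CACHE TABLE AS SELECT
    with an explicit message. Hypothetical sketch of the parser-side
    check described in SPARK-15862, not Spark's implementation."""
    parts = name.split(".")
    if len(parts) > 1:
        raise ValueError(
            "CACHE TABLE AS SELECT does not allow a database prefix "
            "in the table name: '%s'" % name)
    return parts[0]
```

With this kind of guard, `check_cache_table_name("t1")` succeeds while `"db.t1"` fails fast with the specific error, which is the user-experience improvement the issue asks for.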
[jira] [Updated] (SPARK-15838) CACHE TABLE AS SELECT should not replace the existing Temp Table
[ https://issues.apache.org/jira/browse/SPARK-15838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-15838: Description: -Currently, {{CACHE TABLE AS SELECT}} replaces the existing Temp Table, if existed. This behavior is different from `CREATE TABLE` or `CREATE VIEW`. It looks risky.- Better Error Message When Having Database Name in CACHE TABLE AS SELECT was:Currently, {{CACHE TABLE AS SELECT}} replaces the existing Temp Table, if existed. This behavior is different from `CREATE TABLE` or `CREATE VIEW`. It looks risky. > CACHE TABLE AS SELECT should not replace the existing Temp Table > > > Key: SPARK-15838 > URL: https://issues.apache.org/jira/browse/SPARK-15838 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > -Currently, {{CACHE TABLE AS SELECT}} replaces the existing Temp Table, if > existed. This behavior is different from `CREATE TABLE` or `CREATE VIEW`. It > looks risky.- > Better Error Message When Having Database Name in CACHE TABLE AS SELECT
[jira] [Updated] (SPARK-15838) CACHE TABLE AS SELECT should not replace the existing Temp Table
[ https://issues.apache.org/jira/browse/SPARK-15838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-15838: Description: Currently, {{CACHE TABLE AS SELECT}} replaces the existing Temp Table, if existed. This behavior is different from `CREATE TABLE` or `CREATE VIEW`. It looks risky. Better Error Message When Having Database Name in CACHE TABLE AS SELECT was: -Currently, {{CACHE TABLE AS SELECT}} replaces the existing Temp Table, if existed. This behavior is different from `CREATE TABLE` or `CREATE VIEW`. It looks risky.- Better Error Message When Having Database Name in CACHE TABLE AS SELECT > CACHE TABLE AS SELECT should not replace the existing Temp Table > > > Key: SPARK-15838 > URL: https://issues.apache.org/jira/browse/SPARK-15838 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Currently, {{CACHE TABLE AS SELECT}} replaces the existing Temp Table, if > existed. This behavior is different from `CREATE TABLE` or `CREATE VIEW`. It > looks risky. > Better Error Message When Having Database Name in CACHE TABLE AS SELECT
[jira] [Updated] (SPARK-15861) pyspark mapPartitions with none generator functions / functors
[ https://issues.apache.org/jira/browse/SPARK-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Bowyer updated SPARK-15861: Description: Hi all, it appears that the method `rdd.mapPartitions` does odd things if it is fed a normal subroutine. For instance, lets say we have the following {code} rows = range(25) rows = [rows[i:i+5] for i in range(0, len(rows), 5)] rdd = sc.parallelize(rows) def to_np(data): return np.array(list(data)) rdd.mapPartitions(to_np).collect() ... [array([0, 1, 2, 3, 4]), array([5, 6, 7, 8, 9]), array([10, 11, 12, 13, 14]), array([15, 16, 17, 18, 19]), array([20, 21, 22, 23, 24])] rdd.mapPartitions(to_np, preservePartitioning=True).collect() ... [array([0, 1, 2, 3, 4]), array([5, 6, 7, 8, 9]), array([10, 11, 12, 13, 14]), array([15, 16, 17, 18, 19]), array([20, 21, 22, 23, 24])] {code} This basically makes the provided function that did return act like the end user called {code}rdd.map{code} I think that maybe a check should be put in to call {code}inspect.isgeneratorfunction{code} ? was: Hi all, it appears that the method `rdd.mapPartitions` does odd things if it is fed a normal subroutine. For instance, lets say we have the following {code:python} rows = range(25) rows = [rows[i:i+5] for i in range(0, len(rows), 5)] rdd = sc.parallelize(rows) def to_np(data): return np.array(list(data)) rdd.mapPartitions(to_np).collect() ... [array([0, 1, 2, 3, 4]), array([5, 6, 7, 8, 9]), array([10, 11, 12, 13, 14]), array([15, 16, 17, 18, 19]), array([20, 21, 22, 23, 24])] rdd.mapPartitions(to_np, preservePartitioning=True).collect() ... [array([0, 1, 2, 3, 4]), array([5, 6, 7, 8, 9]), array([10, 11, 12, 13, 14]), array([15, 16, 17, 18, 19]), array([20, 21, 22, 23, 24])] {code} This basically makes the provided function that did return act like the end user called {code}rdd.map{code} I think that maybe a check should be put in to call {code:python}inspect.isgeneratorfunction{code} ? 
> pyspark mapPartitions with none generator functions / functors > -- > > Key: SPARK-15861 > URL: https://issues.apache.org/jira/browse/SPARK-15861 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Greg Bowyer >Priority: Minor > > Hi all, it appears that the method `rdd.mapPartitions` does odd things if it > is fed a normal subroutine. > For instance, lets say we have the following > {code} > rows = range(25) > rows = [rows[i:i+5] for i in range(0, len(rows), 5)] > rdd = sc.parallelize(rows) > def to_np(data): > return np.array(list(data)) > rdd.mapPartitions(to_np).collect() > ... > [array([0, 1, 2, 3, 4]), > array([5, 6, 7, 8, 9]), > array([10, 11, 12, 13, 14]), > array([15, 16, 17, 18, 19]), > array([20, 21, 22, 23, 24])] > rdd.mapPartitions(to_np, preservePartitioning=True).collect() > ... > [array([0, 1, 2, 3, 4]), > array([5, 6, 7, 8, 9]), > array([10, 11, 12, 13, 14]), > array([15, 16, 17, 18, 19]), > array([20, 21, 22, 23, 24])] > {code} > This basically makes the provided function that did return act like the end > user called {code}rdd.map{code} > I think that maybe a check should be put in to call > {code}inspect.isgeneratorfunction{code} > ?
[jira] [Created] (SPARK-15861) pyspark mapPartitions with none generator functions / functors
Greg Bowyer created SPARK-15861: --- Summary: pyspark mapPartitions with none generator functions / functors Key: SPARK-15861 URL: https://issues.apache.org/jira/browse/SPARK-15861 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.6.1 Reporter: Greg Bowyer Priority: Minor Hi all, it appears that the method `rdd.mapPartitions` does odd things if it is fed a normal subroutine. For instance, lets say we have the following {code:python} rows = range(25) rows = [rows[i:i+5] for i in range(0, len(rows), 5)] rdd = sc.parallelize(rows) def to_np(data): return np.array(list(data)) rdd.mapPartitions(to_np).collect() ... [array([0, 1, 2, 3, 4]), array([5, 6, 7, 8, 9]), array([10, 11, 12, 13, 14]), array([15, 16, 17, 18, 19]), array([20, 21, 22, 23, 24])] rdd.mapPartitions(to_np, preservePartitioning=True).collect() ... [array([0, 1, 2, 3, 4]), array([5, 6, 7, 8, 9]), array([10, 11, 12, 13, 14]), array([15, 16, 17, 18, 19]), array([20, 21, 22, 23, 24])] {code} This basically makes the provided function that did return act like the end user called {code}rdd.map{code} I think that maybe a check should be put in to call {code:python}inspect.isgeneratorfunction{code} ?
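The check suggested in SPARK-15861 relies on the standard library's `inspect.isgeneratorfunction`, which distinguishes a generator function from an ordinary one without calling it. A minimal Spark-free sketch of that guard (the helper name is hypothetical, not a proposed API):

```python
import inspect

def to_array_like(data):
    # Ordinary function: returns a single object for the whole partition,
    # the shape that triggers the surprising behavior described above.
    return list(data)

def to_rows(data):
    # Generator function: yields output records, which is what
    # mapPartitions conventionally expects.
    for x in data:
        yield x

def looks_like_partition_func(f):
    """Hypothetical guard: True when f is a generator function.
    A real check would also need to accept ordinary functions that
    return iterables, so this is only the coarse test the issue suggests."""
    return inspect.isgeneratorfunction(f)
```

Note that `inspect.isgeneratorfunction` inspects the function object itself, so a plain function returning a list (also a valid `mapPartitions` argument) would fail this coarse test; that caveat is part of why the issue phrases it as "maybe a check".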
[jira] [Commented] (SPARK-15858) "evaluateEachIteration" will fail on trying to run it on a model with 500+ trees
[ https://issues.apache.org/jira/browse/SPARK-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323797#comment-15323797 ] Apache Spark commented on SPARK-15858: -- User 'mhmoudr' has created a pull request for this issue: https://github.com/apache/spark/pull/13590 > "evaluateEachIteration" will fail on trying to run it on a model with 500+ > trees > - > > Key: SPARK-15858 > URL: https://issues.apache.org/jira/browse/SPARK-15858 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 2.0.0 >Reporter: Mahmoud Rawas > > this line: > remappedData.zip(predictionAndError).mapPartitions > causes a stack overflow exception on executors after nearly 300 iterations; > also, with this number of trees, having a var RDD incurs needless memory > allocation. > This functionality was tested on version 1.6.1
[jira] [Assigned] (SPARK-15825) sort-merge-join gives invalid results when joining on a tupled key
[ https://issues.apache.org/jira/browse/SPARK-15825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15825: Assignee: Apache Spark > sort-merge-join gives invalid results when joining on a tupled key > -- > > Key: SPARK-15825 > URL: https://issues.apache.org/jira/browse/SPARK-15825 > Project: Spark > Issue Type: Bug > Components: SQL > Environment: spark 2.0.0-SNAPSHOT >Reporter: Andres Perez >Assignee: Apache Spark > > {noformat} > import org.apache.spark.sql.functions > val left = List("0", "1", "2").toDS() > .map{ k => ((k, 0), "l") } > val right = List("0", "1", "2").toDS() > .map{ k => ((k, 0), "r") } > val result = left.toDF("k", "v").as[((String, Int), String)].alias("left") > .joinWith(right.toDF("k", "v").as[((String, Int), > String)].alias("right"), functions.col("left.k") === > functions.col("right.k"), "inner") > .as[(((String, Int), String), ((String, Int), String))] > {noformat} > When broadcast joins are enabled, we get the expected output: > {noformat} > (((0,0),l),((0,0),r)) > (((1,0),l),((1,0),r)) > (((2,0),l),((2,0),r)) > {noformat} > However, when broadcast joins are disabled (i.e. setting > spark.sql.autoBroadcastJoinThreshold to -1), the result is incorrect: > {noformat} > (((2,0),l),((2,-1),)) > (((0,0),l),((0,-313907893),)) > (((1,0),l),((null,-313907893),)) > {noformat}
[jira] [Assigned] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String with spark.memory.offHeap.enabled=true
[ https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15822: Assignee: Apache Spark > segmentation violation in o.a.s.unsafe.types.UTF8String with > spark.memory.offHeap.enabled=true > -- > > Key: SPARK-15822 > URL: https://issues.apache.org/jira/browse/SPARK-15822 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: linux amd64 > openjdk version "1.8.0_91" > OpenJDK Runtime Environment (build 1.8.0_91-b14) > OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode) >Reporter: Pete Robbins >Assignee: Apache Spark >Priority: Blocker > > Executors fail with segmentation violation while running application with > spark.memory.offHeap.enabled true > spark.memory.offHeap.size 512m > {noformat} > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400 > # > # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14) > # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 > compressed oops) > # Problematic frame: > # J 4816 C2 > org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I > (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d] > {noformat} > We initially saw this on IBM java on PowerPC box but is recreatable on linux > with OpenJDK. 
On linux with IBM Java 8 we see a null pointer exception at the > same code point: > {noformat} > 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48) > java.lang.NullPointerException > at > org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831) > at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30) > at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664) > at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37) > at > org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365) > at > org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.lang.Thread.run(Thread.java:785) > {noformat}
[jira] [Assigned] (SPARK-15825) sort-merge-join gives invalid results when joining on a tupled key
[ https://issues.apache.org/jira/browse/SPARK-15825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15825: Assignee: (was: Apache Spark) > sort-merge-join gives invalid results when joining on a tupled key > -- > > Key: SPARK-15825 > URL: https://issues.apache.org/jira/browse/SPARK-15825 > Project: Spark > Issue Type: Bug > Components: SQL > Environment: spark 2.0.0-SNAPSHOT >Reporter: Andres Perez > > {noformat} > import org.apache.spark.sql.functions > val left = List("0", "1", "2").toDS() > .map{ k => ((k, 0), "l") } > val right = List("0", "1", "2").toDS() > .map{ k => ((k, 0), "r") } > val result = left.toDF("k", "v").as[((String, Int), String)].alias("left") > .joinWith(right.toDF("k", "v").as[((String, Int), > String)].alias("right"), functions.col("left.k") === > functions.col("right.k"), "inner") > .as[(((String, Int), String), ((String, Int), String))] > {noformat} > When broadcast joins are enabled, we get the expected output: > {noformat} > (((0,0),l),((0,0),r)) > (((1,0),l),((1,0),r)) > (((2,0),l),((2,0),r)) > {noformat} > However, when broadcast joins are disabled (i.e. setting > spark.sql.autoBroadcastJoinThreshold to -1), the result is incorrect: > {noformat} > (((2,0),l),((2,-1),)) > (((0,0),l),((0,-313907893),)) > (((1,0),l),((null,-313907893),)) > {noformat}
[jira] [Commented] (SPARK-15825) sort-merge-join gives invalid results when joining on a tupled key
[ https://issues.apache.org/jira/browse/SPARK-15825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323783#comment-15323783 ] Apache Spark commented on SPARK-15825: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/13589 > sort-merge-join gives invalid results when joining on a tupled key > --
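As a point of reference (plain Scala, not Spark's join path), an in-memory inner join over the same tupled keys from the report produces exactly the pairing that the broadcast plan returns:

```scala
// Plain-Scala restatement of the inputs from the report above. This is only a
// reference for the expected result, not Spark's sort-merge-join code.
val left  = List("0", "1", "2").map(k => ((k, 0), "l"))
val right = List("0", "1", "2").map(k => ((k, 0), "r"))

// Inner join on the tupled key (k, 0): every left row pairs with its single
// counterpart, matching the broadcast-join output shown in the report.
val result = for (l <- left; r <- right if l._1 == r._1) yield (l, r)
result.foreach(println) // e.g. (((0,0),l),((0,0),r))
```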
[jira] [Assigned] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String with spark.memory.offHeap.enabled=true
[ https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15822: Assignee: (was: Apache Spark) > segmentation violation in o.a.s.unsafe.types.UTF8String with > spark.memory.offHeap.enabled=true > -- > > Key: SPARK-15822 > URL: https://issues.apache.org/jira/browse/SPARK-15822 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: linux amd64 > openjdk version "1.8.0_91" > OpenJDK Runtime Environment (build 1.8.0_91-b14) > OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode) >Reporter: Pete Robbins >Priority: Blocker > > Executors fail with segmentation violation while running application with > spark.memory.offHeap.enabled true > spark.memory.offHeap.size 512m > {noformat} > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400 > # > # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14) > # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 > compressed oops) > # Problematic frame: > # J 4816 C2 > org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I > (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d] > {noformat} > We initially saw this on IBM java on PowerPC box but is recreatable on linux > with OpenJDK. 
On linux with IBM Java 8 we see a null pointer exception at the > same code point: > {noformat} > 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48) > java.lang.NullPointerException > at > org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831) > at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30) > at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664) > at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37) > at > org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365) > at > org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.lang.Thread.run(Thread.java:785) > {noformat}
[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String with spark.memory.offHeap.enabled=true
[ https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323781#comment-15323781 ] Apache Spark commented on SPARK-15822: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/13589 > segmentation violation in o.a.s.unsafe.types.UTF8String with > spark.memory.offHeap.enabled=true > --
[jira] [Assigned] (SPARK-15858) "evaluateEachIteration" will fail on trying to run it on a model with 500+ trees
[ https://issues.apache.org/jira/browse/SPARK-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15858: Assignee: Apache Spark > "evaluateEachIteration" will fail on trying to run it on a model with 500+ > trees > - > > Key: SPARK-15858 > URL: https://issues.apache.org/jira/browse/SPARK-15858 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 2.0.0 >Reporter: Mahmoud Rawas >Assignee: Apache Spark > > This line: > remappedData.zip(predictionAndError).mapPartitions > causes a stack overflow exception on executors after roughly 300 iterations. > Also, with this number of trees, keeping the intermediate RDD in a var leads > to needless memory allocation. > This behavior was observed on version 1.6.1
[jira] [Assigned] (SPARK-15858) "evaluateEachIteration" will fail on trying to run it on a model with 500+ trees
[ https://issues.apache.org/jira/browse/SPARK-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15858: Assignee: (was: Apache Spark) > "evaluateEachIteration" will fail on trying to run it on a model with 500+ > trees > -
[jira] [Commented] (SPARK-15858) "evaluateEachIteration" will fail on trying to run it on a model with 500+ trees
[ https://issues.apache.org/jira/browse/SPARK-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323777#comment-15323777 ] Apache Spark commented on SPARK-15858: -- User 'mhmoudr' has created a pull request for this issue: https://github.com/apache/spark/pull/13588 > "evaluateEachIteration" will fail on trying to run it on a model with 500+ > trees > -
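A loose plain-Scala analogy for the failure mode (illustrative only; the MLlib fix itself operates on RDD lineage, not on functions): deriving each iteration's result by wrapping the previous, unevaluated one builds a call chain whose depth grows with the number of trees, while an eager accumulation keeps the depth constant.

```scala
// Illustrative analogy, not the MLlib code: each iteration wraps the previous
// function with andThen, so applying the result descends through one stack
// frame per iteration -- the same way a per-iteration zip grows RDD lineage.
def chainedSteps(n: Int): Int => Int =
  (1 to n).foldLeft(identity[Int] _)((f, _) => f.andThen(_ + 1))

// Eager accumulation does the same arithmetic with constant depth, analogous
// to materializing (or checkpointing) the intermediate result each iteration.
def eagerSteps(n: Int): Int = (1 to n).foldLeft(0)((acc, _) => acc + 1)
```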
[jira] [Assigned] (SPARK-15860) Metrics for codegen size and perf
[ https://issues.apache.org/jira/browse/SPARK-15860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15860: Assignee: Apache Spark > Metrics for codegen size and perf > - > > Key: SPARK-15860 > URL: https://issues.apache.org/jira/browse/SPARK-15860 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Eric Liang >Assignee: Apache Spark > > We should expose codahale metrics for the codegen source text size and how > long it takes to compile. The size is particularly interesting, since the JVM > does have hard limits on how large methods can get. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15860) Metrics for codegen size and perf
[ https://issues.apache.org/jira/browse/SPARK-15860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15860: Assignee: (was: Apache Spark) > Metrics for codegen size and perf > -
[jira] [Commented] (SPARK-15860) Metrics for codegen size and perf
[ https://issues.apache.org/jira/browse/SPARK-15860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323761#comment-15323761 ] Apache Spark commented on SPARK-15860: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/13586 > Metrics for codegen size and perf > -
[jira] [Created] (SPARK-15860) Metrics for codegen size and perf
Eric Liang created SPARK-15860: -- Summary: Metrics for codegen size and perf Key: SPARK-15860 URL: https://issues.apache.org/jira/browse/SPARK-15860 Project: Spark Issue Type: Improvement Components: SQL Reporter: Eric Liang We should expose codahale metrics for the codegen source text size and how long it takes to compile. The size is particularly interesting, since the JVM does have hard limits on how large methods can get.
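A minimal sketch of the kind of instrumentation the issue asks for. All names here are hypothetical; the real change would register Codahale/Dropwizard Histogram and Timer metrics rather than this stdlib stand-in:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for the proposed metrics: record the generated source
// text size and the wall time spent compiling it.
final class CodegenMetrics {
  val sourceSizes  = ArrayBuffer.empty[Int]  // generated source size, in chars
  val compileNanos = ArrayBuffer.empty[Long] // time spent in the compiler

  def instrument[T](source: String)(compile: String => T): T = {
    sourceSizes += source.length
    val start = System.nanoTime()
    try compile(source)
    finally compileNanos += System.nanoTime() - start
  }
}
```

Tracking the size distribution is useful because the JVM caps a single method at 64 KB of bytecode, so outliers in this metric predict compilation failures.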
[jira] [Resolved] (SPARK-15850) Remove function grouping in SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-15850. --- Resolution: Resolved > Remove function grouping in SparkSession > > > Key: SPARK-15850 > URL: https://issues.apache.org/jira/browse/SPARK-15850 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > SparkSession does not have that many functions due to better namespacing, and > as a result we probably don't need the function grouping. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15853) HDFSMetadataLog.get leaks the input stream
[ https://issues.apache.org/jira/browse/SPARK-15853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-15853. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13583 [https://github.com/apache/spark/pull/13583] > HDFSMetadataLog.get leaks the input stream > -- > > Key: SPARK-15853 > URL: https://issues.apache.org/jira/browse/SPARK-15853 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.0 > > > HDFSMetadataLog.get doesn't close the input stream. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
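The fix follows the usual resource-safety pattern (a generic sketch, not the actual HDFSMetadataLog patch): acquire the stream, run the read path against it, and close it in a finally block so it is released even when deserialization throws.

```scala
import java.io.{ByteArrayInputStream, InputStream}

// Generic sketch of the leak fix: close() is guaranteed by the finally block,
// whether the body returns normally or throws.
def withStream[T](open: => InputStream)(body: InputStream => T): T = {
  val in = open
  try body(in)
  finally in.close()
}
```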
[jira] [Commented] (SPARK-15856) Revert API breaking changes made in DataFrameReader.text and SQLContext.range
[ https://issues.apache.org/jira/browse/SPARK-15856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323715#comment-15323715 ] Reynold Xin commented on SPARK-15856: - cc [~koert] > Revert API breaking changes made in DataFrameReader.text and SQLContext.range > - > > Key: SPARK-15856 > URL: https://issues.apache.org/jira/browse/SPARK-15856 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Cheng Lian > > In Spark 2.0, after unifying Datasets and DataFrames, we made two API > breaking changes: > # {{DataFrameReader.text()}} now returns {{Dataset\[String\]}} instead of > {{DataFrame}} > # {{SQLContext.range()}} now returns {{Dataset\[java.lang.Long\]}} instead of > {{DataFrame}} > However, these two changes introduced several inconsistencies and problems: > # {{spark.read.text()}} silently discards partitioned columns when reading a > partitioned table in text format since {{Dataset\[String\]}} only contains a > single field. Users have to use {{spark.read.format("text").load()}} to > workaround this, which is pretty confusing and error-prone. > # All data source shortcut methods in `DataFrameReader` return {{DataFrame}} > (aka {{Dataset\[Row\]}}) except for {{DataFrameReader.text()}}. > # When applying typed operations over Datasets returned by {{spark.range()}}, > weird schema changes may happen. Please refer to SPARK-15632 for more details. > Due to these reasons, we decided to revert these two changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7
[ https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323709#comment-15323709 ] Marcelo Vanzin commented on SPARK-15851: It should be simple to fix it to work with whatever that shell is, right? (In case it doesn't already work.) > Spark 2.0 does not compile in Windows 7 > --- > > Key: SPARK-15851 > URL: https://issues.apache.org/jira/browse/SPARK-15851 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 > Environment: Windows 7 >Reporter: Alexander Ulanov > > Spark does not compile in Windows 7. > "mvn compile" fails on spark-core due to trying to execute a bash script > spark-build-info. > Work around: > 1)Install win-bash and put in path > 2)Change line 350 of core/pom.xml > > > > > > Error trace: > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project > spark-core_2.11: An Ant BuildException has occured: Execute failed: > java.io.IOException: Cannot run program > "C:\dev\spark\core\..\build\spark-build-info" (in directory > "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 > application > [ERROR] around Ant part ... executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in > C:\dev\spark\core\target\antrun\build-main.xml -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7
[ https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323704#comment-15323704 ] Alexander Ulanov commented on SPARK-15851: -- Sorry for confusion, I mean the shell that is "/bin/sh". Windows version of it comes with Git. > Spark 2.0 does not compile in Windows 7 > --- > > Key: SPARK-15851 > URL: https://issues.apache.org/jira/browse/SPARK-15851 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 > Environment: Windows 7 >Reporter: Alexander Ulanov > > Spark does not compile in Windows 7. > "mvn compile" fails on spark-core due to trying to execute a bash script > spark-build-info. > Work around: > 1)Install win-bash and put in path > 2)Change line 350 of core/pom.xml > > > > > > Error trace: > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project > spark-core_2.11: An Ant BuildException has occured: Execute failed: > java.io.IOException: Cannot run program > "C:\dev\spark\core\..\build\spark-build-info" (in directory > "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 > application > [ERROR] around Ant part ... executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in > C:\dev\spark\core\target\antrun\build-main.xml -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323702#comment-15323702 ] Joseph K. Bradley commented on SPARK-15581: --- Synced some in person around the summit, and posting notes here for a public record. [~mlnick] [~holdenk] [~sethah] [~yuhaoyan] [~yanboliang] [~wangmiao1981] High priority * spark.ml parity ** Multiclass logistic regression ** SVM ** Also: FPM, stats * Python & R expansion * Improving standard testing * Improving MLlib as an API/platform, not just a library of algorithms To discuss * How should we proceed with deep learning within MLlib (vs. in packages)? * Breeze dependency Other features * Imputer * Stratified sampling * Generic bagging Copy more documentation from spark.mllib user guide to spark.ml one. Items for improving MLlib development * Make roadmap JIRA more active; this needs to be updated and curated more strictly to be a more useful guide to contributors. * Be more willing to encourage developers to publish new ML algorithms as Spark packages while still discussing priority on JIRA. > MLlib 2.1 Roadmap > - > > Key: SPARK-15581 > URL: https://issues.apache.org/jira/browse/SPARK-15581 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > > This is a master list for MLlib improvements we are working on for the next > release. Please view this as a wish list rather than a definite plan, for we > don't have an accurate estimate of available resources. Due to limited review > bandwidth, features appearing on this list will get higher priority during > code review. But feel free to suggest new items to the list in comments. We > are experimenting with this process. Your feedback would be greatly > appreciated. > h1. Instructions > h2. For contributors: > * Please read > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > carefully. 
Code style, documentation, and unit tests are important. > * If you are a first-time Spark contributor, please always start with a > [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather > than a medium/big feature. Based on our experience, mixing the development > process with a big feature usually causes long delays in code review. > * Never work silently. Let everyone know on the corresponding JIRA page when > you start working on a feature. This is to avoid duplicate work. For > small features, you don't need to wait to get the JIRA assigned. > * For medium/big features or features with dependencies, please get assigned > first before coding and keep the ETA updated on the JIRA. If there is no > activity on the JIRA page for a certain amount of time, the JIRA should be > released for other contributors. > * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one > after another. > * Remember to add the `@Since("VERSION")` annotation to new public APIs. > * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code > review greatly helps to improve others' code as well as yours. > h2. For committers: > * Try to break down big features into small and specific JIRA tasks and link > them properly. > * Add a "starter" label to starter tasks. > * Put a rough estimate for medium/big features and track the progress. > * If you start reviewing a PR, please add yourself to the Shepherd field on > JIRA. > * If the code looks good to you, please comment "LGTM". For non-trivial PRs, > please ping a maintainer to make a final pass. > * After merging a PR, create and link JIRAs for Python, example code, and > documentation if applicable. > h1.
Roadmap (*WIP*) > This is NOT [a complete list of MLlib JIRAs for 2.1| > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority]. > We only include umbrella JIRAs and high-level tasks. > Major efforts in this release: > * Feature parity for the DataFrames-based API (`spark.ml`), relative to the > RDD-based API > * ML persistence > * Python API feature parity and test coverage > * R API expansion and improvements > * Note about new features: As usual, we expect to expand the feature set of > MLlib. However, we will prioritize API parity, bug fixes, and improvements > over new features. > Note `spark.mllib` is in maintenance mode now. We will accept bug fixes for > it, but new features, APIs, and improvements will only be added to `spark.ml`. > h2. Critical feature parity in DataFrame-based API
[jira] [Assigned] (SPARK-15859) Optimize the Partition Pruning with Disjunction
[ https://issues.apache.org/jira/browse/SPARK-15859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15859: Assignee: Apache Spark > Optimize the Partition Pruning with Disjunction > --- > > Key: SPARK-15859 > URL: https://issues.apache.org/jira/browse/SPARK-15859 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Cheng Hao >Assignee: Apache Spark >Priority: Critical > > Currently we cannot optimize the partition pruning in a disjunction, for > example: > {{(part1=2 and col1='abc') or (part1=5 and col1='cde')}}
[jira] [Commented] (SPARK-15859) Optimize the Partition Pruning with Disjunction
[ https://issues.apache.org/jira/browse/SPARK-15859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323683#comment-15323683 ] Apache Spark commented on SPARK-15859: -- User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/13585 > Optimize the Partition Pruning with Disjunction > ---
[jira] [Assigned] (SPARK-15859) Optimize the Partition Pruning with Disjunction
[ https://issues.apache.org/jira/browse/SPARK-15859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15859: Assignee: (was: Apache Spark) > Optimize the Partition Pruning with Disjunction > ---
[jira] [Resolved] (SPARK-15794) Should truncate toString() of very wide schemas
[ https://issues.apache.org/jira/browse/SPARK-15794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-15794. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13537 [https://github.com/apache/spark/pull/13537] > Should truncate toString() of very wide schemas > --- > > Key: SPARK-15794 > URL: https://issues.apache.org/jira/browse/SPARK-15794 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >Assignee: Eric Liang > Fix For: 2.0.0 > > > With very wide tables, e.g. thousands of fields, the output is unreadable and > often causes OOMs due to inefficient string processing.
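The idea behind the fix can be sketched like this (an illustrative helper, not Spark's exact implementation): render only the first N fields and summarize the rest, so the string stays bounded no matter how wide the schema is.

```scala
// Illustrative truncation helper: cap the rendered fields of a wide
// schema-like structure instead of concatenating thousands of entries.
def truncatedString(fields: Seq[String], maxFields: Int): String =
  if (fields.length <= maxFields) fields.mkString("struct<", ",", ">")
  else (fields.take(maxFields) :+ s"... ${fields.length - maxFields} more fields")
    .mkString("struct<", ",", ">")
```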
[jira] [Created] (SPARK-15859) Optimize the Partition Pruning with Disjunction
Cheng Hao created SPARK-15859: - Summary: Optimize the Partition Pruning with Disjunction Key: SPARK-15859 URL: https://issues.apache.org/jira/browse/SPARK-15859 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Critical Currently we cannot optimize the partition pruning in a disjunction, for example: {{(part1=2 and col1='abc') or (part1=5 and col1='cde')}}
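The optimization can be illustrated with a toy expression tree (illustrative types, not Catalyst's): a disjunction is prunable only when every branch contributes a partition-column predicate, in which case the pruning condition is the OR of those per-branch predicates; for the example above, (part1=2) or (part1=5).

```scala
// Toy expression ADT (illustrative, not Catalyst). partitionPredicate extracts
// the part of a predicate that references only partition columns; for an OR,
// both sides must yield one, otherwise no pruning is safe.
sealed trait Expr
case class Eq(col: String, value: String) extends Expr
case class And(left: Expr, right: Expr) extends Expr
case class Or(left: Expr, right: Expr) extends Expr

def partitionPredicate(e: Expr, partCols: Set[String]): Option[Expr] = e match {
  case Eq(c, _) => if (partCols(c)) Some(e) else None
  case And(l, r) =>
    (partitionPredicate(l, partCols), partitionPredicate(r, partCols)) match {
      case (Some(a), Some(b)) => Some(And(a, b))
      case (a, b)             => a.orElse(b) // AND may drop non-partition conjuncts
    }
  case Or(l, r) =>
    for {
      a <- partitionPredicate(l, partCols) // every disjunct must be prunable
      b <- partitionPredicate(r, partCols)
    } yield Or(a, b)
}
```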
[jira] [Commented] (SPARK-15855) dataframe.R example fails with "java.io.IOException: No input paths specified in job"
[ https://issues.apache.org/jira/browse/SPARK-15855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323667#comment-15323667 ] Shivaram Venkataraman commented on SPARK-15855: --- For the example to work in a distributed setup the input file needs to be in HDFS or in some other distributed storage system. The example is designed to work out of the box on a single machine. > dataframe.R example fails with "java.io.IOException: No input paths specified > in job" > - > > Key: SPARK-15855 > URL: https://issues.apache.org/jira/browse/SPARK-15855 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 1.6.1 >Reporter: Yesha Vora > > Steps: > * Install R on all nodes > * Run dataframe.R example. > The example fails in yarn-client and yarn-cluster mode both with below > mentioned error message. > This application fails to find people.json correctly. {{path <- > file.path(Sys.getenv("SPARK_HOME"), > "examples/src/main/resources/people.json")}} > {code} > [xxx@xxx qa]$ sparkR --master yarn-client examples/src/main/r/dataframe.R > Loading required package: methods > Attaching package: ‘SparkR’ > The following objects are masked from ‘package:stats’: > cov, filter, lag, na.omit, predict, sd, var > The following objects are masked from ‘package:base’: > colnames, colnames<-, intersect, rank, rbind, sample, subset, > summary, table, transform > 16/05/24 22:08:21 INFO SparkContext: Running Spark version 1.6.1 > 16/05/24 22:08:21 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... 
using builtin-java classes where applicable > 16/05/24 22:08:22 INFO SecurityManager: Changing view acls to: hrt_qa > 16/05/24 22:08:22 INFO SecurityManager: Changing modify acls to: hrt_qa > 16/05/24 22:08:22 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(hrt_qa); users > with modify permissions: Set(hrt_qa) > 16/05/24 22:08:22 INFO Utils: Successfully started service 'sparkDriver' on > port 35792. > 16/05/24 22:08:23 INFO Slf4jLogger: Slf4jLogger started > 16/05/24 22:08:23 INFO Remoting: Starting remoting > 16/05/24 22:08:23 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://sparkdriveractorsys...@xx.xx.xx.xxx:49771] > 16/05/24 22:08:23 INFO Utils: Successfully started service > 'sparkDriverActorSystem' on port 49771. > 16/05/24 22:08:23 INFO SparkEnv: Registering MapOutputTracker > 16/05/24 22:08:23 INFO SparkEnv: Registering BlockManagerMaster > 16/05/24 22:08:23 INFO DiskBlockManager: Created local directory at > /tmp/blockmgr-ffed73ad-3e67-4ae5-8734-9338136d3721 > 16/05/24 22:08:23 INFO MemoryStore: MemoryStore started with capacity 511.1 MB > 16/05/24 22:08:24 INFO SparkEnv: Registering OutputCommitCoordinator > 16/05/24 22:08:24 INFO Server: jetty-8.y.z-SNAPSHOT > 16/05/24 22:08:24 INFO AbstractConnector: Started > SelectChannelConnector@0.0.0.0:4040 > 16/05/24 22:08:24 INFO Utils: Successfully started service 'SparkUI' on port > 4040. > 16/05/24 22:08:24 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at > http://xx.xx.xx.xxx:4040 > spark.yarn.driver.memoryOverhead is set but does not apply in client mode. 
> 16/05/24 22:08:25 INFO Client: Requesting a new application from cluster with > 6 NodeManagers > 16/05/24 22:08:25 INFO Client: Verifying our application has not requested > more than the maximum memory capability of the cluster (10240 MB per > container) > 16/05/24 22:08:25 INFO Client: Will allocate AM container, with 896 MB memory > including 384 MB overhead > 16/05/24 22:08:25 INFO Client: Setting up container launch context for our AM > 16/05/24 22:08:25 INFO Client: Setting up the launch environment for our AM > container > 16/05/24 22:08:26 WARN DomainSocketFactory: The short-circuit local reads > feature cannot be used because libhadoop cannot be loaded. > 16/05/24 22:08:26 INFO Client: Using the spark assembly jar on HDFS because > you are using HDP, > defaultSparkAssembly:hdfs://mycluster/hdp/apps/2.5.0.0-427/spark/spark-hdp-assembly.jar > 16/05/24 22:08:26 INFO Client: Preparing resources for our AM container > 16/05/24 22:08:26 INFO YarnSparkHadoopUtil: getting token for namenode: > hdfs://mycluster/user/hrt_qa/.sparkStaging/application_1463956206030_0003 > 16/05/24 22:08:26 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 187 for > hrt_qa on ha-hdfs:mycluster > 16/05/24 22:08:28 INFO metastore: Trying to connect to metastore with URI > thrift://xxx:9083 > 16/05/24 22:08:28 INFO metastore: Connected to metastore. > 16/05/24 22:08:28 INFO YarnSparkHadoopUtil: HBase class not found > java.lang.ClassNotFoundException: org.apache.hadoop.hbase.HBaseConfiguration > 16/05/24 22:08:28 INFO Client: Using
[jira] [Commented] (SPARK-15509) R MLlib algorithms should support input columns "features" and "label"
[ https://issues.apache.org/jira/browse/SPARK-15509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323666#comment-15323666 ] Apache Spark commented on SPARK-15509: -- User 'keypointt' has created a pull request for this issue: https://github.com/apache/spark/pull/13584 > R MLlib algorithms should support input columns "features" and "label" > -- > > Key: SPARK-15509 > URL: https://issues.apache.org/jira/browse/SPARK-15509 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Joseph K. Bradley > > Currently in SparkR, when you load a LibSVM dataset using the sqlContext and > then pass it to an MLlib algorithm, the ML wrappers will fail since they will > try to create a "features" column, which conflicts with the existing > "features" column from the LibSVM loader. E.g., using the "mnist" dataset > from LibSVM: > {code} > training <- loadDF(sqlContext, ".../mnist", "libsvm") > model <- naiveBayes(label ~ features, training) > {code} > This fails with: > {code} > 16/05/24 11:52:41 ERROR RBackendHandler: fit on > org.apache.spark.ml.r.NaiveBayesWrapper failed > Error in invokeJava(isStatic = TRUE, className, methodName, ...) : > java.lang.IllegalArgumentException: Output column features already exists. 
> at > org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:120) > at > org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179) > at > org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179) > at > scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) > at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:179) > at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67) > at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:131) > at org.apache.spark.ml.feature.RFormula.fit(RFormula.scala:169) > at > org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:62) > at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.sca > {code} > The same issue appears for the "label" column once you rename the "features" > column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
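One way to sidestep this class of failure (illustrative only, not necessarily the fix Spark adopted) is to pick a non-conflicting output column name when the default already exists in the input schema, e.g. by appending a numeric suffix. A hedged sketch in plain Python, with a hypothetical `fresh_column_name` helper:

```python
def fresh_column_name(base, existing):
    """Return `base` if it is unused in `existing`, else `base_1`,
    `base_2`, ... -- a hypothetical strategy for avoiding
    'Output column features already exists' style errors."""
    if base not in existing:
        return base
    i = 1
    while f"{base}_{i}" in existing:
        i += 1
    return f"{base}_{i}"

# The LibSVM loader already produced "features" and "label" columns:
schema = ["label", "features"]
print(fresh_column_name("features", schema))  # features_1
print(fresh_column_name("weight", schema))    # weight
```

The wrapper would then assemble into the fresh column and keep the loader's original "features"/"label" columns intact.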
[jira] [Assigned] (SPARK-15509) R MLlib algorithms should support input columns "features" and "label"
[ https://issues.apache.org/jira/browse/SPARK-15509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15509: Assignee: Apache Spark > R MLlib algorithms should support input columns "features" and "label" > -- > > Key: SPARK-15509 > URL: https://issues.apache.org/jira/browse/SPARK-15509 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Joseph K. Bradley >Assignee: Apache Spark > > Currently in SparkR, when you load a LibSVM dataset using the sqlContext and > then pass it to an MLlib algorithm, the ML wrappers will fail since they will > try to create a "features" column, which conflicts with the existing > "features" column from the LibSVM loader. E.g., using the "mnist" dataset > from LibSVM: > {code} > training <- loadDF(sqlContext, ".../mnist", "libsvm") > model <- naiveBayes(label ~ features, training) > {code} > This fails with: > {code} > 16/05/24 11:52:41 ERROR RBackendHandler: fit on > org.apache.spark.ml.r.NaiveBayesWrapper failed > Error in invokeJava(isStatic = TRUE, className, methodName, ...) : > java.lang.IllegalArgumentException: Output column features already exists. 
> at > org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:120) > at > org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179) > at > org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179) > at > scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) > at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:179) > at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67) > at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:131) > at org.apache.spark.ml.feature.RFormula.fit(RFormula.scala:169) > at > org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:62) > at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.sca > {code} > The same issue appears for the "label" column once you rename the "features" > column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15509) R MLlib algorithms should support input columns "features" and "label"
[ https://issues.apache.org/jira/browse/SPARK-15509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15509: Assignee: (was: Apache Spark) > R MLlib algorithms should support input columns "features" and "label" > -- > > Key: SPARK-15509 > URL: https://issues.apache.org/jira/browse/SPARK-15509 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Joseph K. Bradley > > Currently in SparkR, when you load a LibSVM dataset using the sqlContext and > then pass it to an MLlib algorithm, the ML wrappers will fail since they will > try to create a "features" column, which conflicts with the existing > "features" column from the LibSVM loader. E.g., using the "mnist" dataset > from LibSVM: > {code} > training <- loadDF(sqlContext, ".../mnist", "libsvm") > model <- naiveBayes(label ~ features, training) > {code} > This fails with: > {code} > 16/05/24 11:52:41 ERROR RBackendHandler: fit on > org.apache.spark.ml.r.NaiveBayesWrapper failed > Error in invokeJava(isStatic = TRUE, className, methodName, ...) : > java.lang.IllegalArgumentException: Output column features already exists. 
> at > org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:120) > at > org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179) > at > org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179) > at > scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) > at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:179) > at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67) > at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:131) > at org.apache.spark.ml.feature.RFormula.fit(RFormula.scala:169) > at > org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:62) > at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.sca > {code} > The same issue appears for the "label" column once you rename the "features" > column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15858) "evaluateEachIteration" will fail on trying to run it on a model with 500+ trees
[ https://issues.apache.org/jira/browse/SPARK-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323662#comment-15323662 ] Mahmoud Rawas commented on SPARK-15858: --- I am working on a solution. > "evaluateEachIteration" will fail on trying to run it on a model with 500+ > trees > - > > Key: SPARK-15858 > URL: https://issues.apache.org/jira/browse/SPARK-15858 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 2.0.0 >Reporter: Mahmoud Rawas > > This line: > {{remappedData.zip(predictionAndError).mapPartitions}} > causes a stack overflow exception on executors after nearly 300 iterations. > Also, with this number of trees, keeping a var RDD incurs needless memory > allocation. > This functionality was tested on version 1.6.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15841) [SPARK REPL] REPLSuite has incorrect env set for a couple of tests.
[ https://issues.apache.org/jira/browse/SPARK-15841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-15841. -- Resolution: Fixed Assignee: Prashant Sharma Fix Version/s: 2.0.0 > [SPARK REPL] REPLSuite has incorrect env set for a couple of tests. > --- > > Key: SPARK-15841 > URL: https://issues.apache.org/jira/browse/SPARK-15841 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Reporter: Prashant Sharma >Assignee: Prashant Sharma > Fix For: 2.0.0 > > > In ReplSuite, a test that can be verified well in local mode should not > have to start a local-cluster. Similarly, a test is insufficiently > exercised if it runs only locally while actually fixing a problem related to a > distributed run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15858) "evaluateEachIteration" will fail on trying to run it on a model with 500+ trees
Mahmoud Rawas created SPARK-15858: - Summary: "evaluateEachIteration" will fail on trying to run it on a model with 500+ trees Key: SPARK-15858 URL: https://issues.apache.org/jira/browse/SPARK-15858 Project: Spark Issue Type: Bug Affects Versions: 1.6.1, 2.0.0 Reporter: Mahmoud Rawas This line: {{remappedData.zip(predictionAndError).mapPartitions}} causes a stack overflow exception on executors after nearly 300 iterations. Also, with this number of trees, keeping a var RDD incurs needless memory allocation. This functionality was tested on version 1.6.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
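The underlying failure mode is the deeply nested lineage built by re-zipping the data with the running prediction once per tree; after hundreds of iterations the chained closures blow the stack. The remedy, stated independently of Spark, is to carry the running prediction forward iteratively instead of composing one transformation per tree. A minimal sketch over plain Python lists (the per-tree prediction increments are a hypothetical stand-in for the model's trees):

```python
# Each "tree" contributes an increment to the running prediction. The error
# after k trees is computed from a single running sum per example, rather
# than by nesting one zip/map per iteration (the source of the overflow).

def evaluate_each_iteration(labels, per_tree_predictions):
    """Return the mean squared error after each boosting iteration."""
    running = [0.0] * len(labels)
    errors = []
    for tree_preds in per_tree_predictions:
        running = [r + p for r, p in zip(running, tree_preds)]
        mse = sum((y - p) ** 2 for y, p in zip(labels, running)) / len(labels)
        errors.append(mse)
    return errors

labels = [1.0, 2.0]
trees = [[0.5, 1.0], [0.5, 1.0]]  # two iterations of prediction increments
print(evaluate_each_iteration(labels, trees))  # [0.625, 0.0]
```

In an RDD setting the same idea means updating a single running-prediction dataset per iteration (checkpointing as needed) rather than accumulating a 500-deep transformation chain.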
[jira] [Updated] (SPARK-15856) Revert API breaking changes made in DataFrameReader.text and SQLContext.range
[ https://issues.apache.org/jira/browse/SPARK-15856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-15856: --- Description: In Spark 2.0, after unifying Datasets and DataFrames, we made two API breaking changes: # {{DataFrameReader.text()}} now returns {{Dataset\[String\]}} instead of {{DataFrame}} # {{SQLContext.range()}} now returns {{Dataset\[java.lang.Long\]}} instead of {{DataFrame}} However, these two changes introduced several inconsistencies and problems: # {{spark.read.text()}} silently discards partitioned columns when reading a partitioned table in text format since {{Dataset\[String\]}} only contains a single field. Users have to use {{spark.read.format("text").load()}} to workaround this, which is pretty confusing and error-prone. # All data source shortcut methods in `DataFrameReader` return {{DataFrame}} (aka {{Dataset\[Row\]}}) except for {{DataFrameReader.text()}}. # When applying typed operations over Datasets returned by {{spark.range()}}, weird schema changes may happen. Please refer to SPARK-15632 for more details. Due to these reasons, we decided to revert these two changes. was: In Spark 2.0, after unifying Datasets and DataFrames, we made two API breaking changes: # {{DataFrameReader.text()}} now returns {{Dataset\[String\]}} instead of {{DataFrame}} # {{SQLContext.range()}} now returns {{Dataset\[java.lang.Long\]}} instead of {{DataFrame}} However, these two changes introduced several inconsistencies and problems: # {{spark.read.text()}} silently discards partitioned columns when reading a partitioned table in text format since {{Dataset\[String\]}} only contains a single field. Users have to use {{spark.read.format("text").load()}} to workaround this, which is pretty confusing and error-prone. # All data source shortcut methods in `DataFrameReader` returns a {{DataFrame}} (aka {{Dataset\[Row\]}} except for {{DataFrameReader.text()}}. 
# When applying typed operations over Datasets returned by {{spark.range()}}, weird schema changes may happen. Please refer to SPARK-15632 for more details. Due to these reasons, we decided to revert these two changes. > Revert API breaking changes made in DataFrameReader.text and SQLContext.range > - > > Key: SPARK-15856 > URL: https://issues.apache.org/jira/browse/SPARK-15856 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Cheng Lian > > In Spark 2.0, after unifying Datasets and DataFrames, we made two API > breaking changes: > # {{DataFrameReader.text()}} now returns {{Dataset\[String\]}} instead of > {{DataFrame}} > # {{SQLContext.range()}} now returns {{Dataset\[java.lang.Long\]}} instead of > {{DataFrame}} > However, these two changes introduced several inconsistencies and problems: > # {{spark.read.text()}} silently discards partitioned columns when reading a > partitioned table in text format since {{Dataset\[String\]}} only contains a > single field. Users have to use {{spark.read.format("text").load()}} to > workaround this, which is pretty confusing and error-prone. > # All data source shortcut methods in `DataFrameReader` return {{DataFrame}} > (aka {{Dataset\[Row\]}}) except for {{DataFrameReader.text()}}. > # When applying typed operations over Datasets returned by {{spark.range()}}, > weird schema changes may happen. Please refer to SPARK-15632 for more details. > Due to these reasons, we decided to revert these two changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15857) Add Caller Context in Spark
[ https://issues.apache.org/jira/browse/SPARK-15857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323655#comment-15323655 ] Weiqing Yang commented on SPARK-15857: -- I will attach the design doc soon. > Add Caller Context in Spark > --- > > Key: SPARK-15857 > URL: https://issues.apache.org/jira/browse/SPARK-15857 > Project: Spark > Issue Type: New Feature >Reporter: Weiqing Yang > > Hadoop has implemented a feature of log tracing – caller context (Jira: > HDFS-9184 and YARN-4349). The motivation is to better diagnose and understand > how specific applications impact parts of the Hadoop system and the potential > problems they may be creating (e.g. overloading the NN). As HDFS mentioned in > HDFS-9184, for a given HDFS operation, it's very helpful to track which upper > level job issues it. The upper level callers may be specific Oozie tasks, MR > jobs, Hive queries, or Spark jobs. > Hadoop ecosystem projects like MapReduce, Tez (TEZ-2851), Hive (HIVE-12249, > HIVE-12254) and Pig (PIG-4714) have implemented their own caller contexts. Those > systems invoke the HDFS client API and Yarn client API to set up a caller > context, and also expose an API for passing a caller context in. > Lots of Spark applications run on Yarn/HDFS. Spark can also implement > its caller context by invoking the HDFS/Yarn API, and also expose an API to its > upstream applications to set up their caller contexts. In the end, the Spark > caller context written into the Yarn log / HDFS log can be associated with the > task id, stage id, job id and app id. That is also very helpful for Spark users > to identify tasks, especially if Spark supports a multi-tenant environment in > the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15856) Revert API breaking changes made in DataFrameReader.text and SQLContext.range
[ https://issues.apache.org/jira/browse/SPARK-15856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-15856: --- Description: In Spark 2.0, after unifying Datasets and DataFrames, we made two API breaking changes: # {{DataFrameReader.text()}} now returns {{Dataset\[String\]}} instead of {{DataFrame}} # {{SQLContext.range()}} now returns {{Dataset\[java.lang.Long\]}} instead of {{DataFrame}} However, these two changes introduced several inconsistencies and problems: # {{spark.read.text()}} silently discards partitioned columns when reading a partitioned table in text format since {{Dataset\[String\]}} only contains a single field. Users have to use {{spark.read.format("text").load()}} to workaround this, which is pretty confusing and error-prone. # All data source shortcut methods in `DataFrameReader` returns a {{DataFrame}} (aka {{Dataset\[Row\]}} except for {{DataFrameReader.text()}}. # When applying typed operations over Datasets returned by {{spark.range()}}, weird schema changes may happen. Please refer to SPARK-15632 for more details. Due to these reasons, we decided to revert these two changes. > Revert API breaking changes made in DataFrameReader.text and SQLContext.range > - > > Key: SPARK-15856 > URL: https://issues.apache.org/jira/browse/SPARK-15856 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Cheng Lian > > In Spark 2.0, after unifying Datasets and DataFrames, we made two API > breaking changes: > # {{DataFrameReader.text()}} now returns {{Dataset\[String\]}} instead of > {{DataFrame}} > # {{SQLContext.range()}} now returns {{Dataset\[java.lang.Long\]}} instead of > {{DataFrame}} > However, these two changes introduced several inconsistencies and problems: > # {{spark.read.text()}} silently discards partitioned columns when reading a > partitioned table in text format since {{Dataset\[String\]}} only contains a > single field. 
Users have to use {{spark.read.format("text").load()}} to > workaround this, which is pretty confusing and error-prone. > # All data source shortcut methods in `DataFrameReader` returns a > {{DataFrame}} (aka {{Dataset\[Row\]}} except for {{DataFrameReader.text()}}. > # When applying typed operations over Datasets returned by {{spark.range()}}, > weird schema changes may happen. Please refer to SPARK-15632 for more details. > Due to these reasons, we decided to revert these two changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15857) Add Caller Context in Spark
Weiqing Yang created SPARK-15857: Summary: Add Caller Context in Spark Key: SPARK-15857 URL: https://issues.apache.org/jira/browse/SPARK-15857 Project: Spark Issue Type: New Feature Reporter: Weiqing Yang Hadoop has implemented a feature of log tracing – caller context (Jira: HDFS-9184 and YARN-4349). The motivation is to better diagnose and understand how specific applications impact parts of the Hadoop system and the potential problems they may be creating (e.g. overloading the NN). As HDFS mentioned in HDFS-9184, for a given HDFS operation, it's very helpful to track which upper level job issues it. The upper level callers may be specific Oozie tasks, MR jobs, Hive queries, or Spark jobs. Hadoop ecosystem projects like MapReduce, Tez (TEZ-2851), Hive (HIVE-12249, HIVE-12254) and Pig (PIG-4714) have implemented their own caller contexts. Those systems invoke the HDFS client API and Yarn client API to set up a caller context, and also expose an API for passing a caller context in. Lots of Spark applications run on Yarn/HDFS. Spark can also implement its caller context by invoking the HDFS/Yarn API, and also expose an API to its upstream applications to set up their caller contexts. In the end, the Spark caller context written into the Yarn log / HDFS log can be associated with the task id, stage id, job id and app id. That is also very helpful for Spark users to identify tasks, especially if Spark supports a multi-tenant environment in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
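The shape such a caller context might take can be sketched concretely: a builder concatenates the application/job/stage/task identifiers and truncates the result, since HDFS limits the caller context to a configured byte length. A hedged Python sketch (the field layout and the 128-byte limit are assumptions for illustration, not Spark's eventual format):

```python
def build_caller_context(app_id, job_id=None, stage_id=None, task_id=None,
                         max_bytes=128):
    """Assemble a caller-context string such as 'SPARK_<app>_JId_<n>_...',
    truncated to max_bytes. HDFS enforces a byte limit on caller contexts;
    128 here is an assumed default, and the field labels are hypothetical."""
    parts = ["SPARK", app_id]
    for label, value in (("JId", job_id), ("SId", stage_id), ("TId", task_id)):
        if value is not None:
            parts.append(f"{label}_{value}")
    context = "_".join(parts)
    # Truncate on byte length, dropping any partial trailing UTF-8 sequence.
    return context.encode("utf-8")[:max_bytes].decode("utf-8", "ignore")

print(build_caller_context("application_1463956206030_0003",
                           job_id=3, stage_id=1, task_id=42))
# SPARK_application_1463956206030_0003_JId_3_SId_1_TId_42
```

The string would then be handed to the HDFS/YARN client APIs, so that NameNode audit logs and YARN logs can be correlated back to the Spark task that issued each operation.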
[jira] [Resolved] (SPARK-12447) Only update AM's internal state when executor is successfully launched by NM
[ https://issues.apache.org/jira/browse/SPARK-12447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-12447. Resolution: Fixed Assignee: Saisai Shao (was: Apache Spark) Fix Version/s: 2.0.0 > Only update AM's internal state when executor is successfully launched by NM > > > Key: SPARK-12447 > URL: https://issues.apache.org/jira/browse/SPARK-12447 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.0 >Reporter: Saisai Shao >Assignee: Saisai Shao > Fix For: 2.0.0 > > > Currently {{YarnAllocator}} updates its managed state like > {{numExecutorsRunning}} after a container is allocated but before the executor > is successfully launched. > This happens when the Spark configuration is wrong (e.g. the spark_shuffle > aux-service is occasionally not configured on the NM), which makes the executor > fail to launch, or when the NM is lost while the NMClient is communicating with > it. > In the current implementation, the state is updated even when the executor > fails to launch, which leads to incorrect AM state. Also, the lingering > container is only released after a timeout, which wastes resources. > So we should update the state only after the executor is successfully > launched; otherwise we should release the container ASAP to fail fast and > retry. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
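The ordering this issue asks for can be sketched as: increment the running-executor count only once the NM has accepted the launch, and release the container immediately on failure so YARN can retry quickly. A simplified Python model (class and method names are illustrative, not YarnAllocator's real API):

```python
class AllocatorSketch:
    """Illustrative model of the fix: internal state changes only after a
    successful launch; a failed launch releases the container immediately
    instead of waiting for a timeout."""

    def __init__(self, nm_launch):
        self.nm_launch = nm_launch      # callable that raises on NM failure
        self.num_executors_running = 0
        self.released = []

    def handle_allocated(self, container_id):
        try:
            self.nm_launch(container_id)
        except Exception:
            # Fail fast: hand the container back rather than letting it
            # linger, and leave the running count untouched.
            self.released.append(container_id)
            return False
        self.num_executors_running += 1  # only after a confirmed launch
        return True
```

The buggy behavior corresponds to incrementing `num_executors_running` before calling `nm_launch`, which over-counts executors whenever the launch fails.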
[jira] [Created] (SPARK-15856) Revert API breaking changes made in DataFrameReader.text and SQLContext.range
Cheng Lian created SPARK-15856: -- Summary: Revert API breaking changes made in DataFrameReader.text and SQLContext.range Key: SPARK-15856 URL: https://issues.apache.org/jira/browse/SPARK-15856 Project: Spark Issue Type: Sub-task Reporter: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String with spark.memory.offHeap.enabled=true
[ https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-15822: -- Priority: Blocker (was: Critical) > segmentation violation in o.a.s.unsafe.types.UTF8String with > spark.memory.offHeap.enabled=true > -- > > Key: SPARK-15822 > URL: https://issues.apache.org/jira/browse/SPARK-15822 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: linux amd64 > openjdk version "1.8.0_91" > OpenJDK Runtime Environment (build 1.8.0_91-b14) > OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode) >Reporter: Pete Robbins >Priority: Blocker > > Executors fail with a segmentation violation while running an application with > spark.memory.offHeap.enabled true > spark.memory.offHeap.size 512m > {noformat} > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400 > # > # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14) > # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 > compressed oops) > # Problematic frame: > # J 4816 C2 > org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I > (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d] > {noformat} > We initially saw this with IBM Java on a PowerPC box, but it is reproducible > on Linux with OpenJDK. 
On linux with IBM Java 8 we see a null pointer exception at the > same code point: > {noformat} > 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48) > java.lang.NullPointerException > at > org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831) > at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30) > at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664) > at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37) > at > org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365) > at > org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.lang.Thread.run(Thread.java:785) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String with spark.memory.offHeap.enabled=true
[ https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-15822: -- Description: Executors fail with segmentation violation while running application with spark.memory.offHeap.enabled true spark.memory.offHeap.size 512m {noformat} # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400 # # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14) # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 compressed oops) # Problematic frame: # J 4816 C2 org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d] {noformat} We initially saw this on IBM java on PowerPC box but is recreatable on linux with OpenJDK. On linux with IBM Java 8 we see a null pointer exception at the same code point: {noformat} 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48) java.lang.NullPointerException at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831) at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30) at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664) at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37) at 
org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365) at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.lang.Thread.run(Thread.java:785) {noformat} was: Executors fail with segmentation violation while running application with spark.memory.offHeap.enabled true spark.memory.offHeap.size 512m # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400 # # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14) # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 compressed oops) # Problematic frame: # J 4816 C2 org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d] We initially saw this on IBM java on PowerPC box but is recreatable on linux with OpenJDK. 
On linux with IBM Java 8 we see a null pointer exception at the same code point: 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48) java.lang.NullPointerException at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831) at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30) at
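For context on the problematic frame above: UTF8String.compareTo is essentially an unsigned, lexicographic comparison of the strings' backing bytes, and with off-heap storage those bytes live outside the JVM heap, which is why a bad address surfaces as a SIGSEGV (or an NPE on IBM Java) instead of a managed exception. A minimal sketch of that comparison logic, with hypothetical names and plain byte arrays rather than Spark's actual memory-block code:

```java
// Illustrative sketch (not Spark's actual UTF8String code): an unsigned,
// lexicographic byte comparison. In Spark the bytes may be off-heap, so a
// stale or freed base address turns this loop into a segfault.
class Utf8CompareSketch {
    // Compare two byte arrays as unsigned values; a shorter prefix sorts first.
    static int compareBytes(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            // Mask with 0xFF so bytes >= 0x80 compare as unsigned values.
            int res = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (res != 0) return res;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        System.out.println(compareBytes("abc".getBytes(), "abd".getBytes())); // negative
    }
}
```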
[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7
[ https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323644#comment-15323644 ] Marcelo Vanzin commented on SPARK-15851: bq. "spark-build-info" can be rewritten as a shell script. Not sure what you mean? It is a shell script. Assuming you mean a second script that is a Windows batch script, not sure I'm a big fan of the idea. It's small enough that it shouldn't matter much, though. > Spark 2.0 does not compile in Windows 7 > --- > > Key: SPARK-15851 > URL: https://issues.apache.org/jira/browse/SPARK-15851 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 > Environment: Windows 7 >Reporter: Alexander Ulanov > > Spark does not compile in Windows 7. > "mvn compile" fails on spark-core due to trying to execute a bash script > spark-build-info. > Work around: > 1)Install win-bash and put in path > 2)Change line 350 of core/pom.xml > > > > > > Error trace: > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project > spark-core_2.11: An Ant BuildException has occured: Execute failed: > java.io.IOException: Cannot run program > "C:\dev\spark\core\..\build\spark-build-info" (in directory > "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 > application > [ERROR] around Ant part ... executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in > C:\dev\spark\core\target\antrun\build-main.xml -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15855) dataframe.R example fails with "java.io.IOException: No input paths specified in job"
Yesha Vora created SPARK-15855: -- Summary: dataframe.R example fails with "java.io.IOException: No input paths specified in job" Key: SPARK-15855 URL: https://issues.apache.org/jira/browse/SPARK-15855 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.6.1 Reporter: Yesha Vora Steps: * Install R on all nodes * Run dataframe.R example. The example fails in yarn-client and yarn-cluster mode both with below mentioned error message. This application fails to find people.json correctly. {{path <- file.path(Sys.getenv("SPARK_HOME"), "examples/src/main/resources/people.json")}} {code} [xxx@xxx qa]$ sparkR --master yarn-client examples/src/main/r/dataframe.R Loading required package: methods Attaching package: ‘SparkR’ The following objects are masked from ‘package:stats’: cov, filter, lag, na.omit, predict, sd, var The following objects are masked from ‘package:base’: colnames, colnames<-, intersect, rank, rbind, sample, subset, summary, table, transform 16/05/24 22:08:21 INFO SparkContext: Running Spark version 1.6.1 16/05/24 22:08:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 16/05/24 22:08:22 INFO SecurityManager: Changing view acls to: hrt_qa 16/05/24 22:08:22 INFO SecurityManager: Changing modify acls to: hrt_qa 16/05/24 22:08:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hrt_qa); users with modify permissions: Set(hrt_qa) 16/05/24 22:08:22 INFO Utils: Successfully started service 'sparkDriver' on port 35792. 16/05/24 22:08:23 INFO Slf4jLogger: Slf4jLogger started 16/05/24 22:08:23 INFO Remoting: Starting remoting 16/05/24 22:08:23 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkdriveractorsys...@xx.xx.xx.xxx:49771] 16/05/24 22:08:23 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 49771. 
16/05/24 22:08:23 INFO SparkEnv: Registering MapOutputTracker 16/05/24 22:08:23 INFO SparkEnv: Registering BlockManagerMaster 16/05/24 22:08:23 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-ffed73ad-3e67-4ae5-8734-9338136d3721 16/05/24 22:08:23 INFO MemoryStore: MemoryStore started with capacity 511.1 MB 16/05/24 22:08:24 INFO SparkEnv: Registering OutputCommitCoordinator 16/05/24 22:08:24 INFO Server: jetty-8.y.z-SNAPSHOT 16/05/24 22:08:24 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040 16/05/24 22:08:24 INFO Utils: Successfully started service 'SparkUI' on port 4040. 16/05/24 22:08:24 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://xx.xx.xx.xxx:4040 spark.yarn.driver.memoryOverhead is set but does not apply in client mode. 16/05/24 22:08:25 INFO Client: Requesting a new application from cluster with 6 NodeManagers 16/05/24 22:08:25 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (10240 MB per container) 16/05/24 22:08:25 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead 16/05/24 22:08:25 INFO Client: Setting up container launch context for our AM 16/05/24 22:08:25 INFO Client: Setting up the launch environment for our AM container 16/05/24 22:08:26 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. 
16/05/24 22:08:26 INFO Client: Using the spark assembly jar on HDFS because you are using HDP, defaultSparkAssembly:hdfs://mycluster/hdp/apps/2.5.0.0-427/spark/spark-hdp-assembly.jar 16/05/24 22:08:26 INFO Client: Preparing resources for our AM container 16/05/24 22:08:26 INFO YarnSparkHadoopUtil: getting token for namenode: hdfs://mycluster/user/hrt_qa/.sparkStaging/application_1463956206030_0003 16/05/24 22:08:26 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 187 for hrt_qa on ha-hdfs:mycluster 16/05/24 22:08:28 INFO metastore: Trying to connect to metastore with URI thrift://xxx:9083 16/05/24 22:08:28 INFO metastore: Connected to metastore. 16/05/24 22:08:28 INFO YarnSparkHadoopUtil: HBase class not found java.lang.ClassNotFoundException: org.apache.hadoop.hbase.HBaseConfiguration 16/05/24 22:08:28 INFO Client: Using the spark assembly jar on HDFS because you are using HDP, defaultSparkAssembly:hdfs://mycluster/hdp/apps/2.5.0.0-427/spark/spark-hdp-assembly.jar 16/05/24 22:08:28 INFO Client: Source and destination file systems are the same. Not copying hdfs://mycluster/hdp/apps/2.5.0.0-427/spark/spark-hdp-assembly.jar 16/05/24 22:08:29 INFO Client: Uploading resource file:/usr/hdp/current/spark-client/examples/src/main/r/dataframe.R -> hdfs://mycluster/user/hrt_qa/.sparkStaging/application_1463956206030_0003/dataframe.R 16/05/24 22:08:29 INFO Client: Uploading resource
[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7
[ https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323638#comment-15323638 ] Alexander Ulanov commented on SPARK-15851: -- I can do that. However, it seems that "spark-build-info" can be rewritten as a shell script. This will remove the need to install bash for Windows users that compile Spark with maven. What do you think? > Spark 2.0 does not compile in Windows 7 > --- > > Key: SPARK-15851 > URL: https://issues.apache.org/jira/browse/SPARK-15851 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 > Environment: Windows 7 >Reporter: Alexander Ulanov > > Spark does not compile in Windows 7. > "mvn compile" fails on spark-core due to trying to execute a bash script > spark-build-info. > Work around: > 1)Install win-bash and put in path > 2)Change line 350 of core/pom.xml > > > > > > Error trace: > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project > spark-core_2.11: An Ant BuildException has occured: Execute failed: > java.io.IOException: Cannot run program > "C:\dev\spark\core\..\build\spark-build-info" (in directory > "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 > application > [ERROR] around Ant part ... executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in > C:\dev\spark\core\target\antrun\build-main.xml -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15854) Spark History server gets null pointer exception
Yesha Vora created SPARK-15854: -- Summary: Spark History server gets null pointer exception Key: SPARK-15854 URL: https://issues.apache.org/jira/browse/SPARK-15854 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Reporter: Yesha Vora In Spark2, Spark-History Server is configured to FSHistoryProvider. Spark HS does not show any finished/running applications and gets Null pointer exception. {code} 16/06/03 23:06:40 INFO FsHistoryProvider: Replaying log path: hdfs://xx:8020/spark2-history/application_1464912457462_0002.inprogress 16/06/03 23:06:50 INFO FsHistoryProvider: Replaying log path: hdfs://xx:8020/spark2-history/application_1464912457462_0002 16/06/03 23:08:27 WARN ServletHandler: Error for /api/v1/applications java.lang.NoSuchMethodError: javax.ws.rs.core.Application.getProperties()Ljava/util/Map; at org.glassfish.jersey.server.ApplicationHandler.(ApplicationHandler.java:331) at org.glassfish.jersey.servlet.WebComponent.(WebComponent.java:392) at org.glassfish.jersey.servlet.ServletContainer.init(ServletContainer.java:177) at org.glassfish.jersey.servlet.ServletContainer.init(ServletContainer.java:369) at javax.servlet.GenericServlet.init(GenericServlet.java:244) at org.spark_project.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:616) at org.spark_project.jetty.servlet.ServletHolder.getServlet(ServletHolder.java:472) at org.spark_project.jetty.servlet.ServletHolder.ensureInstance(ServletHolder.java:767) at org.spark_project.jetty.servlet.ServletHolder.prepare(ServletHolder.java:752) at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127) at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061) at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at 
org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:479) at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) at org.spark_project.jetty.server.Server.handle(Server.java:499) at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311) at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257) at org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544) at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635) at org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555) at java.lang.Thread.run(Thread.java:745) 16/06/03 23:08:33 WARN ServletHandler: /api/v1/applications java.lang.NullPointerException at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388) at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341) at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228) at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812) at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587) at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127) at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061) at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:479) at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) at 
org.spark_project.jetty.server.Server.handle(Server.java:499) at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311) at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257) at org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544) at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635) at org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by
[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7
[ https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323629#comment-15323629 ] Marcelo Vanzin commented on SPARK-15851: Adding "bash" explicitly in the pom should be fine. Need to change the sbt build also. > Spark 2.0 does not compile in Windows 7 > --- > > Key: SPARK-15851 > URL: https://issues.apache.org/jira/browse/SPARK-15851 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 > Environment: Windows 7 >Reporter: Alexander Ulanov > > Spark does not compile in Windows 7. > "mvn compile" fails on spark-core due to trying to execute a bash script > spark-build-info. > Work around: > 1)Install win-bash and put in path > 2)Change line 350 of core/pom.xml > > > > > > Error trace: > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project > spark-core_2.11: An Ant BuildException has occured: Execute failed: > java.io.IOException: Cannot run program > "C:\dev\spark\core\..\build\spark-build-info" (in directory > "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 > application > [ERROR] around Ant part ... executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in > C:\dev\spark\core\target\antrun\build-main.xml -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15853) HDFSMetadataLog.get leaks the input stream
[ https://issues.apache.org/jira/browse/SPARK-15853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15853: Assignee: Shixiong Zhu (was: Apache Spark) > HDFSMetadataLog.get leaks the input stream > -- > > Key: SPARK-15853 > URL: https://issues.apache.org/jira/browse/SPARK-15853 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > HDFSMetadataLog.get doesn't close the input stream. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
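The usual fix pattern for a leak like this is to guarantee the stream is closed on every exit path. In Java that is try-with-resources (Spark's Scala code would use try/finally); a self-contained sketch with hypothetical names, not the actual HDFSMetadataLog code:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of the leak fix: the stream is closed on every exit path,
// including when reading the bytes throws. Names are hypothetical.
class MetadataLogSketch {
    // Reads all bytes and always closes the stream (try-with-resources).
    static String readAndClose(InputStream in) throws IOException {
        try (InputStream s = in) {
            StringBuilder sb = new StringBuilder();
            int b;
            while ((b = s.read()) != -1) sb.append((char) b);
            return sb.toString();
        } // s.close() runs here even if read() threw
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("batch-41".getBytes());
        System.out.println(readAndClose(in)); // prints "batch-41"
    }
}
```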
[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7
[ https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323624#comment-15323624 ] Alexander Ulanov commented on SPARK-15851: -- This does not work because Ant uses Java Process to run executable which returns "not a valid Win32 application". In order to run it, one need to run "bash" and provide bash file as a param. This approach I proposed as a work-around. For more details please refer to: http://stackoverflow.com/questions/20883212/how-can-i-use-ant-exec-to-execute-commands-on-linux > Spark 2.0 does not compile in Windows 7 > --- > > Key: SPARK-15851 > URL: https://issues.apache.org/jira/browse/SPARK-15851 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 > Environment: Windows 7 >Reporter: Alexander Ulanov > > Spark does not compile in Windows 7. > "mvn compile" fails on spark-core due to trying to execute a bash script > spark-build-info. > Work around: > 1)Install win-bash and put in path > 2)Change line 350 of core/pom.xml > > > > > > Error trace: > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project > spark-core_2.11: An Ant BuildException has occured: Execute failed: > java.io.IOException: Cannot run program > "C:\dev\spark\core\..\build\spark-build-info" (in directory > "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 > application > [ERROR] around Ant part ... executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in > C:\dev\spark\core\target\antrun\build-main.xml -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
(Clarification of the comment above: Ant's exec task launches the target through a Java Process, and Windows rejects the shell script with "not a valid Win32 application"; to run it, one needs to invoke "bash" as the executable and pass the script file as a parameter, which is the work-around proposed here.)
[jira] [Assigned] (SPARK-15853) HDFSMetadataLog.get leaks the input stream
[ https://issues.apache.org/jira/browse/SPARK-15853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15853: Assignee: Apache Spark (was: Shixiong Zhu) > HDFSMetadataLog.get leaks the input stream > -- > > Key: SPARK-15853 > URL: https://issues.apache.org/jira/browse/SPARK-15853 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > > HDFSMetadataLog.get doesn't close the input stream. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15830) Spark application should get hive tokens only when it is required
[ https://issues.apache.org/jira/browse/SPARK-15830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yesha Vora updated SPARK-15830: --- Affects Version/s: 1.6.1 > Spark application should get hive tokens only when it is required > - > > Key: SPARK-15830 > URL: https://issues.apache.org/jira/browse/SPARK-15830 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.6.1 >Reporter: Yesha Vora > > Currently , All spark application try to get Hive tokens (Even if application > does not use them) if Hive is installed on the cluster. > Due to this practice, spark application which does not require Hive fails > when Hive service (metastore) is down for some reason. > Thus, spark should only try to get Hive tokens when required. It should not > fetch hive token if it is not needed by application. > Example : Spark Pi application does not perform any hive related actions. But > Spark Pi application still fails if hive metastore service is down. > {code} > 16/06/08 01:18:42 INFO YarnSparkHadoopUtil: getting token for namenode: > hdfs://xxx:8020/user/xx/.sparkStaging/application_1465347287950_0001 > 16/06/08 01:18:42 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 7 for > xx on xx.xx.xx.xxx:8020 > 16/06/08 01:18:43 INFO metastore: Trying to connect to metastore with URI > thrift://xx.xx.xx.xxx:9090 > 16/06/08 01:18:43 WARN metastore: Failed to connect to the MetaStore Server... > 16/06/08 01:18:43 INFO metastore: Waiting 5 seconds before next connection > attempt. > 16/06/08 01:18:48 INFO metastore: Trying to connect to metastore with URI > thrift://xx.xx.xx.xxx:9090 > 16/06/08 01:18:48 WARN metastore: Failed to connect to the MetaStore Server... > 16/06/08 01:18:48 INFO metastore: Waiting 5 seconds before next connection > attempt. > 16/06/08 01:18:53 INFO metastore: Trying to connect to metastore with URI > thrift://xx.xx.xx.xxx:9090 > 16/06/08 01:18:53 WARN metastore: Failed to connect to the MetaStore Server... 
> 16/06/08 01:18:53 INFO metastore: Waiting 5 seconds before next connection > attempt. > 16/06/08 01:18:59 WARN Hive: Failed to access metastore. This class should > not accessed in runtime. > org.apache.hadoop.hive.ql.metadata.Hive Exception : java.lang.Runtime > Exception : Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient > at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236) > at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174) > at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498){code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
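The improvement requested here is essentially lazy acquisition: contact the metastore only when a Hive action actually needs a token, so jobs like SparkPi never touch it. A generic memoized-supplier sketch of that pattern (illustrative only, not Spark's credential code):

```java
import java.util.function.Supplier;

// Illustrative only: a token provider that defers the (possibly failing)
// metastore call until a Hive action first needs it, then caches the result.
class LazyTokenSketch {
    private final Supplier<String> fetch;
    private String token;       // null until first use
    private int fetchCount = 0; // how many times we actually hit the "metastore"

    LazyTokenSketch(Supplier<String> fetch) { this.fetch = fetch; }

    // Called only from Hive code paths; non-Hive jobs never trigger the fetch.
    synchronized String hiveToken() {
        if (token == null) {
            token = fetch.get();
            fetchCount++;
        }
        return token;
    }

    synchronized int fetchCount() { return fetchCount; }

    public static void main(String[] args) {
        LazyTokenSketch t = new LazyTokenSketch(() -> "HIVE_DELEGATION_TOKEN");
        // A job like SparkPi never calls hiveToken(), so the metastore is never contacted.
        System.out.println(t.fetchCount()); // prints 0
        t.hiveToken();
        t.hiveToken();
        System.out.println(t.fetchCount()); // prints 1 (cached after first use)
    }
}
```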
[jira] [Created] (SPARK-15853) HDFSMetadataLog.get leaks the input stream
Shixiong Zhu created SPARK-15853: Summary: HDFSMetadataLog.get leaks the input stream Key: SPARK-15853 URL: https://issues.apache.org/jira/browse/SPARK-15853 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu HDFSMetadataLog.get doesn't close the input stream. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7
[ https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323620#comment-15323620 ] Marcelo Vanzin commented on SPARK-15851: [~tgraves] (and [~Dhruve Ashar]) hah we were talking about it Tuesday. :-) Maybe installing bash (not the Cygwin version, something like the one that comes with the Git for Windows installer) would help here? > Spark 2.0 does not compile in Windows 7 > --- > > Key: SPARK-15851 > URL: https://issues.apache.org/jira/browse/SPARK-15851 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 > Environment: Windows 7 >Reporter: Alexander Ulanov > > Spark does not compile in Windows 7. > "mvn compile" fails on spark-core due to trying to execute a bash script > spark-build-info. > Work around: > 1)Install win-bash and put in path > 2)Change line 350 of core/pom.xml > > > > > > Error trace: > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project > spark-core_2.11: An Ant BuildException has occured: Execute failed: > java.io.IOException: Cannot run program > "C:\dev\spark\core\..\build\spark-build-info" (in directory > "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 > application > [ERROR] around Ant part ... executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in > C:\dev\spark\core\target\antrun\build-main.xml -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15794) Should truncate toString() of very wide schemas
[ https://issues.apache.org/jira/browse/SPARK-15794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15794: Assignee: Eric Liang > Should truncate toString() of very wide schemas > --- > > Key: SPARK-15794 > URL: https://issues.apache.org/jira/browse/SPARK-15794 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >Assignee: Eric Liang > > With very wide tables, e.g. thousands of fields, the output is unreadable and > often causes OOMs due to inefficient string processing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
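The truncation proposed above can be sketched as: render only the first few field names and summarize the remainder, instead of building a string over thousands of fields. Names and the exact format are illustrative, not Spark's actual output:

```java
import java.util.List;

// Sketch of truncated schema rendering: show the first maxFields names and
// summarize the rest, avoiding huge (and OOM-prone) toString() results.
class TruncatedSchemaString {
    static String truncatedToString(List<String> fields, int maxFields) {
        if (fields.size() <= maxFields) return String.join(", ", fields);
        return String.join(", ", fields.subList(0, maxFields))
                + " ... " + (fields.size() - maxFields) + " more fields";
    }

    public static void main(String[] args) {
        System.out.println(truncatedToString(List.of("a", "b", "c", "d", "e"), 2));
        // prints "a, b ... 3 more fields"
    }
}
```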
[jira] [Updated] (SPARK-15764) Replace n^2 loop in BindReferences
[ https://issues.apache.org/jira/browse/SPARK-15764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15764: Description: BindReferences contains a n^2 loop which causes performance issues when operating over large schemas: to determine the ordinal of an input, we perform a linear scan over the {{input}} array. Because {{input}} can sometimes be a List, the call to {{input(ordinal).nullable}} can also be {{O( n )}}. Instead of performing a linear scan, we can convert the input into an array and build a hash map to map from expression ids to ordinals. The greater up-front cost of the map construction is offset by the fact that an expression can contain multiple attribute references, so the cost of the map construction is amortized across a number of lookups. was: BindReferences contains a n^2 loop which causes performance issues when operating over large schemas: to determine the ordinal of an input, we perform a linear scan over the {{input}} array. Because {{input}} can sometimes be a List, the call to {{input(ordinal).nullable}} can also be {{O(n)}}. Instead of performing a linear scan, we can convert the input into an array and build a hash map to map from expression ids to ordinals. The greater up-front cost of the map construction is offset by the fact that an expression can contain multiple attribute references, so the cost of the map construction is amortized across a number of lookups. > Replace n^2 loop in BindReferences > -- > > Key: SPARK-15764 > URL: https://issues.apache.org/jira/browse/SPARK-15764 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.0 > > > BindReferences contains a n^2 loop which causes performance issues when > operating over large schemas: to determine the ordinal of an input, we > perform a linear scan over the {{input}} array. 
Because {{input}} can > sometimes be a List, the call to {{input(ordinal).nullable}} can also be {{O( > n )}}. > Instead of performing a linear scan, we can convert the input into an array > and build a hash map to map from expression ids to ordinals. The greater > up-front cost of the map construction is offset by the fact that an > expression can contain multiple attribute references, so the cost of the map > construction is amortized across a number of lookups. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
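The optimization described above can be sketched as a one-time pass that maps each expression id to its ordinal, replacing a linear scan per attribute reference (O(n) per lookup, O(n^2) overall) with O(1) map lookups. Names are illustrative, not the actual BindReferences code:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: build one id -> ordinal map up front; each later lookup is O(1).
// The up-front O(n) cost is amortized across many attribute references.
class BindReferencesSketch {
    // One-time pass over the input attributes' expression ids.
    static Map<Long, Integer> buildOrdinalMap(long[] exprIds) {
        Map<Long, Integer> ordinals = new HashMap<>();
        for (int i = 0; i < exprIds.length; i++) {
            ordinals.put(exprIds[i], i); // expression id -> position in input
        }
        return ordinals;
    }

    public static void main(String[] args) {
        Map<Long, Integer> m = buildOrdinalMap(new long[]{10L, 42L, 7L});
        System.out.println(m.get(42L)); // prints 1
    }
}
```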
[jira] [Updated] (SPARK-15794) Should truncate toString() of very wide schemas
[ https://issues.apache.org/jira/browse/SPARK-15794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15794: Target Version/s: 2.0.0 > Should truncate toString() of very wide schemas > --- > > Key: SPARK-15794 > URL: https://issues.apache.org/jira/browse/SPARK-15794 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang > > With very wide tables, e.g. thousands of fields, the output is unreadable and > often causes OOMs due to inefficient string processing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15764) Replace n^2 loop in BindReferences
[ https://issues.apache.org/jira/browse/SPARK-15764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15764: Issue Type: Sub-task (was: Improvement) Parent: SPARK-15852 > Replace n^2 loop in BindReferences > -- > > Key: SPARK-15764 > URL: https://issues.apache.org/jira/browse/SPARK-15764 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.0 > > > BindReferences contains a n^2 loop which causes performance issues when > operating over large schemas: to determine the ordinal of an input, we > perform a linear scan over the {{input}} array. Because {{input}} can > sometimes be a List, the call to {{input(ordinal).nullable}} can also be O(n). > Instead of performing a linear scan, we can convert the input into an array > and build a hash map to map from expression ids to ordinals. The greater > up-front cost of the map construction is offset by the fact that an > expression can contain multiple attribute references, so the cost of the map > construction is amortized across a number of lookups. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15742) Reduce collections allocations in Catalyst tree transformation methods
[ https://issues.apache.org/jira/browse/SPARK-15742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15742: Issue Type: Sub-task (was: Improvement) Parent: SPARK-15852 > Reduce collections allocations in Catalyst tree transformation methods > -- > > Key: SPARK-15742 > URL: https://issues.apache.org/jira/browse/SPARK-15742 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.0 > > > In Catalyst's TreeNode {{transform}} methods we end up calling > {{productIterator.map(...).toArray()}} in a number of places, which is > slightly inefficient because it needs to allocate and grow ArrayBuilders. > Since we already know the size of the final output ({{productArity}}), we can > simply allocate an array up-front and use a while loop to consume the > iterator and populate the array. > For most workloads, this performance difference is negligible but it does > make a measurable difference in optimizer performance for queries that > operate over very wide schemas (thousands of columns). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
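The TreeNode change described above amounts to: when the output size ({{productArity}}) is known up front, allocate the array once and fill it with a while loop, instead of letting an ArrayBuilder grow and reallocate. A generic sketch with illustrative names:

```java
import java.util.Iterator;
import java.util.List;

// Sketch: copy a size-known iterator into a pre-sized array in one pass,
// avoiding the grow-and-copy cycles of a dynamically sized builder.
class PreSizedCopySketch {
    static Object[] mapToArray(Iterator<Object> it, int knownSize) {
        Object[] out = new Object[knownSize]; // single allocation, no growth
        int i = 0;
        while (it.hasNext()) {
            out[i++] = transform(it.next());
        }
        return out;
    }

    // Stand-in for the per-child transformation applied during tree rewrites.
    static Object transform(Object o) { return o.toString().toUpperCase(); }

    public static void main(String[] args) {
        List<Object> children = List.of("a", "b", "c");
        Object[] out = mapToArray(children.iterator(), children.size());
        System.out.println(out[2]); // prints "C"
    }
}
```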
[jira] [Updated] (SPARK-15748) Replace inefficient foldLeft() call in PartitionStatistics
[ https://issues.apache.org/jira/browse/SPARK-15748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15748: Issue Type: Sub-task (was: Improvement) Parent: SPARK-15852 > Replace inefficient foldLeft() call in PartitionStatistics > -- > > Key: SPARK-15748 > URL: https://issues.apache.org/jira/browse/SPARK-15748 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.0 > > > PartitionStatistics uses foldLeft and list concatenation to flatten an > iterator of lists, but this is extremely inefficient compared to simply doing > flatMap/flatten because it performs many unnecessary object allocations. > Simply replacing this foldLeft by a flatMap results in fair performance gains > when constructing PartitionStatistics instances for tables with many columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
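The inefficiency described above comes from flattening via repeated concatenation (foldLeft with list append), which re-copies the accumulated prefix on every step and is quadratic; a single flatten pass copies each element exactly once. A generic sketch of the flatMap-style fix, with illustrative names:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: flatten a sequence of lists in one pass into one output list,
// instead of foldLeft-style repeated concatenation (quadratic copying).
class FlattenSketch {
    static <T> List<T> flatten(List<List<T>> lists) {
        List<T> out = new ArrayList<>();
        for (List<T> l : lists) {
            out.addAll(l); // each element copied exactly once
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(flatten(List.of(List.of(1, 2), List.of(3))));
        // prints [1, 2, 3]
    }
}
```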
[jira] [Updated] (SPARK-15762) Cache Metadata.hashCode and use a singleton for Metadata.empty
[ https://issues.apache.org/jira/browse/SPARK-15762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15762: Issue Type: Sub-task (was: Improvement) Parent: SPARK-15852 > Cache Metadata.hashCode and use a singleton for Metadata.empty > -- > > Key: SPARK-15762 > URL: https://issues.apache.org/jira/browse/SPARK-15762 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.0 > > > In Spark SQL we should cache Metadata.hashCode and use a singleton for > Metadata.empty since calculating empty metadata hashCodes appears to be a > bottleneck according to certain profiler results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
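The two techniques from SPARK-15762, a memoized `hashCode` and a shared singleton for the empty value, can be sketched like this. This is an illustrative Java sketch under invented names, not Spark's actual Scala `Metadata` class:

```java
import java.util.Map;

public final class Metadata {
    // Singleton for the very common empty case: its hash is computed at most
    // once for the whole process, and callers can share one instance.
    private static final Metadata EMPTY = new Metadata(Map.of());

    private final Map<String, Object> entries;
    private int cachedHash; // 0 means "not yet computed"; recomputing is idempotent

    private Metadata(Map<String, Object> entries) {
        this.entries = entries;
    }

    public static Metadata empty() {
        return EMPTY;
    }

    @Override
    public int hashCode() {
        int h = cachedHash;
        if (h == 0) {              // lazily compute and cache
            h = entries.hashCode();
            if (h == 0) {
                h = 1;             // avoid re-hashing when the real hash is 0
            }
            cachedHash = h;
        }
        return h;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Metadata)) {
            return false;
        }
        return entries.equals(((Metadata) o).entries);
    }
}
```

Caching is safe here because the entries are immutable after construction, so the hash can never change.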
[jira] [Updated] (SPARK-15764) Replace n^2 loop in BindReferences
[ https://issues.apache.org/jira/browse/SPARK-15764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15764: Description: BindReferences contains an n^2 loop which causes performance issues when operating over large schemas: to determine the ordinal of an input, we perform a linear scan over the {{input}} array. Because {{input}} can sometimes be a List, the call to {{input(ordinal).nullable}} can also be {{O(n)}}. Instead of performing a linear scan, we can convert the input into an array and build a hash map to map from expression ids to ordinals. The greater up-front cost of the map construction is offset by the fact that an expression can contain multiple attribute references, so the cost of the map construction is amortized across a number of lookups. was: BindReferences contains an n^2 loop which causes performance issues when operating over large schemas: to determine the ordinal of an input, we perform a linear scan over the {{input}} array. Because {{input}} can sometimes be a List, the call to {{input(ordinal).nullable}} can also be O(n). Instead of performing a linear scan, we can convert the input into an array and build a hash map to map from expression ids to ordinals. The greater up-front cost of the map construction is offset by the fact that an expression can contain multiple attribute references, so the cost of the map construction is amortized across a number of lookups. > Replace n^2 loop in BindReferences > -- > > Key: SPARK-15764 > URL: https://issues.apache.org/jira/browse/SPARK-15764 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.0 > > > BindReferences contains an n^2 loop which causes performance issues when > operating over large schemas: to determine the ordinal of an input, we > perform a linear scan over the {{input}} array. 
Because {{input}} can > sometimes be a List, the call to {{input(ordinal).nullable}} can also be > {{O(n)}}. > Instead of performing a linear scan, we can convert the input into an array > and build a hash map to map from expression ids to ordinals. The greater > up-front cost of the map construction is offset by the fact that an > expression can contain multiple attribute references, so the cost of the map > construction is amortized across a number of lookups. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
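The index-building step described in SPARK-15764 can be sketched as follows (illustrative Java; Spark's actual BindReferences is Scala and keys on expression IDs rather than plain longs). Building the map once costs O(n), after which each attribute lookup is O(1) instead of an O(n) scan:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BindSketch {
    // One linear scan per bound reference gives O(n^2) total work over a wide
    // schema. Building an id -> ordinal map once makes each lookup O(1), and
    // the up-front cost is amortized across the many references per expression.
    static Map<Long, Integer> ordinalIndex(List<Long> attributeIds) {
        Map<Long, Integer> index = new HashMap<>(attributeIds.size() * 2);
        for (int i = 0; i < attributeIds.size(); i++) {
            index.put(attributeIds.get(i), i);
        }
        return index;
    }

    public static void main(String[] args) {
        Map<Long, Integer> index = ordinalIndex(List.of(7L, 3L, 9L));
        System.out.println(index.get(9L)); // 2
    }
}
```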
[jira] [Updated] (SPARK-15794) Should truncate toString() of very wide schemas
[ https://issues.apache.org/jira/browse/SPARK-15794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15794: Issue Type: Sub-task (was: Bug) Parent: SPARK-15852 > Should truncate toString() of very wide schemas > --- > > Key: SPARK-15794 > URL: https://issues.apache.org/jira/browse/SPARK-15794 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang > > With very wide tables, e.g. thousands of fields, the output is unreadable and > often causes OOMs due to inefficient string processing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15852) Improve query planning performance for wide nested schema
Reynold Xin created SPARK-15852: --- Summary: Improve query planning performance for wide nested schema Key: SPARK-15852 URL: https://issues.apache.org/jira/browse/SPARK-15852 Project: Spark Issue Type: Bug Components: SQL Reporter: Reynold Xin Assignee: Eric Liang This tracks a list of issues to improve query planning (and code generation) performance for wide nested schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14321) Reduce date format cost in date functions
[ https://issues.apache.org/jira/browse/SPARK-14321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-14321. - Resolution: Fixed Assignee: Herman van Hovell Fix Version/s: 2.0.0 > Reduce date format cost in date functions > - > > Key: SPARK-14321 > URL: https://issues.apache.org/jira/browse/SPARK-14321 > Project: Spark > Issue Type: Bug >Reporter: Rajesh Balamohan >Assignee: Herman van Hovell >Priority: Minor > Fix For: 2.0.0 > > > Currently the code generated is > {noformat} > /* 066 */ UTF8String primitive5 = null; > /* 067 */ if (!isNull4) { > /* 068 */ try { > /* 069 */ primitive5 = UTF8String.fromString(new > java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format( > /* 070 */ new java.util.Date(primitive7 * 1000L))); > /* 071 */ } catch (java.lang.Throwable e) { > /* 072 */ isNull4 = true; > /* 073 */ } > /* 074 */ } > {noformat} > Instantiation of SimpleDateFormat is fairly expensive. It can be created on > an as-needed basis instead of once per row. > I will share the patch soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
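One way to avoid the per-row SimpleDateFormat allocation is to hoist the formatter out of the hot path. This is a hedged sketch of the general technique, not the patch actually merged for SPARK-14321; a ThreadLocal is used because SimpleDateFormat is not thread-safe, so a single shared instance cannot simply be made static:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class FormatterReuse {
    // Create the formatter once per thread instead of once per formatted row.
    // SimpleDateFormat carries mutable state, hence the ThreadLocal.
    private static final ThreadLocal<SimpleDateFormat> FMT =
        ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd HH:mm:ss"));

    static String format(long epochSeconds) {
        return FMT.get().format(new Date(epochSeconds * 1000L));
    }
}
```

In generated code the same effect is achieved by emitting the formatter as a member of the generated class rather than constructing it inside the per-row expression.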
[jira] [Updated] (SPARK-15851) Spark 2.0 does not compile in Windows 7
[ https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-15851: - Fix Version/s: 2.0.0 > Spark 2.0 does not compile in Windows 7 > --- > > Key: SPARK-15851 > URL: https://issues.apache.org/jira/browse/SPARK-15851 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 > Environment: Windows 7 >Reporter: Alexander Ulanov > > Spark does not compile in Windows 7. > "mvn compile" fails on spark-core due to trying to execute a bash script > spark-build-info. > Work around: > 1)Install win-bash and put in path > 2)Change line 350 of core/pom.xml > > > > > > Error trace: > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project > spark-core_2.11: An Ant BuildException has occured: Execute failed: > java.io.IOException: Cannot run program > "C:\dev\spark\core\..\build\spark-build-info" (in directory > "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 > application > [ERROR] around Ant part ... executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in > C:\dev\spark\core\target\antrun\build-main.xml -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15851) Spark 2.0 does not compile in Windows 7
[ https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-15851: - Target Version/s: 2.0.0 Fix Version/s: (was: 2.0.0) > Spark 2.0 does not compile in Windows 7 > --- > > Key: SPARK-15851 > URL: https://issues.apache.org/jira/browse/SPARK-15851 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 > Environment: Windows 7 >Reporter: Alexander Ulanov > > Spark does not compile in Windows 7. > "mvn compile" fails on spark-core due to trying to execute a bash script > spark-build-info. > Work around: > 1)Install win-bash and put in path > 2)Change line 350 of core/pom.xml > > > > > > Error trace: > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project > spark-core_2.11: An Ant BuildException has occured: Execute failed: > java.io.IOException: Cannot run program > "C:\dev\spark\core\..\build\spark-build-info" (in directory > "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 > application > [ERROR] around Ant part ... executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in > C:\dev\spark\core\target\antrun\build-main.xml -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15851) Spark 2.0 does not compile in Windows 7
Alexander Ulanov created SPARK-15851: Summary: Spark 2.0 does not compile in Windows 7 Key: SPARK-15851 URL: https://issues.apache.org/jira/browse/SPARK-15851 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.0.0 Environment: Windows 7 Reporter: Alexander Ulanov Spark does not compile in Windows 7. "mvn compile" fails on spark-core due to trying to execute a bash script spark-build-info. Work around: 1)Install win-bash and put in path 2)Change line 350 of core/pom.xml Error trace: [ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project spark-core_2.11: An Ant BuildException has occured: Execute failed: java.io.IOException: Cannot run program "C:\dev\spark\core\..\build\spark-build-info" (in directory "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 application [ERROR] around Ant part .. @ 4:73 in C:\dev\spark\core\target\antrun\build-main.xml -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15850) Remove function grouping in SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15850: Assignee: Reynold Xin (was: Apache Spark) > Remove function grouping in SparkSession > > > Key: SPARK-15850 > URL: https://issues.apache.org/jira/browse/SPARK-15850 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > SparkSession does not have that many functions due to better namespacing, and > as a result we probably don't need the function grouping. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15850) Remove function grouping in SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323560#comment-15323560 ] Apache Spark commented on SPARK-15850: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/13582 > Remove function grouping in SparkSession > > > Key: SPARK-15850 > URL: https://issues.apache.org/jira/browse/SPARK-15850 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > SparkSession does not have that many functions due to better namespacing, and > as a result we probably don't need the function grouping. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15850) Remove function grouping in SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15850: Assignee: Apache Spark (was: Reynold Xin) > Remove function grouping in SparkSession > > > Key: SPARK-15850 > URL: https://issues.apache.org/jira/browse/SPARK-15850 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > SparkSession does not have that many functions due to better namespacing, and > as a result we probably don't need the function grouping. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15850) Remove function grouping in SparkSession
Reynold Xin created SPARK-15850: --- Summary: Remove function grouping in SparkSession Key: SPARK-15850 URL: https://issues.apache.org/jira/browse/SPARK-15850 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin SparkSession does not have that many functions due to better namespacing, and as a result we probably don't need the function grouping. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8426) Add blacklist mechanism for YARN container allocation
[ https://issues.apache.org/jira/browse/SPARK-8426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323522#comment-15323522 ] Kay Ousterhout commented on SPARK-8426: --- Can we merge this with SPARK-8425? This seems to be more general now (or move all of the general stuff to 8425, and leave any YARN-specific stuff, e.g., re-allocating bad containers, here). > Add blacklist mechanism for YARN container allocation > - > > Key: SPARK-8426 > URL: https://issues.apache.org/jira/browse/SPARK-8426 > Project: Spark > Issue Type: Improvement > Components: Scheduler, YARN >Reporter: Saisai Shao >Priority: Minor > Attachments: DesignDocforBlacklistMechanism.pdf > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15849) FileNotFoundException on _temporary while doing saveAsTable to S3
Sandeep created SPARK-15849: --- Summary: FileNotFoundException on _temporary while doing saveAsTable to S3 Key: SPARK-15849 URL: https://issues.apache.org/jira/browse/SPARK-15849 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.1 Environment: AWS EC2 with spark on yarn and s3 storage Reporter: Sandeep When submitting spark jobs to yarn cluster, I occasionally see these error messages while doing saveAsTable. I have tried doing this with spark.speculation=false, and get the same error. These errors are similar to SPARK-2984, but my jobs are writing to S3(s3n) : Caused by: java.io.FileNotFoundException: File s3n://xxx/_temporary/0/task_201606080516_0004_m_79 does not exist. at org.apache.hadoop.fs.s3native.NativeS3FileSystem.listStatus(NativeS3FileSystem.java:506) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310) at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46) at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151) ... 42 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13268) SQL Timestamp stored as GMT but toString returns GMT-08:00
[ https://issues.apache.org/jira/browse/SPARK-13268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323442#comment-15323442 ] Bo Meng commented on SPARK-13268: - Why is this related to Spark? The conversion does not use any Spark function and I think the conversion loses the time zone information along the way. > SQL Timestamp stored as GMT but toString returns GMT-08:00 > -- > > Key: SPARK-13268 > URL: https://issues.apache.org/jira/browse/SPARK-13268 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.6.0 >Reporter: Ilya Ganelin > > There is an issue with how timestamps are displayed/converted to Strings in > Spark SQL. The documentation states that the timestamp should be created in > the GMT time zone, however, if we do so, we see that the output actually > contains a -8 hour offset: > {code} > new > Timestamp(ZonedDateTime.parse("2015-01-01T00:00:00Z[GMT]").toInstant.toEpochMilli) > res144: java.sql.Timestamp = 2014-12-31 16:00:00.0 > new > Timestamp(ZonedDateTime.parse("2015-01-01T00:00:00Z[GMT-08:00]").toInstant.toEpochMilli) > res145: java.sql.Timestamp = 2015-01-01 00:00:00.0 > {code} > This result is confusing, unintuitive, and introduces issues when converting > from DataFrames containing timestamps to RDDs which are then saved as text. > This has the effect of essentially shifting all dates in a dataset by 1 day. > The suggested fix for this is to update the timestamp toString representation > to either a) Include timezone or b) Correctly display in GMT. > This change may well introduce substantial and insidious bugs so I'm not sure > how best to resolve this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15613) Incorrect days to millis conversion
[ https://issues.apache.org/jira/browse/SPARK-15613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323386#comment-15323386 ] Bo Meng commented on SPARK-15613: - Does this only happen to 1.6? I have tried on the latest master and it does not have this issue. > Incorrect days to millis conversion > > > Key: SPARK-15613 > URL: https://issues.apache.org/jira/browse/SPARK-15613 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.6.0 > Environment: java version "1.8.0_91" >Reporter: Dmitry Bushev > > There is an issue with the {{DateTimeUtils.daysToMillis}} implementation. It > affects {{DateTimeUtils.toJavaDate}} and ultimately CatalystTypeConverter, > i.e. the conversion of a date stored as {{Int}} days from epoch in InternalRow > to the {{java.sql.Date}} of the Row returned to the user. > > The issue can be reproduced with this test (all the following tests are in my > default timezone Europe/Moscow): > {code} > scala> for (days <- 0 to 2 if millisToDays(daysToMillis(days)) != days) > yield days > res23: scala.collection.immutable.IndexedSeq[Int] = Vector(4108, 4473, 4838, > 5204, 5568, 5932, 6296, 6660, 7024, 7388, 8053, 8487, 8851, 9215, 9586, 9950, > 10314, 10678, 11042, 11406, 11777, 12141, 12505, 12869, 13233, 13597, 13968, > 14332, 14696, 15060) > {code} > For example, for day {{4108}} of the epoch, the correct date should be > {{1981-04-01}} > {code} > scala> DateTimeUtils.toJavaDate(4107) > res25: java.sql.Date = 1981-03-31 > scala> DateTimeUtils.toJavaDate(4108) > res26: java.sql.Date = 1981-03-31 > scala> DateTimeUtils.toJavaDate(4109) > res27: java.sql.Date = 1981-04-02 > {code} > There was a previous unsuccessful attempt to work around the problem in > SPARK-11415. It seems the issue involves flaws in the Java date implementation > and I don't see how it can be fixed without third-party libraries. > I was not able to identify the library of choice for Spark. 
The following > implementation uses [JSR-310|http://www.threeten.org/] > {code} > def millisToDays(millisUtc: Long): SQLDate = { > val instant = Instant.ofEpochMilli(millisUtc) > val zonedDateTime = instant.atZone(ZoneId.systemDefault) > zonedDateTime.toLocalDate.toEpochDay.toInt > } > def daysToMillis(days: SQLDate): Long = { > val localDate = LocalDate.ofEpochDay(days) > val zonedDateTime = localDate.atStartOfDay(ZoneId.systemDefault) > zonedDateTime.toInstant.toEpochMilli > } > {code} > that produces correct results: > {code} > scala> for (days <- 0 to 2 if millisToDays(daysToMillis(days)) != days) > yield days > res37: scala.collection.immutable.IndexedSeq[Int] = Vector() > scala> new java.sql.Date(daysToMillis(4108)) > res36: java.sql.Date = 1981-04-01 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15613) Incorrect days to millis conversion
[ https://issues.apache.org/jira/browse/SPARK-15613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323386#comment-15323386 ] Bo Meng edited comment on SPARK-15613 at 6/9/16 9:24 PM: - Does this only happen to 1.6? I have tried on the latest master and it does not have this issue. Have not tried on 1.6. was (Author: bomeng): Does this only happen to 1.6? I have tried on the latest master and it does not have this issue. > Incorrect days to millis conversion > > > Key: SPARK-15613 > URL: https://issues.apache.org/jira/browse/SPARK-15613 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.6.0 > Environment: java version "1.8.0_91" >Reporter: Dmitry Bushev > > There is an issue with {{DateTimeUtils.daysToMillis}} implementation. It > affects {{DateTimeUtils.toJavaDate}} and ultimately CatalystTypeConverter, > i.e the conversion of date stored as {{Int}} days from epoch in InternalRow > to {{java.sql.Date}} of Row returned to user. > > The issue can be reproduced with this test (all the following tests are in my > defalut timezone Europe/Moscow): > {code} > scala> for (days <- 0 to 2 if millisToDays(daysToMillis(days)) != days) > yield days > res23: scala.collection.immutable.IndexedSeq[Int] = Vector(4108, 4473, 4838, > 5204, 5568, 5932, 6296, 6660, 7024, 7388, 8053, 8487, 8851, 9215, 9586, 9950, > 10314, 10678, 11042, 11406, 11777, 12141, 12505, 12869, 13233, 13597, 13968, > 14332, 14696, 15060) > {code} > For example, for {{4108}} day of epoch, the correct date should be > {{1981-04-01}} > {code} > scala> DateTimeUtils.toJavaDate(4107) > res25: java.sql.Date = 1981-03-31 > scala> DateTimeUtils.toJavaDate(4108) > res26: java.sql.Date = 1981-03-31 > scala> DateTimeUtils.toJavaDate(4109) > res27: java.sql.Date = 1981-04-02 > {code} > There was previous unsuccessful attempt to work around the problem in > SPARK-11415. 
It seems that issue involves flaws in java date implementation > and I don't see how it can be fixed without third-party libraries. > I was not able to identify the library of choice for Spark. The following > implementation uses [JSR-310|http://www.threeten.org/] > {code} > def millisToDays(millisUtc: Long): SQLDate = { > val instant = Instant.ofEpochMilli(millisUtc) > val zonedDateTime = instant.atZone(ZoneId.systemDefault) > zonedDateTime.toLocalDate.toEpochDay.toInt > } > def daysToMillis(days: SQLDate): Long = { > val localDate = LocalDate.ofEpochDay(days) > val zonedDateTime = localDate.atStartOfDay(ZoneId.systemDefault) > zonedDateTime.toInstant.toEpochMilli > } > {code} > that produces correct results: > {code} > scala> for (days <- 0 to 2 if millisToDays(daysToMillis(days)) != days) > yield days > res37: scala.collection.immutable.IndexedSeq[Int] = Vector() > scala> new java.sql.Date(daysToMillis(4108)) > res36: java.sql.Date = 1981-04-01 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14321) Reduce date format cost in date functions
[ https://issues.apache.org/jira/browse/SPARK-14321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323384#comment-15323384 ] Apache Spark commented on SPARK-14321: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/13581 > Reduce date format cost in date functions > - > > Key: SPARK-14321 > URL: https://issues.apache.org/jira/browse/SPARK-14321 > Project: Spark > Issue Type: Bug >Reporter: Rajesh Balamohan >Priority: Minor > > Currently the code generated is > {noformat} > /* 066 */ UTF8String primitive5 = null; > /* 067 */ if (!isNull4) { > /* 068 */ try { > /* 069 */ primitive5 = UTF8String.fromString(new > java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format( > /* 070 */ new java.util.Date(primitive7 * 1000L))); > /* 071 */ } catch (java.lang.Throwable e) { > /* 072 */ isNull4 = true; > /* 073 */ } > /* 074 */ } > {noformat} > Instantiation of SimpleDateFormat is fairly expensive. It can be created on > an as-needed basis instead of once per row. > I will share the patch soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14485) Task finished cause fetch failure when its executor has already been removed by driver
[ https://issues.apache.org/jira/browse/SPARK-14485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323344#comment-15323344 ] Marcelo Vanzin commented on SPARK-14485: bq. I don't think (a) is especially rare: that's the case anytime data is saved to HDFS I didn't mean rare in general, I meant it should be rare to hit this particular case (scheduler thinks the executor is gone *and* a task result arrives later). The normal case is the task result arrives while the executor is still alive, and the change doesn't really touch that case. > Task finished cause fetch failure when its executor has already been removed > by driver > --- > > Key: SPARK-14485 > URL: https://issues.apache.org/jira/browse/SPARK-14485 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1, 1.5.2 >Reporter: iward >Assignee: iward > Fix For: 2.0.0 > > > Now, when executor is removed by driver with heartbeats timeout, driver will > re-queue the task on this executor and send a kill command to cluster to kill > this executor. > But, in a situation, the running task of this executor is finished and return > result to driver before this executor killed by kill command sent by driver. > At this situation, driver will accept the task finished event and ignore > speculative task and re-queued task. But, as we know, this executor has > removed by driver, the result of this finished task can not save in driver > because the *BlockManagerId* has also removed from *BlockManagerMaster* by > driver. So, the result data of this stage is not complete, and then, it will > cause fetch failure. 
> For example, the following is the task log: > {noformat} > 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN HeartbeatReceiver: Removing > executor 322 with no recent heartbeats: 256015 ms exceeds timeout 25 ms > 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 ERROR YarnScheduler: Lost executor > 322 on BJHC-HERA-16168.hadoop.jd.local: Executor heartbeat timed out after > 256015 ms > 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO TaskSetManager: Re-queueing > tasks for 322 from TaskSet 107.0 > 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN TaskSetManager: Lost task > 229.0 in stage 107.0 (TID 10384, BJHC-HERA-16168.hadoop.jd.local): > ExecutorLostFailure (executor 322 lost) > 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO DAGScheduler: Executor lost: > 322 (epoch 11) > 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMasterEndpoint: > Trying to remove executor 322 from BlockManagerMaster. > 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMaster: Removed > 322 successfully in removeExecutor > {noformat} > {noformat} > 2015-12-31 04:38:52 INFO 15/12/31 04:38:52 INFO TaskSetManager: Finished task > 229.0 in stage 107.0 (TID 10384) in 272315 ms on > BJHC-HERA-16168.hadoop.jd.local (579/700) > 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Ignoring > task-finished event for 229.1 in stage 107.0 because task 229 has already > completed successfully > {noformat} > {noformat} > 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO DAGScheduler: Submitting 3 > missing tasks from ShuffleMapStage 107 (MapPartitionsRDD[263] at > mapPartitions at Exchange.scala:137) > 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO YarnScheduler: Adding task > set 107.1 with 3 tasks > 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task > 0.0 in stage 107.1 (TID 10863, BJHC-HERA-18043.hadoop.jd.local, > PROCESS_LOCAL, 3745 bytes) > 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task > 1.0 in stage 107.1 (TID 
10864, BJHC-HERA-9291.hadoop.jd.local, PROCESS_LOCAL, > 3745 bytes) > 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task > 2.0 in stage 107.1 (TID 10865, BJHC-HERA-16047.hadoop.jd.local, > PROCESS_LOCAL, 3745 bytes) > {noformat} > Driver will check the stage's result is not complete, and submit missing > task, but this time, the next stage has run because previous stage has finish > for its task is all finished although its result is not complete. > {noformat} > 2015-12-31 04:40:13 INFO 15/12/31 04:40:13 WARN TaskSetManager: Lost task > 39.0 in stage 109.0 (TID 10905, BJHC-HERA-9357.hadoop.jd.local): > FetchFailed(null, shuffleId=11, mapId=-1, reduceId=39, message= > 2015-12-31 04:40:13 INFO > org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output > location for shuffle 11 > 2015-12-31 04:40:13 INFO at > org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385) > 2015-12-31 04:40:13 INFO at >
[jira] [Issue Comment Deleted] (SPARK-15801) spark-submit --num-executors switch also works without YARN
[ https://issues.apache.org/jira/browse/SPARK-15801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Taws updated SPARK-15801: -- Comment: was deleted (was: Indeed, should be enough as it is then. ) > spark-submit --num-executors switch also works without YARN > --- > > Key: SPARK-15801 > URL: https://issues.apache.org/jira/browse/SPARK-15801 > Project: Spark > Issue Type: Documentation > Components: Spark Submit >Affects Versions: 1.6.1 >Reporter: Jonathan Taws >Priority: Minor > > Based on this [issue|https://issues.apache.org/jira/browse/SPARK-15781] > regarding the SPARK_WORKER_INSTANCES property, I also found that the > {{--num-executors}} switch documented in the spark-submit help is partially > incorrect. > Here's one part of the output (produced by {{spark-submit --help}}): > {code} > YARN-only: > --driver-cores NUM Number of cores used by the driver, only in > cluster mode > (Default: 1). > --queue QUEUE_NAME The YARN queue to submit to (Default: > "default"). > --num-executors NUM Number of executors to launch (Default: 2). > {code} > Correct me if I am wrong, but the num-executors switch also works in Spark > standalone mode *without YARN*. > I tried by only launching a master and a worker with 4 executors specified, > and they were all successfully spawned. The master switch pointed to the > master's url, and not to the yarn value. > Here's the exact command : {{spark-submit --master spark://[local > machine]:7077 --num-executors 4 --executor-cores 2}} > By default it is *1* executor per worker in Spark standalone mode without > YARN, but this option enables to specify the number of executors (per worker > ?) if, and only if, the {{--executor-cores}} switch is also set. I do believe > it defaults to 2 in YARN mode. > I would propose to move the option from the *YARN-only* section to the *Spark > standalone and YARN only* section. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
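The allocation behavior described in the report above (all 4 requested executors spawned once {{--executor-cores}} is set) can be sketched as a simple capacity model. This is a hypothetical illustration, not Spark's actual standalone scheduler code; the function name and the cap logic are assumptions: a worker can only host as many executors as fit in its core budget.

```python
def executors_spawned_per_worker(worker_cores: int, executor_cores: int,
                                 requested_executors: int) -> int:
    """Hypothetical sketch of standalone-mode executor allocation.

    The spawned count is capped by how many executors of the given core
    size fit into the worker's core budget.
    """
    max_fitting = worker_cores // executor_cores
    return min(requested_executors, max_fitting)

# With 8 worker cores and --executor-cores 2, all 4 requested executors fit.
print(executors_spawned_per_worker(8, 2, 4))
```

Under this model, the same request on a 4-core worker would yield only 2 executors, which matches the reporter's observation that the count is only honored when {{--executor-cores}} bounds each executor's share.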
[jira] [Commented] (SPARK-15801) spark-submit --num-executors switch also works without YARN
[ https://issues.apache.org/jira/browse/SPARK-15801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323242#comment-15323242 ] Jonathan Taws commented on SPARK-15801:
Indeed, should be enough as it is then.

> spark-submit --num-executors switch also works without YARN
> -----------------------------------------------------------
>
> Key: SPARK-15801
> URL: https://issues.apache.org/jira/browse/SPARK-15801
> Project: Spark
> Issue Type: Documentation
> Components: Spark Submit
> Affects Versions: 1.6.1
> Reporter: Jonathan Taws
> Priority: Minor
>
> [...]
[jira] [Comment Edited] (SPARK-15781) Misleading deprecated property in standalone cluster configuration documentation
[ https://issues.apache.org/jira/browse/SPARK-15781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323235#comment-15323235 ] Jonathan Taws edited comment on SPARK-15781 at 6/9/16 8:11 PM:
What are our next steps on this? CC Andrew or someone who knows standalone?

was (Author: jonathantaws):
What are our nextsteps on this ? CC Andrew or someone who knows standalone ?

> Misleading deprecated property in standalone cluster configuration documentation
> --------------------------------------------------------------------------------
>
> Key: SPARK-15781
> URL: https://issues.apache.org/jira/browse/SPARK-15781
> Project: Spark
> Issue Type: Documentation
> Components: Documentation
> Affects Versions: 1.6.1
> Reporter: Jonathan Taws
> Priority: Minor
>
> I am unsure whether this is regarded as an issue or not, but in the [latest|http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts] documentation for configuring Spark in standalone cluster mode, the following property is documented:
> |SPARK_WORKER_INSTANCES|Number of worker instances to run on each machine (default: 1). You can make this more than 1 if you have very large machines and would like multiple Spark worker processes. If you do set this, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores per worker, or else each worker will try to use all the cores.|
> However, once I launch Spark with the spark-submit utility and the property {{SPARK_WORKER_INSTANCES}} set in my spark-env.sh file, I get the following deprecation warning:
> {code}
> 16/06/06 16:38:28 WARN SparkConf:
> SPARK_WORKER_INSTANCES was detected (set to '4').
> This is deprecated in Spark 1.0+.
> Please instead use:
>  - ./spark-submit with --num-executors to specify the number of executors
>  - Or set SPARK_EXECUTOR_INSTANCES
>  - spark.executor.instances to configure the number of instances in the spark config.
> {code}
> Is it regarded as normal practice to have deprecated fields documented in the documentation?
> I would have preferred to learn about the --num-executors property directly rather than having to submit my application and discover a deprecation warning.
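The three replacements named in the deprecation warning quoted above can be sketched as the following configuration fragment; the value 4 is illustrative, matching the reporter's setting:

```
# Option 1: pass the count on the spark-submit command line
./bin/spark-submit --num-executors 4 ...

# Option 2: environment variable (e.g. in spark-env.sh)
export SPARK_EXECUTOR_INSTANCES=4

# Option 3: Spark configuration (e.g. in spark-defaults.conf)
spark.executor.instances  4
```

All three come straight from the warning text; which one takes precedence when several are set is not stated there.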
[jira] [Commented] (SPARK-15781) Misleading deprecated property in standalone cluster configuration documentation
[ https://issues.apache.org/jira/browse/SPARK-15781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323235#comment-15323235 ] Jonathan Taws commented on SPARK-15781:
What are our next steps on this? CC Andrew or someone who knows standalone?

> Misleading deprecated property in standalone cluster configuration documentation
> --------------------------------------------------------------------------------
>
> Key: SPARK-15781
> URL: https://issues.apache.org/jira/browse/SPARK-15781
> Project: Spark
> Issue Type: Documentation
> Components: Documentation
> Affects Versions: 1.6.1
> Reporter: Jonathan Taws
> Priority: Minor
>
> [...]
> I would have preferred to learn about the --num-executors property directly rather than having to submit my application and discover a deprecation warning.
[jira] [Updated] (SPARK-15811) UDFs do not work in Spark 2.0-preview built with scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franklyn Dsouza updated SPARK-15811:
Shepherd: Davies Liu

> UDFs do not work in Spark 2.0-preview built with scala 2.10
> -----------------------------------------------------------
>
> Key: SPARK-15811
> URL: https://issues.apache.org/jira/browse/SPARK-15811
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.0.0
> Reporter: Franklyn Dsouza
> Priority: Critical
>
> I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following
> {code}
> ./dev/change-version-to-2.10.sh
> ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive
> {code}
> and then ran the following code in a pyspark shell
> {code}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import IntegerType, StructField, StructType
> from pyspark.sql.functions import udf
> from pyspark.sql.types import Row
> spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate()
> add_one = udf(lambda x: x + 1, IntegerType())
> schema = StructType([StructField('a', IntegerType(), False)])
> df = spark.createDataFrame([Row(a=1), Row(a=2)], schema)
> df.select(add_one(df.a).alias('incremented')).collect()
> {code}
> This never returns with a result.
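For reference, the hang reported above occurs where a working build would return {{[Row(incremented=2), Row(incremented=3)]}}. The sketch below models that expected result in plain Python, without a Spark installation; the {{Row}} namedtuple here is a stand-in for pyspark's {{Row}}, not the real class.

```python
from collections import namedtuple

Row = namedtuple("Row", ["incremented"])  # stand-in for pyspark.sql.Row

# Same logic the report wraps in udf(lambda x: x + 1, IntegerType())
add_one = lambda x: x + 1

source = [1, 2]  # the 'a' column values from the report's DataFrame
result = [Row(incremented=add_one(a)) for a in source]
print(result)  # [Row(incremented=2), Row(incremented=3)]
```

Since the UDF itself is trivial, the hang points at the Python-to-JVM UDF execution path in the scala-2.10 build rather than at the lambda.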
[jira] [Updated] (SPARK-15848) Spark unable to read partitioned table in avro format and column name in upper case
[ https://issues.apache.org/jira/browse/SPARK-15848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-15848:
Affects Version/s: 1.6.1

> Spark unable to read partitioned table in avro format and column name in upper case
> -----------------------------------------------------------------------------------
>
> Key: SPARK-15848
> URL: https://issues.apache.org/jira/browse/SPARK-15848
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.1
> Reporter: Zhan Zhang
>
> For external partitioned Hive tables created in Avro format, Spark returns "null" values when column names are in uppercase in the Avro schema.
> The same tables return proper data when queried in the Hive client.
[jira] [Commented] (SPARK-15848) Spark unable to read partitioned table in avro format and column name in upper case
[ https://issues.apache.org/jira/browse/SPARK-15848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323195#comment-15323195 ] Zhan Zhang commented on SPARK-15848:
{code}
cat > file1.csv<   [heredoc contents truncated]
cat > file2.csv<   [heredoc contents truncated]

scala> val tbl = sqlContext.table("default.avro_table_uppercase")
scala> tbl.show
+----------+----------+-----+----+
|student_id|subject_id|marks|year|
+----------+----------+-----+----+
|      null|      null|  100|2000|
|      null|      null|   20|2000|
|      null|      null|  160|2000|
|      null|      null|  963|2000|
|      null|      null|  142|2000|
|      null|      null|  430|2000|
|      null|      null|   91|2002|
|      null|      null|   28|2002|
|      null|      null|   16|2002|
|      null|      null|   96|2002|
|      null|      null|   14|2002|
|      null|      null|   43|2002|
+----------+----------+-----+----+
{code}

> Spark unable to read partitioned table in avro format and column name in upper case
> -----------------------------------------------------------------------------------
>
> Key: SPARK-15848
> [...]
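A plausible mechanism for the nulls shown above (an assumption, not the confirmed root cause) is a case-sensitive field lookup: if the Avro schema carries uppercase field names while the reader resolves lowercase Hive column names, every lookup misses and yields null, exactly as in the {{tbl.show}} output. A minimal sketch, with the record and both reader functions being hypothetical illustrations:

```python
# Hypothetical Avro record, keyed by its schema's uppercase field names.
avro_record = {"STUDENT_ID": 7, "SUBJECT_ID": 3, "MARKS": 100}

def read_case_sensitive(record, column):
    # No case folding: a lowercase Hive column name like "student_id"
    # never matches "STUDENT_ID", so the value comes back as None/null.
    return record.get(column)

def read_case_insensitive(record, column):
    # Folding both sides to lowercase (as Hive effectively does) finds it.
    folded = {k.lower(): v for k, v in record.items()}
    return folded.get(column.lower())

print(read_case_sensitive(avro_record, "student_id"))    # None - the observed nulls
print(read_case_insensitive(avro_record, "student_id"))  # 7
```

Note that {{marks}} and {{year}} survive in the output above, consistent with this model: {{year}} is the partition column (resolved from the path, not the Avro schema), and {{marks}} may simply be lowercase in the schema.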
[jira] [Created] (SPARK-15848) Spark unable to read partitioned table in avro format and column name in upper case
Zhan Zhang created SPARK-15848:
Summary: Spark unable to read partitioned table in avro format and column name in upper case
Key: SPARK-15848
URL: https://issues.apache.org/jira/browse/SPARK-15848
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Zhan Zhang

For external partitioned Hive tables created in Avro format, Spark returns "null" values when column names are in uppercase in the Avro schema. The same tables return proper data when queried in the Hive client.
[jira] [Resolved] (SPARK-15839) Maven doc JAR generation fails when JAVA_7_HOME is set
[ https://issues.apache.org/jira/browse/SPARK-15839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-15839.
Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13573
[https://github.com/apache/spark/pull/13573]

> Maven doc JAR generation fails when JAVA_7_HOME is set
> ------------------------------------------------------
>
> Key: SPARK-15839
> URL: https://issues.apache.org/jira/browse/SPARK-15839
> Project: Spark
> Issue Type: Bug
> Components: Build, Project Infra
> Affects Versions: 2.0.0
> Reporter: Josh Rosen
> Assignee: Josh Rosen
> Fix For: 2.0.0
>
> It looks like the nightly Maven snapshots broke after we set JAVA_7_HOME in the build: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/1573/.
> It seems that passing {{-javabootclasspath}} to scalac using scala-maven-plugin ends up preventing the Scala library classes from being added to scalac's internal class path, causing compilation errors while building doc-jars.
> There might be a principled fix to this inside of the scala-maven-plugin itself, but for now I propose that we simply omit the {{-javabootclasspath}} option during Maven doc-jar generation.