[jira] [Updated] (SPARK-20706) Spark-shell not overriding method definition
[ https://issues.apache.org/jira/browse/SPARK-20706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raphael Roth updated SPARK-20706:
---------------------------------
    Description: 
In the following example, the definition of myMethod is not correctly updated:

{code}
def myMethod() = "first definition"
val tmp = myMethod(); val out = tmp
println(out) // prints "first definition"

def myMethod() = "second definition" // override above myMethod
val tmp = myMethod(); val out = tmp
println(out) // should be "second definition" but is "first definition"
{code}

So if I redefine myMethod, the implementation seems not to be updated in this case. I figured out that the second-last statement (val out = tmp) causes this behavior; if it is moved into a separate block, the code works just fine.
> Spark-shell not overriding method definition
> --------------------------------------------
>
>                 Key: SPARK-20706
>                 URL: https://issues.apache.org/jira/browse/SPARK-20706
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Shell
>    Affects Versions: 2.0.0
>         Environment: Linux, Scala 2.11.8
>            Reporter: Raphael Roth
>
> In the following example, the definition of myMethod is not correctly updated:
> {code}
> def myMethod() = "first definition"
> val tmp = myMethod(); val out = tmp
> println(out) // prints "first definition"
>
> def myMethod() = "second definition" // override above myMethod
> val tmp = myMethod(); val out = tmp
> println(out) // should be "second definition" but is "first definition"
> {code}
> So if I redefine myMethod, the implementation seems not to be updated in this case. I figured out that the second-last statement (val out = tmp) causes this behavior; if it is moved into a separate block, the code works just fine.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20706) Spark-shell not overriding method definition
Raphael Roth created SPARK-20706:
------------------------------------

             Summary: Spark-shell not overriding method definition
                 Key: SPARK-20706
                 URL: https://issues.apache.org/jira/browse/SPARK-20706
             Project: Spark
          Issue Type: Bug
          Components: Spark Shell
    Affects Versions: 2.0.0
         Environment: Linux, Scala 2.11.8
            Reporter: Raphael Roth

In the following example, the definition of myMethod is not correctly updated:

{code}
def myMethod() = "first definition"
val tmp = myMethod(); val out = tmp
println(out) // prints "first definition"

def myMethod() = "second definition" // override above myMethod
val tmp = myMethod(); val out = tmp
println(out) // should be "second definition" but is "first definition"
{code}

So if I redefine myMethod, the implementation seems not to be updated in this case. I figured out that the second-last statement (val out = tmp) causes this behavior; if it is moved into a separate block, the code works just fine.
[jira] [Commented] (SPARK-20705) The sort function can not be used in the master page when you use Firefox or Google Chrome.
[ https://issues.apache.org/jira/browse/SPARK-20705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005977#comment-16005977 ]

guoxiaolongzte commented on SPARK-20705:
----------------------------------------

I will work on it, thank you.

> The sort function can not be used in the master page when you use Firefox or Google Chrome.
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-20705
>                 URL: https://issues.apache.org/jira/browse/SPARK-20705
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 2.1.0, 2.1.1
>            Reporter: guoxiaolongzte
>            Priority: Minor
[jira] [Commented] (SPARK-20705) The sort function can not be used in the master page when you use Firefox or Google Chrome.
[ https://issues.apache.org/jira/browse/SPARK-20705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005972#comment-16005972 ]

Hyukjin Kwon commented on SPARK-20705:
--------------------------------------

Please fill in the description in the issue.

> The sort function can not be used in the master page when you use Firefox or Google Chrome.
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-20705
>                 URL: https://issues.apache.org/jira/browse/SPARK-20705
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 2.1.0, 2.1.1
>            Reporter: guoxiaolongzte
>            Priority: Minor
[jira] [Created] (SPARK-20705) The sort function can not be used in the master page when you use Firefox or Google Chrome.
guoxiaolongzte created SPARK-20705:
--------------------------------------

             Summary: The sort function can not be used in the master page when you use Firefox or Google Chrome.
                 Key: SPARK-20705
                 URL: https://issues.apache.org/jira/browse/SPARK-20705
             Project: Spark
          Issue Type: Bug
          Components: Web UI
    Affects Versions: 2.1.1, 2.1.0
            Reporter: guoxiaolongzte
            Priority: Minor
[jira] [Commented] (SPARK-20698) =, ==, > is not working as expected when used in sql query
[ https://issues.apache.org/jira/browse/SPARK-20698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005955#comment-16005955 ]

someshwar kale commented on SPARK-20698:
----------------------------------------

Sorry [~srowen]. Silly question... will follow the etiquette next time. Thanks!!

> =, ==, > is not working as expected when used in sql query
> ----------------------------------------------------------
>
>                 Key: SPARK-20698
>                 URL: https://issues.apache.org/jira/browse/SPARK-20698
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.2
>         Environment: windows
>            Reporter: someshwar kale
>            Priority: Critical
>
> I have written the below spark program - it's not working as expected
> {code}
> package computedBatch;
>
> import org.apache.log4j.Level;
> import org.apache.log4j.Logger;
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.JavaSparkContext;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.sql.DataFrame;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.RowFactory;
> import org.apache.spark.sql.SQLContext;
> import org.apache.spark.sql.hive.HiveContext;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
>
> import java.util.ArrayList;
> import java.util.Arrays;
> import java.util.List;
>
> public class ArithmeticIssueTest {
>     private transient JavaSparkContext javaSparkContext;
>     private transient SQLContext sqlContext;
>
>     public ArithmeticIssueTest() {
>         Logger.getLogger("org").setLevel(Level.OFF);
>         Logger.getLogger("akka").setLevel(Level.OFF);
>         SparkConf conf = new SparkConf().setAppName("ArithmeticIssueTest").setMaster("local[4]");
>         javaSparkContext = new JavaSparkContext(conf);
>         sqlContext = new HiveContext(javaSparkContext);
>     }
>
>     public static void main(String[] args) {
>         ArithmeticIssueTest arithmeticIssueTest = new ArithmeticIssueTest();
>         arithmeticIssueTest.execute();
>     }
>
>     private void execute() {
>         List<String> data = Arrays.asList(
>                 "a1,1494389759,99.8793003568,325.389705932",
>                 "a1,1494389759,99.9472573803,325.27559502",
>                 "a1,1494389759,99.7887233987,325.334374851",
>                 "a1,1494389759,99.9547800925,325.371537062",
>                 "a1,1494389759,99.8039111691,325.305285877",
>                 "a1,1494389759,99.8342317379,325.24881354",
>                 "a1,1494389759,99.9849449235,325.396678931",
>                 "a1,1494389759,99.9396731311,325.336115345",
>                 "a1,1494389759,99.9320915068,325.242622938",
>                 "a1,1494389759,99.894669,325.320965146",
>                 "a1,1494389759,99.7735359781,325.345168334",
>                 "a1,1494389759,99.9698837734,325.352291407",
>                 "a1,1494389759,99.8418330703,325.296539372",
>                 "a1,1494389759,99.796315751,325.347570632",
>                 "a1,1494389759,99.7811931613,325.351137315",
>                 "a1,1494389759,99.9773765104,325.218131741",
>                 "a1,1494389759,99.8189825201,325.288197381",
>                 "a1,1494389759,99.8115005369,325.282327633",
>                 "a1,1494389759,99.9924539722,325.24048614",
>                 "a1,1494389759,99.9170191204,325.299431664");
>         JavaRDD<String> rawData = javaSparkContext.parallelize(data);
>         List<StructField> fields = new ArrayList<>();
>         fields.add(DataTypes.createStructField("ASSET_ID", DataTypes.StringType, true));
>         fields.add(DataTypes.createStructField("TIMESTAMP", DataTypes.LongType, true));
>         fields.add(DataTypes.createStructField("fuel", DataTypes.DoubleType, true));
>         fields.add(DataTypes.createStructField("temperature", DataTypes.DoubleType, true));
>         StructType schema = DataTypes.createStructType(fields);
>         JavaRDD<Row> rowRDD = rawData.map(
>                 (Function<String, Row>) record -> {
>                     String[] fields1 = record.split(",");
>                     return RowFactory.create(
>                             fields1[0].trim(),
>                             Long.parseLong(fields1[1].trim()),
>                             Double.parseDouble(fields1[2].trim()),
>                             Double.parseDouble(fields1[3].trim()));
>                 });
>         DataFrame df = sqlContext.createDataFrame(rowRDD, schema);
>         df.show(false);
>         df.registerTempTable("x_linkx1087571272_filtered");
>         sqlContext.sql("SELECT x_linkx1087571272_filtered.ASSET_ID, count(case when x_linkx1087571272_filtered" +
>                 ".temperature=325.0 then 1 else 0 end) AS xsumptionx1582594572, max(x_linkx1087571272_filtered" +
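The query above aggregates with count(case when ... then 1 else 0 end). The thread does not confirm a root cause, so this is only an assumption on my part, but that query shape is a classic pitfall: COUNT counts every non-null value, and both CASE branches (1 and 0) are non-null, so the result is always the total row count regardless of the predicate. A minimal sketch using Python's built-in sqlite3 rather than Spark SQL (the aggregate semantics are standard SQL):

```python
import sqlite3

# In-memory table modeled on the report's data: temperature values are near
# 325 but never exactly 325.0, so the predicate matches zero rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (asset_id TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a1", 325.389705932), ("a1", 325.27559502), ("a1", 325.334374851)],
)

# COUNT(CASE ... ELSE 0 END) also counts the non-null 0s -> always 3 here.
count_case = conn.execute(
    "SELECT count(CASE WHEN temperature = 325.0 THEN 1 ELSE 0 END) FROM readings"
).fetchone()[0]

# SUM(CASE ... ELSE 0 END), or COUNT with no ELSE branch, counts only matches.
sum_case = conn.execute(
    "SELECT sum(CASE WHEN temperature = 325.0 THEN 1 ELSE 0 END) FROM readings"
).fetchone()[0]

print(count_case, sum_case)  # 3 0
```

Switching to sum(case when ... then 1 else 0 end), or count(case when ... then 1 end) with no else, gives the intended match count.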
[jira] [Commented] (SPARK-12225) Support adding or replacing multiple columns at once in DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-12225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005904#comment-16005904 ]

Liang-Chi Hsieh commented on SPARK-12225:
-----------------------------------------

Without knowing this issue, I've implemented a {{withColumns}} API in Dataset in SPARK-20542. It benefits ML usage a lot and gets better performance results. For ML pipelines, which can chain dozens of stages, if we do withColumn in each stage, the total cost grows big fast.

> Support adding or replacing multiple columns at once in DataFrame API
> ---------------------------------------------------------------------
>
>                 Key: SPARK-12225
>                 URL: https://issues.apache.org/jira/browse/SPARK-12225
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 1.5.2
>            Reporter: Sun Rui
>
> Currently, the withColumn() method of DataFrame supports adding or replacing only a single column. It would be convenient to support adding or replacing multiple columns at once.
> Also, withColumnRenamed() supports renaming only a single column. It would also be convenient to support renaming multiple columns at once.
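The cost argument in the comment above can be illustrated with a toy immutable "frame" in pure Python. This is not Spark's implementation, and the names (Frame, with_column, with_columns) are purely illustrative: each with_column call produces a whole new object, standing in for the new Dataset plan Spark analyzes per call, so n single-column calls pay the per-call overhead n times while one batched call pays it once.

```python
# Toy immutable frame: constructing one stands in for building and analyzing
# a new query plan, which is the per-call overhead the comment refers to.
class Frame:
    copies = 0  # counts how many frames (i.e. "plans") were constructed

    def __init__(self, columns):
        self.columns = dict(columns)
        Frame.copies += 1

    def with_column(self, name, values):
        # One new frame per added column, as with chained withColumn calls.
        return Frame({**self.columns, name: values})

    def with_columns(self, new_columns):
        # One new frame for the whole batch, as with a withColumns call.
        return Frame({**self.columns, **new_columns})

Frame.copies = 0
chained = Frame({"id": [1, 2, 3]})
for i in range(10):
    chained = chained.with_column(f"c{i}", [0, 0, 0])
copies_chained = Frame.copies - 1  # exclude the base frame

Frame.copies = 0
batched = Frame({"id": [1, 2, 3]}).with_columns(
    {f"c{i}": [0, 0, 0] for i in range(10)}
)
copies_batched = Frame.copies - 1

print(copies_chained, copies_batched)  # 10 1
```

Both paths end with the same eleven columns; only the number of intermediate objects differs, which is the overhead a batched API avoids.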
[jira] [Assigned] (SPARK-20704) CRAN test should run single threaded
[ https://issues.apache.org/jira/browse/SPARK-20704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20704:
------------------------------------

    Assignee: Apache Spark

> CRAN test should run single threaded
> ------------------------------------
>
>                 Key: SPARK-20704
>                 URL: https://issues.apache.org/jira/browse/SPARK-20704
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.2.0
>            Reporter: Felix Cheung
>            Assignee: Apache Spark
[jira] [Assigned] (SPARK-20704) CRAN test should run single threaded
[ https://issues.apache.org/jira/browse/SPARK-20704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20704:
------------------------------------

    Assignee: (was: Apache Spark)

> CRAN test should run single threaded
> ------------------------------------
>
>                 Key: SPARK-20704
>                 URL: https://issues.apache.org/jira/browse/SPARK-20704
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.2.0
>            Reporter: Felix Cheung
[jira] [Commented] (SPARK-20704) CRAN test should run single threaded
[ https://issues.apache.org/jira/browse/SPARK-20704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005899#comment-16005899 ]

Apache Spark commented on SPARK-20704:
--------------------------------------

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/17945

> CRAN test should run single threaded
> ------------------------------------
>
>                 Key: SPARK-20704
>                 URL: https://issues.apache.org/jira/browse/SPARK-20704
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.2.0
>            Reporter: Felix Cheung
[jira] [Created] (SPARK-20704) CRAN test should run single threaded
Felix Cheung created SPARK-20704:
------------------------------------

             Summary: CRAN test should run single threaded
                 Key: SPARK-20704
                 URL: https://issues.apache.org/jira/browse/SPARK-20704
             Project: Spark
          Issue Type: Bug
          Components: SparkR
    Affects Versions: 2.2.0
            Reporter: Felix Cheung
[jira] [Commented] (SPARK-20590) Map default input data source formats to inlined classes
[ https://issues.apache.org/jira/browse/SPARK-20590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005869#comment-16005869 ]

Wenchen Fan commented on SPARK-20590:
-------------------------------------

We only prefer the internal data source if the given name is a short name like "csv", "json", etc. Using the full name still works.

> Map default input data source formats to inlined classes
> --------------------------------------------------------
>
>                 Key: SPARK-20590
>                 URL: https://issues.apache.org/jira/browse/SPARK-20590
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Sameer Agarwal
>            Assignee: Hyukjin Kwon
>             Fix For: 2.2.1, 2.3.0
>
> One of the common usability problems around reading data in spark (particularly CSV) is that there can often be a conflict between different readers in the classpath.
> As an example, if someone launches a 2.x spark shell with the spark-csv package in the classpath, Spark currently fails in an extremely unfriendly way:
> {code}
> ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
> scala> val df = spark.read.csv("/foo/bar.csv")
> java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name.
>   at scala.sys.package$.error(package.scala:27)
>   at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574)
>   at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85)
>   at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
>   ... 48 elided
> {code}
> This JIRA proposes a simple way of fixing this error by always mapping default input data source formats to inlined classes (that exist in Spark):
> {code}
> ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
> scala> val df = spark.read.csv("/foo/bar.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string]
> {code}
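The comment above describes the resolution rule: a short name like "csv" or "json" always maps to the built-in (inlined) implementation even when a conflicting external class is on the classpath, while a fully qualified class name still selects the external one. A rough sketch of that rule in Python; the registries and function names here are hypothetical stand-ins, the real logic lives in DataSource.lookupDataSource in Scala:

```python
# Hypothetical stand-ins for Spark's built-in format table and for data
# source classes discovered on the classpath.
BUILTIN = {
    "csv": "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat",
    "json": "org.apache.spark.sql.execution.datasources.json.JsonFileFormat",
}
CLASSPATH = {
    "com.databricks.spark.csv.DefaultSource15",
    "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat",
}

def lookup_data_source(name: str) -> str:
    """Short names resolve to the inlined built-in class, so the 'Multiple
    sources found' conflict never arises; a fully qualified name is used
    as-is, so external packages stay reachable."""
    if name in BUILTIN:
        return BUILTIN[name]
    if name in CLASSPATH:
        return name
    raise ValueError("Failed to find data source: " + name)

print(lookup_data_source("csv"))
print(lookup_data_source("com.databricks.spark.csv.DefaultSource15"))
```

This is why Felix Cheung's follow-up concern still has an answer: an external csv implementation remains usable, but only via its fully qualified class name.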
[jira] [Commented] (SPARK-20590) Map default input data source formats to inlined classes
[ https://issues.apache.org/jira/browse/SPARK-20590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005864#comment-16005864 ]

Felix Cheung commented on SPARK-20590:
--------------------------------------

When the user explicitly specifies the package to use, shouldn't that take priority over the internal one? Say a better csv implementation exists as a spark package; right now there is no way to use it.

> Map default input data source formats to inlined classes
> --------------------------------------------------------
>
>                 Key: SPARK-20590
>                 URL: https://issues.apache.org/jira/browse/SPARK-20590
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Sameer Agarwal
>            Assignee: Hyukjin Kwon
>             Fix For: 2.2.1, 2.3.0
>
> One of the common usability problems around reading data in spark (particularly CSV) is that there can often be a conflict between different readers in the classpath.
> As an example, if someone launches a 2.x spark shell with the spark-csv package in the classpath, Spark currently fails in an extremely unfriendly way:
> {code}
> ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
> scala> val df = spark.read.csv("/foo/bar.csv")
> java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name.
>   at scala.sys.package$.error(package.scala:27)
>   at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574)
>   at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85)
>   at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
>   ... 48 elided
> {code}
> This JIRA proposes a simple way of fixing this error by always mapping default input data source formats to inlined classes (that exist in Spark):
> {code}
> ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
> scala> val df = spark.read.csv("/foo/bar.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string]
> {code}
[jira] [Updated] (SPARK-20666) Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError
[ https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Cheung updated SPARK-20666:
---------------------------------
    Description: 
Seeing quite a bit of this on AppVeyor (aka Windows only) -> seems like in other test runs too; always only when running ML tests, it seems.

{code}
Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: Attempted to access garbage collected accumulator 159454
  at org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265)
  at org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261)
  at scala.Option.map(Option.scala:146)
  at org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261)
  at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88)
  at org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67)
  at org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
  at org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216)
  at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
  at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
  at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
  at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
  at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
  at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
  at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
  at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
  at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
  at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
  at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268)
  at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
1 MLlib recommendation algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\..
{code}

{code}
java.lang.IllegalStateException: SparkContext has been shutdown
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
  at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923)
  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474)
  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474)
  at org.apache.spark.sql.Dataset$$anonfun$57.apply(Dataset.scala:2907)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2906)
  at org.apache.spark.sql.Dataset.collect(Dataset.scala:2474)
  at org.apache.spark.sql.api.r.SQLUtils$.dfToCols(SQLUtils.scala:173)
  at org.apache.spark.sql.api.r.SQLUtils.dfToCols(SQLUtils.scala)
  at sun.reflect.GeneratedMethodAccessor104.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.api.r.RBackendHandler.hand
[jira] [Commented] (SPARK-20228) Random Forest instable results depending on spark.executor.memory
[ https://issues.apache.org/jira/browse/SPARK-20228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005844#comment-16005844 ]

Hyukjin Kwon commented on SPARK-20228:
--------------------------------------

gentle ping [~Ansgar Schulze]

> Random Forest instable results depending on spark.executor.memory
> -----------------------------------------------------------------
>
>                 Key: SPARK-20228
>                 URL: https://issues.apache.org/jira/browse/SPARK-20228
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.0
>            Reporter: Ansgar Schulze
>
> If I deploy a random forest model with, for example, spark.executor.memory=20480M, I get a different result than if I deploy the model with spark.executor.memory=6000M.
> I expected the same results but different runtimes.
[jira] [Commented] (SPARK-20369) pyspark: Dynamic configuration with SparkConf does not work
[ https://issues.apache.org/jira/browse/SPARK-20369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005842#comment-16005842 ]

Hyukjin Kwon commented on SPARK-20369:
--------------------------------------

I am resolving this as I can't reproduce it as above and it looks like the reporter is inactive.

> pyspark: Dynamic configuration with SparkConf does not work
> -----------------------------------------------------------
>
>                 Key: SPARK-20369
>                 URL: https://issues.apache.org/jira/browse/SPARK-20369
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.0
>         Environment: Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-40-generic x86_64) and Mac OS X 10.11.6
>            Reporter: Matthew McClain
>            Priority: Minor
>
> Setting spark properties dynamically in pyspark using a SparkConf object does not work. Here is the code that shows the bug:
> ---
> from pyspark import SparkContext, SparkConf
>
> def main():
>     conf = SparkConf().setAppName("spark-conf-test") \
>         .setMaster("local[2]") \
>         .set('spark.python.worker.memory', "1g") \
>         .set('spark.executor.memory', "3g") \
>         .set("spark.driver.maxResultSize", "2g")
>     print "Spark Config values in SparkConf:"
>     print conf.toDebugString()
>     sc = SparkContext(conf=conf)
>     print "Actual Spark Config values:"
>     print sc.getConf().toDebugString()
>
> if __name__ == "__main__":
>     main()
> ---
> Here is the output; none of the config values set in SparkConf are used in the SparkContext configuration:
> Spark Config values in SparkConf:
> spark.master=local[2]
> spark.executor.memory=3g
> spark.python.worker.memory=1g
> spark.app.name=spark-conf-test
> spark.driver.maxResultSize=2g
> 17/04/18 10:21:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Actual Spark Config values:
> spark.app.id=local-1492528885708
> spark.app.name=sandbox.py
> spark.driver.host=10.201.26.172
> spark.driver.maxResultSize=4g
> spark.driver.port=54657
> spark.executor.id=driver
> spark.files=file:/Users/matt.mcclain/dev/datascience-experiments/mmcclain/client_clusters/sandbox.py
> spark.master=local[*]
> spark.rdd.compress=True
> spark.serializer.objectStreamReset=100
> spark.submit.deployMode=client
[jira] [Resolved] (SPARK-20369) pyspark: Dynamic configuration with SparkConf does not work
[ https://issues.apache.org/jira/browse/SPARK-20369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-20369.
----------------------------------
    Resolution: Cannot Reproduce

> pyspark: Dynamic configuration with SparkConf does not work
> -----------------------------------------------------------
>
>                 Key: SPARK-20369
>                 URL: https://issues.apache.org/jira/browse/SPARK-20369
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.0
>         Environment: Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-40-generic x86_64) and Mac OS X 10.11.6
>            Reporter: Matthew McClain
>            Priority: Minor
>
> Setting spark properties dynamically in pyspark using a SparkConf object does not work. Here is the code that shows the bug:
> ---
> from pyspark import SparkContext, SparkConf
>
> def main():
>     conf = SparkConf().setAppName("spark-conf-test") \
>         .setMaster("local[2]") \
>         .set('spark.python.worker.memory', "1g") \
>         .set('spark.executor.memory', "3g") \
>         .set("spark.driver.maxResultSize", "2g")
>     print "Spark Config values in SparkConf:"
>     print conf.toDebugString()
>     sc = SparkContext(conf=conf)
>     print "Actual Spark Config values:"
>     print sc.getConf().toDebugString()
>
> if __name__ == "__main__":
>     main()
> ---
> Here is the output; none of the config values set in SparkConf are used in the SparkContext configuration:
> Spark Config values in SparkConf:
> spark.master=local[2]
> spark.executor.memory=3g
> spark.python.worker.memory=1g
> spark.app.name=spark-conf-test
> spark.driver.maxResultSize=2g
> 17/04/18 10:21:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Actual Spark Config values:
> spark.app.id=local-1492528885708
> spark.app.name=sandbox.py
> spark.driver.host=10.201.26.172
> spark.driver.maxResultSize=4g
> spark.driver.port=54657
> spark.executor.id=driver
> spark.files=file:/Users/matt.mcclain/dev/datascience-experiments/mmcclain/client_clusters/sandbox.py
> spark.master=local[*]
> spark.rdd.compress=True
> spark.serializer.objectStreamReset=100
> spark.submit.deployMode=client
[jira] [Commented] (SPARK-20606) ML 2.2 QA: Remove deprecated methods for ML
[ https://issues.apache.org/jira/browse/SPARK-20606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005835#comment-16005835 ]

Apache Spark commented on SPARK-20606:
--------------------------------------

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/17944

> ML 2.2 QA: Remove deprecated methods for ML
> -------------------------------------------
>
>                 Key: SPARK-20606
>                 URL: https://issues.apache.org/jira/browse/SPARK-20606
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark
>    Affects Versions: 2.1.0
>            Reporter: Yanbo Liang
>            Assignee: Yanbo Liang
>            Priority: Minor
>             Fix For: 2.2.0
>
> Remove ML methods we deprecated in 2.1.
[jira] [Commented] (SPARK-19354) Killed tasks are getting marked as FAILED
[ https://issues.apache.org/jira/browse/SPARK-19354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005823#comment-16005823 ]

Imran Rashid commented on SPARK-19354:
--------------------------------------

did a bit more searching -- isn't this fixed by SPARK-20217 & SPARK-20358?

> Killed tasks are getting marked as FAILED
> -----------------------------------------
>
>                 Key: SPARK-19354
>                 URL: https://issues.apache.org/jira/browse/SPARK-19354
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>            Reporter: Devaraj K
>
> When we enable speculation, we can see there are multiple attempts running for the same task when the first task's progress is slow. If any of the task attempts succeeds, then the other attempts will be killed; while being killed, those attempts get marked as failed due to the below error. We need to handle this error and mark the attempt as KILLED instead of FAILED.
> ||93 ||214 ||1 (speculative) ||FAILED ||ANY ||1 / xx.xx.xx.x2 stdout stderr ||2017/01/24 10:30:44 ||0.2 s ||0.0 B / 0 ||8.0 KB / 400 ||java.io.IOException: Failed on local exception: java.nio.channels.ClosedByInterruptException; Host Details : local host is: "node2/xx.xx.xx.x2"; destination host is: "node1":9000; +details||
> {code:xml}
> 17/01/23 23:54:32 INFO Executor: Executor is trying to kill task 93.1 in stage 1.0 (TID 214)
> 17/01/23 23:54:32 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
> 17/01/23 23:54:32 ERROR Executor: Exception in task 93.1 in stage 1.0 (TID 214)
> java.io.IOException: Failed on local exception: java.nio.channels.ClosedByInterruptException; Host Details : local host is: "stobdtserver3/10.224.54.70"; destination host is: "stobdtserver2":9000;
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1479)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1412)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>   at com.sun.proxy.$Proxy17.create(Unknown Source)
>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy18.create(Unknown Source)
>   at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1648)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624)
>   at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448)
>   at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444)
>   at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804)
>   at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
>   at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
>   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1133)
>   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1124)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:88)
>   at org.apache.spark.scheduler.Task.run(Task.scala:114)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.nio.channels.ClosedByInterruptException
>   at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>   at sun.nio.ch.SocketChannelI
[jira] [Commented] (SPARK-19354) Killed tasks are getting marked as FAILED
[ https://issues.apache.org/jira/browse/SPARK-19354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005817#comment-16005817 ] Thomas Graves commented on SPARK-19354: --- Right, from what I've seen it's not a blacklisting bug. It's a bug with speculative tasks being marked as failed rather than killed, which then leads to the executor being blacklisted. Not sure about the OOM part; I never saw that. > Killed tasks are getting marked as FAILED > - > > Key: SPARK-19354 > URL: https://issues.apache.org/jira/browse/SPARK-19354 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Reporter: Devaraj K > > When we enable speculation, we can see there are multiple attempts running > for the same task when the first task progress is slow. If any of the task > attempt succeeds then the other attempts will be killed, during killing the > attempts those attempts are getting marked as failed due to the below error. > We need to handle this error and mark the attempt as KILLED instead of FAILED. 
> ||93 ||214 ||1 (speculative) ||FAILED||ANY ||1 / > xx.xx.xx.x2 > stdout > stderr||2017/01/24 10:30:44 ||0.2 s ||0.0 B / 0 ||8.0 KB / 400 > ||java.io.IOException: Failed on local exception: > java.nio.channels.ClosedByInterruptException; Host Details : local host is: > "node2/xx.xx.xx.x2"; destination host is: "node1":9000; > +details|| > {code:xml} > 17/01/23 23:54:32 INFO Executor: Executor is trying to kill task 93.1 in > stage 1.0 (TID 214) > 17/01/23 23:54:32 INFO FileOutputCommitter: File Output Committer Algorithm > version is 1 > 17/01/23 23:54:32 ERROR Executor: Exception in task 93.1 in stage 1.0 (TID > 214) > java.io.IOException: Failed on local exception: > java.nio.channels.ClosedByInterruptException; Host Details : local host is: > "stobdtserver3/10.224.54.70"; destination host is: "stobdtserver2":9000; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776) > at org.apache.hadoop.ipc.Client.call(Client.java:1479) > at org.apache.hadoop.ipc.Client.call(Client.java:1412) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) > at com.sun.proxy.$Proxy17.create(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy18.create(Unknown Source) > at > org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1648) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689) > at 
org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624) > at > org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448) > at > org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804) > at > org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123) > at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1133) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1124) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:88) > at org.apache.spark.scheduler.Task.run(Task.scala:114) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.nio.channels.ClosedByInterruptException >
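The fix discussed in this thread amounts to checking, when a task attempt unwinds with an exception, whether a kill had already been requested for that attempt. Below is a minimal sketch of that classification logic in plain Python; the function name and return values are hypothetical (the real change lives in Spark's executor/scheduler code in Scala):

```python
def task_end_reason(exc, kill_requested):
    """Classify a finished task attempt.

    If the executor was already asked to kill this attempt (e.g. a
    speculative duplicate after another attempt succeeded), any exception
    unwinding out of it -- such as an IOException wrapping the
    ClosedByInterruptException from the interrupt -- should be reported
    as KILLED rather than FAILED, so it does not count toward failure
    limits or blacklisting.
    """
    if kill_requested:
        return "KILLED"
    return "FAILED" if exc is not None else "SUCCESS"

# A speculative attempt interrupted after another attempt succeeded:
print(task_end_reason(IOError("ClosedByInterruptException"), kill_requested=True))  # KILLED
```

The key design point is that the kill flag, not the exception type, drives the classification: interrupt-induced exceptions can surface in many forms, so inspecting the exception alone is fragile.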
[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005795#comment-16005795 ] Marcelo Vanzin commented on SPARK-20608: You still haven't understood what I'm saying. You should *NOT* be putting namenode addresses in {{spark.yarn.access.namenodes}} if you're using HA. You should be putting the namespace address. Simple. If that doesn't work, then there's a bug somewhere. But your current fix is wrong. > Standby namenodes should be allowed to included in > yarn.spark.access.namenodes to support HDFS HA > - > > Key: SPARK-20608 > URL: https://issues.apache.org/jira/browse/SPARK-20608 > Project: Spark > Issue Type: Improvement > Components: Spark Submit, YARN >Affects Versions: 2.0.1, 2.1.0 >Reporter: Yuechen Chen >Priority: Minor > Original Estimate: 672h > Remaining Estimate: 672h > > If one Spark Application need to access remote namenodes, > yarn.spark.access.namenodes should be only be configged in spark-submit > scripts, and Spark Client(On Yarn) would fetch HDFS credential periodically. > If one hadoop cluster is configured by HA, there would be one active namenode > and at least one standby namenode. > However, if yarn.spark.access.namenodes includes both active and standby > namenodes, Spark Application will be failed for the reason that the standby > namenode would not access by Spark for org.apache.hadoop.ipc.StandbyException. > I think it won't cause any bad effect to config standby namenodes in > yarn.spark.access.namenodes, and my Spark Application can be able to sustain > the failover of Hadoop namenode. > HA Examples: > Spark-submit script: > yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02 > Spark Application Codes: > dataframe.write.parquet(getActiveNameNode(...) 
+ hdfsPath) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
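Marcelo's point can be illustrated with a rough sketch of what the HDFS client's HA failover proxy does for a logical nameservice: the caller names the nameservice once, and active/standby resolution plus failover happen client-side, which is why listing both physical namenodes in `spark.yarn.access.namenodes` is unnecessary and the standby entry only produces a StandbyException. This is a simplified Python analogue, not HDFS client code; the names `nn1`, `nn2`, and the state map are hypothetical:

```python
class StandbyException(Exception):
    """Raised when an RPC is sent to a namenode in standby state."""

def rpc(namenode, states):
    # A standby namenode rejects client RPCs outright.
    if states[namenode] != "active":
        raise StandbyException(namenode)
    return "ok"

def call_via_nameservice(namenodes, states):
    """Mimic the HA failover proxy for a logical nameservice: try each
    configured namenode and fail over past StandbyException."""
    for nn in namenodes:
        try:
            return rpc(nn, states)
        except StandbyException:
            continue
    raise RuntimeError("no active namenode for nameservice")

states = {"nn1": "standby", "nn2": "active"}   # failover has happened
assert call_via_nameservice(["nn1", "nn2"], states) == "ok"
```

With HA configured, the application names only the logical nameservice (e.g. `hdfs://mycluster`, a hypothetical nameservice ID) and the client library performs the loop above internally.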
[jira] [Commented] (SPARK-20682) Support a new faster ORC data source based on Apache ORC
[ https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005789#comment-16005789 ] Apache Spark commented on SPARK-20682: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/17943 > Support a new faster ORC data source based on Apache ORC > > > Key: SPARK-20682 > URL: https://issues.apache.org/jira/browse/SPARK-20682 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1 >Reporter: Dongjoon Hyun > > Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module > with Hive dependency. This issue aims to add a new and faster ORC data source > inside `sql/core` and to replace the old ORC data source eventually. In this > issue, the latest Apache ORC 1.4.0 (released yesterday) is used. > There are four key benefits. > - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is > faster than the current implementation in Spark. > - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC > community more. > - Usability: User can use `ORC` data sources without hive module, i.e, > `-Phive`. > - Maintainability: Reduce the Hive dependency and can remove old legacy code > later.
[jira] [Commented] (SPARK-20703) Add an operator for writing data out
[ https://issues.apache.org/jira/browse/SPARK-20703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005780#comment-16005780 ] Liang-Chi Hsieh commented on SPARK-20703: - [~rxin] Thanks for pinging me. Sure, I'd love to take this. > Add an operator for writing data out > > > Key: SPARK-20703 > URL: https://issues.apache.org/jira/browse/SPARK-20703 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin > > We should add an operator for writing data out. Right now in the explain plan > / UI there is no way to tell whether a query is writing data out, and also > there is no way to associate metrics with data writes. It'd be tremendously > valuable to do this for adding metrics and for visibility.
[jira] [Created] (SPARK-20703) Add an operator for writing data out
Reynold Xin created SPARK-20703: --- Summary: Add an operator for writing data out Key: SPARK-20703 URL: https://issues.apache.org/jira/browse/SPARK-20703 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.2.0 Reporter: Reynold Xin We should add an operator for writing data out. Right now in the explain plan / UI there is no way to tell whether a query is writing data out, and also there is no way to associate metrics with data writes. It'd be tremendously valuable to do this for adding metrics and for visibility.
[jira] [Commented] (SPARK-20703) Add an operator for writing data out
[ https://issues.apache.org/jira/browse/SPARK-20703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005765#comment-16005765 ] Reynold Xin commented on SPARK-20703: - cc [~viirya] want to give this a try? > Add an operator for writing data out > > > Key: SPARK-20703 > URL: https://issues.apache.org/jira/browse/SPARK-20703 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin > > We should add an operator for writing data out. Right now in the explain plan > / UI there is no way to tell whether a query is writing data out, and also > there is no way to associate metrics with data writes. It'd be tremendously > valuable to do this for adding metrics and for visibility.
[jira] [Commented] (SPARK-19354) Killed tasks are getting marked as FAILED
[ https://issues.apache.org/jira/browse/SPARK-19354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005767#comment-16005767 ] Imran Rashid commented on SPARK-19354: -- [~tgraves] I haven't run into this yet -- frankly I still steer users away from speculation for the most part. I don't know of any fix for this. I can see how this messes up blacklisting in particular, but just to make sure I understand right, this isn't a blacklisting bug, right? the problem is that killing speculative tasks has some unintended side effects, right? IIUC, the original task looks like it failed for the wrong reason, and even worse, the entire executor dies, so other tasks running on the executor fail? I don't understand this part: bq. When sorter spill to disk, the task is killed. Then a interruptedExecption is thrown. Then OOM will be thrown how does the interrupted exception lead to an OOM, and killing the executor? I can't see how speculative execution could be used effectively if killing tasks can bring down an executor. > Killed tasks are getting marked as FAILED > - > > Key: SPARK-19354 > URL: https://issues.apache.org/jira/browse/SPARK-19354 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Reporter: Devaraj K > > When we enable speculation, we can see there are multiple attempts running > for the same task when the first task progress is slow. If any of the task > attempt succeeds then the other attempts will be killed, during killing the > attempts those attempts are getting marked as failed due to the below error. > We need to handle this error and mark the attempt as KILLED instead of FAILED. 
> ||93 ||214 ||1 (speculative) ||FAILED||ANY ||1 / > xx.xx.xx.x2 > stdout > stderr||2017/01/24 10:30:44 ||0.2 s ||0.0 B / 0 ||8.0 KB / 400 > ||java.io.IOException: Failed on local exception: > java.nio.channels.ClosedByInterruptException; Host Details : local host is: > "node2/xx.xx.xx.x2"; destination host is: "node1":9000; > +details|| > {code:xml} > 17/01/23 23:54:32 INFO Executor: Executor is trying to kill task 93.1 in > stage 1.0 (TID 214) > 17/01/23 23:54:32 INFO FileOutputCommitter: File Output Committer Algorithm > version is 1 > 17/01/23 23:54:32 ERROR Executor: Exception in task 93.1 in stage 1.0 (TID > 214) > java.io.IOException: Failed on local exception: > java.nio.channels.ClosedByInterruptException; Host Details : local host is: > "stobdtserver3/10.224.54.70"; destination host is: "stobdtserver2":9000; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776) > at org.apache.hadoop.ipc.Client.call(Client.java:1479) > at org.apache.hadoop.ipc.Client.call(Client.java:1412) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) > at com.sun.proxy.$Proxy17.create(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy18.create(Unknown Source) > at > org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1648) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689) > at 
org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624) > at > org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448) > at > org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804) > at > org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123) > at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1133
[jira] [Commented] (SPARK-20200) Flaky Test: org.apache.spark.rdd.LocalCheckpointSuite
[ https://issues.apache.org/jira/browse/SPARK-20200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005757#comment-16005757 ] Yuming Wang commented on SPARK-20200: - Can you check it again? it works for me. {code} build/sbt "test-only org.apache.spark.rdd.LocalCheckpointSuite" {code} > Flaky Test: org.apache.spark.rdd.LocalCheckpointSuite > - > > Key: SPARK-20200 > URL: https://issues.apache.org/jira/browse/SPARK-20200 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Takuya Ueshin >Priority: Minor > Labels: flaky-test > > This test failed recently here: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/2909/testReport/junit/org.apache.spark.rdd/LocalCheckpointSuite/missing_checkpoint_block_fails_with_informative_message/ > Dashboard > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.rdd.LocalCheckpointSuite&test_name=missing+checkpoint+block+fails+with+informative+message > Error Message > {code} > Collect should have failed if local checkpoint block is removed... > {code} > {code} > org.scalatest.exceptions.TestFailedException: Collect should have failed if > local checkpoint block is removed... 
> at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) > at org.scalatest.Assertions$class.fail(Assertions.scala:1328) > at org.scalatest.FunSuite.fail(FunSuite.scala:1555) > at > org.apache.spark.rdd.LocalCheckpointSuite$$anonfun$16.apply$mcV$sp(LocalCheckpointSuite.scala:173) > at > org.apache.spark.rdd.LocalCheckpointSuite$$anonfun$16.apply(LocalCheckpointSuite.scala:155) > at > org.apache.spark.rdd.LocalCheckpointSuite$$anonfun$16.apply(LocalCheckpointSuite.scala:155) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at > org.apache.spark.rdd.LocalCheckpointSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(LocalCheckpointSuite.scala:27) > at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) > at > org.apache.spark.rdd.LocalCheckpointSuite.runTest(LocalCheckpointSuite.scala:27) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > 
at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > at scala.collection.immutable.List.foreach(List.scala:381) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) > at org.scalatest.Suite$class.run(Suite.scala:1424) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at org.scalatest.SuperEngine.runImpl(Engine.scala:545) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:31) >
[jira] [Assigned] (SPARK-20702) TaskContextImpl.markTaskCompleted should not hide the original error
[ https://issues.apache.org/jira/browse/SPARK-20702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20702: Assignee: Apache Spark (was: Shixiong Zhu) > TaskContextImpl.markTaskCompleted should not hide the original error > > > Key: SPARK-20702 > URL: https://issues.apache.org/jira/browse/SPARK-20702 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1, 2.2.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > > If a TaskCompletionListener throws an error, > TaskContextImpl.markTaskCompleted will hide the original error.
[jira] [Assigned] (SPARK-20702) TaskContextImpl.markTaskCompleted should not hide the original error
[ https://issues.apache.org/jira/browse/SPARK-20702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20702: Assignee: Shixiong Zhu (was: Apache Spark) > TaskContextImpl.markTaskCompleted should not hide the original error > > > Key: SPARK-20702 > URL: https://issues.apache.org/jira/browse/SPARK-20702 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1, 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > If a TaskCompletionListener throws an error, > TaskContextImpl.markTaskCompleted will hide the original error.
[jira] [Commented] (SPARK-20702) TaskContextImpl.markTaskCompleted should not hide the original error
[ https://issues.apache.org/jira/browse/SPARK-20702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005722#comment-16005722 ] Apache Spark commented on SPARK-20702: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/17942 > TaskContextImpl.markTaskCompleted should not hide the original error > > > Key: SPARK-20702 > URL: https://issues.apache.org/jira/browse/SPARK-20702 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1, 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > If a TaskCompletionListener throws an error, > TaskContextImpl.markTaskCompleted will hide the original error.
[jira] [Created] (SPARK-20702) TaskContextImpl.markTaskCompleted should not hide the original error
Shixiong Zhu created SPARK-20702: Summary: TaskContextImpl.markTaskCompleted should not hide the original error Key: SPARK-20702 URL: https://issues.apache.org/jira/browse/SPARK-20702 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.1, 2.2.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu If a TaskCompletionListener throws an error, TaskContextImpl.markTaskCompleted will hide the original error.
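The behavior the title asks for is the standard suppressed-exception pattern: when completion listeners throw while the task is already failing, keep the original error primary and attach the listener failures to it instead of letting them mask it. A rough Python analogue (Spark's actual change is in Scala's TaskContextImpl; the `suppressed` attribute here stands in for Java's `Throwable.addSuppressed`):

```python
def mark_task_completed(listeners, original_error=None):
    """Run all completion listeners; never let a listener error hide the
    error the task is already failing with."""
    listener_errors = []
    for listener in listeners:
        try:
            listener()
        except Exception as exc:      # collect, don't abort the remaining listeners
            listener_errors.append(exc)
    # The original task error, if any, stays primary; otherwise the first
    # listener error becomes the reported failure.
    error = original_error or (listener_errors[0] if listener_errors else None)
    if error is not None:
        # Attach secondary failures rather than replacing the primary one.
        error.suppressed = [e for e in listener_errors if e is not error]
        raise error

def bad_listener():
    raise ValueError("listener boom")

try:
    mark_task_completed([bad_listener], original_error=RuntimeError("task failed"))
except RuntimeError as err:
    assert str(err) == "task failed"              # original error survives
    assert isinstance(err.suppressed[0], ValueError)  # listener error is attached
```

Running every listener before re-raising also matters: aborting on the first listener error would skip cleanup registered by later listeners.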
[jira] [Resolved] (SPARK-20685) BatchPythonEvaluation UDF evaluator fails for case of single UDF with repeated argument
[ https://issues.apache.org/jira/browse/SPARK-20685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20685. - Resolution: Fixed Fix Version/s: 2.3.0 2.2.1 2.1.2 > BatchPythonEvaluation UDF evaluator fails for case of single UDF with > repeated argument > --- > > Key: SPARK-20685 > URL: https://issues.apache.org/jira/browse/SPARK-20685 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.1.2, 2.2.1, 2.3.0 > > > There's a latent corner-case bug in PySpark UDF evaluation where executing a > stage with a single UDF that takes more than one argument _where that > argument is repeated_ will crash at execution with a confusing error. > Here's a repro: > {code} > from pyspark.sql.types import * > spark.catalog.registerFunction("add", lambda x, y: x + y, IntegerType()) > spark.sql("SELECT add(1, 1)").first() > {code} > This fails with > {code} > Caused by: org.apache.spark.api.python.PythonException: Traceback (most > recent call last): > File > "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 180, in main > process() > File > "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 175, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File > "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 107, in > func = lambda _, it: map(mapper, it) > File > "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 93, in > mapper = lambda a: udf(*a) > File > "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 71, in > return lambda *a: f(*a) > TypeError: <lambda>() takes exactly 2 arguments (1 given) > {code} > The problem was introduced by SPARK-14267: the code there has a fast path > for handling a batch UDF evaluation consisting of a single Python UDF, but > that branch incorrectly 
assumes that a single UDF won't have repeated > arguments and therefore skips the code for unpacking arguments from the input > row (whose schema may not necessarily match the UDF inputs). > I have a simple fix for this which I'll submit now.
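The failure mode described above can be reproduced without Spark at all: the fast path effectively deduplicated the UDF's input columns, so a two-argument function ended up invoked with a one-column row. A hypothetical plain-Python illustration of that mismatch:

```python
def add(x, y):
    return x + y

# What the buggy fast path effectively did for add(1, 1): the repeated
# argument was projected as a single input column...
deduplicated_row = (1,)

try:
    add(*deduplicated_row)        # ...but the UDF still expects two arguments.
except TypeError as err:
    # On Python 3: "add() missing 1 required positional argument: 'y'";
    # the JIRA trace shows the Python 2 wording of the same error.
    print(err)

# The fix keeps one input column per UDF argument, duplicates included:
full_row = (1, 1)
assert add(*full_row) == 2
```

This also explains why the error is "confusing": it surfaces as a TypeError deep inside the worker's lambda wrappers rather than pointing at the column-projection logic that actually dropped the argument.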
[jira] [Created] (SPARK-20701) dataframe.show has wrong white space when containing Supplement Unicode character
Pingsan Song created SPARK-20701: Summary: dataframe.show has wrong white space when containing Supplement Unicode character Key: SPARK-20701 URL: https://issues.apache.org/jira/browse/SPARK-20701 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 2.1.1 Environment: Mac, Hadoop 2.7 prebuilt Reporter: Pingsan Song Priority: Trivial The character in the String is \u1D400, repeated 4 times. I guess it would be the same for any supplementary Unicode character. scala> val testDF = sc.parallelize(Seq("")).toDF scala> testDF.show(false) ++ |value | ++ || ++
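A plausible explanation for the misaligned frame (hedged; not verified against Spark's showString implementation): column padding is computed from Java's String.length(), which counts UTF-16 code units, so a supplementary character such as U+1D400 contributes 2 to the computed width even though it occupies a single display cell. The discrepancy is easy to demonstrate:

```python
def utf16_length(s):
    """Length in UTF-16 code units, as Java's String.length() reports it.
    Code points above U+FFFF are encoded as surrogate pairs (2 units)."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in s)

s = "\U0001D400" * 4         # MATHEMATICAL BOLD CAPITAL A, repeated 4 times
assert len(s) == 4            # Python counts code points
assert utf16_length(s) == 8   # Java counts UTF-16 units, so padding is off by 4
```

Any width calculation based on code units (or even code points, once wide East Asian characters are involved) will diverge from the rendered terminal width, which is why the cell borders in the reported output do not line up.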
[jira] [Assigned] (SPARK-20684) expose createGlobalTempView and dropGlobalTempView in SparkR
[ https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20684: Assignee: Apache Spark > expose createGlobalTempView and dropGlobalTempView in SparkR > > > Key: SPARK-20684 > URL: https://issues.apache.org/jira/browse/SPARK-20684 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki >Assignee: Apache Spark > > This is a useful API that is not exposed in SparkR. It will help with moving > data between languages on a single Spark application.
[jira] [Assigned] (SPARK-20684) expose createGlobalTempView and dropGlobalTempView in SparkR
[ https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20684: Assignee: (was: Apache Spark) > expose createGlobalTempView and dropGlobalTempView in SparkR > > > Key: SPARK-20684 > URL: https://issues.apache.org/jira/browse/SPARK-20684 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > > This is a useful API that is not exposed in SparkR. It will help with moving > data between languages on a single Spark application.
[jira] [Commented] (SPARK-20684) expose createGlobalTempView and dropGlobalTempView in SparkR
[ https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005600#comment-16005600 ] Apache Spark commented on SPARK-20684: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/17941 > expose createGlobalTempView and dropGlobalTempView in SparkR > > > Key: SPARK-20684 > URL: https://issues.apache.org/jira/browse/SPARK-20684 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > > This is a useful API that is not exposed in SparkR. It will help with moving > data between languages on a single Spark application.
[jira] [Comment Edited] (SPARK-13210) NPE in Sort
[ https://issues.apache.org/jira/browse/SPARK-13210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005591#comment-16005591 ] David McWhorter edited comment on SPARK-13210 at 5/10/17 10:52 PM: --- I think it's the same error, here's the start of the stack trace with assertions disabled: Caused by: java.lang.NullPointerException at org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:347) at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortedIterator.loadNext(UnsafeInMemorySorter.java:301) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:216) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:171) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:245) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92) was (Author: dmcwhorter): I suppose that may be a different error actually... 
> NPE in Sort > --- > > Key: SPARK-13210 > URL: https://issues.apache.org/jira/browse/SPARK-13210 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Fix For: 1.6.1, 2.0.0 > > > When run TPCDS query Q78 with scale 10: > {code} > 16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = > 268435456 bytes, TID = 143 > 16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID > 143) > java.lang.NullPointerException > at > org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39) > at > org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270) > at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142) > at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:116) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:87) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:60) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45) > at org.apache.spark.scheduler.Task.run(Task.scala:81) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13210) NPE in Sort
[ https://issues.apache.org/jira/browse/SPARK-13210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005591#comment-16005591 ] David McWhorter commented on SPARK-13210: - I suppose that may be a different error actually... > NPE in Sort > --- > > Key: SPARK-13210 > URL: https://issues.apache.org/jira/browse/SPARK-13210 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Fix For: 1.6.1, 2.0.0 > > > When run TPCDS query Q78 with scale 10: > {code} > 16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = > 268435456 bytes, TID = 143 > 16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID > 143) > java.lang.NullPointerException > at > org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39) > at > org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270) > at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142) > at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:116) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:87) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:60) > at > 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45) > at org.apache.spark.scheduler.Task.run(Task.scala:81) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13210) NPE in Sort
[ https://issues.apache.org/jira/browse/SPARK-13210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005588#comment-16005588 ] David McWhorter commented on SPARK-13210: - [~srowen] Here's the error from Spark 2.1.1 with assertions enabled: [Stage 28:> (0 + 7) / 60][Stage 36:> (7 + 1) / 93][Stage 37:> (0 + 0) / 11]2017-05-10 18:43:05 ERROR TaskContextImpl:91 - Error in TaskCompletionListener java.lang.AssertionError at org.apache.spark.memory.TaskMemoryManager.freePage(TaskMemoryManager.java:288) at org.apache.spark.memory.MemoryConsumer.freePage(MemoryConsumer.java:140) at org.apache.spark.memory.MemoryConsumer.freeArray(MemoryConsumer.java:110) at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.free(UnsafeInMemorySorter.java:166) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.cleanupResources(UnsafeExternalSorter.java:329) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$1.onTaskCompletion(UnsafeExternalSorter.java:169) at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:97) at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:95) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:95) at org.apache.spark.scheduler.Task.run(Task.scala:112) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2017-05-10 18:43:06 ERROR Executor:91 - Exception in task 1.0 in stage 28.0 (TID 465) org.apache.spark.util.TaskCompletionListenerException at 
org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:105) at org.apache.spark.scheduler.Task.run(Task.scala:112) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) [Stage 28:> (0 + 8) / 60][Stage 36:> (7 + 1) / 93][Stage 37:> (0 + 0) / 11]2017-05-10 18:43:06 WARN TaskSetManager:66 - Lost task 1.0 in stage 28.0 (TID 465, localhost, executor driver): org.apache.spark.util.TaskCompletionListenerException at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:105) at org.apache.spark.scheduler.Task.run(Task.scala:112) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2017-05-10 18:43:06 ERROR TaskSetManager:70 - Task 1 in stage 28.0 failed 1 times; aborting job > NPE in Sort > --- > > Key: SPARK-13210 > URL: https://issues.apache.org/jira/browse/SPARK-13210 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Fix For: 1.6.1, 2.0.0 > > > When run TPCDS query Q78 with scale 10: > {code} > 16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = > 268435456 bytes, TID = 143 > 16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID > 143) > java.lang.NullPointerException > at > org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60) > at > 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39) > at > org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270) > at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142) > at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415) > at > org.apache.spark.sql.execution.UnsafeEx
[jira] [Commented] (SPARK-20684) expose createGlobalTempView and dropGlobalTempView in SparkR
[ https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005553#comment-16005553 ] Dongjoon Hyun commented on SPARK-20684: --- Thank you for confirming! > expose createGlobalTempView and dropGlobalTempView in SparkR > > > Key: SPARK-20684 > URL: https://issues.apache.org/jira/browse/SPARK-20684 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > > This is a useful API that is not exposed in SparkR. It will help with moving > data between languages in a single Spark application. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20684) expose createGlobalTempView and dropGlobalTempView in SparkR
[ https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-20684: --- Summary: expose createGlobalTempView and dropGlobalTempView in SparkR (was: expose createGlobalTempView in SparkR) > expose createGlobalTempView and dropGlobalTempView in SparkR > > > Key: SPARK-20684 > URL: https://issues.apache.org/jira/browse/SPARK-20684 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > > This is a useful API that is not exposed in SparkR. It will help with moving > data between languages in a single Spark application. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20684) expose createGlobalTempView in SparkR
[ https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005552#comment-16005552 ] Hossein Falaki commented on SPARK-20684: Yes I agree. > expose createGlobalTempView in SparkR > - > > Key: SPARK-20684 > URL: https://issues.apache.org/jira/browse/SPARK-20684 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > > This is a useful API that is not exposed in SparkR. It will help with moving > data between languages in a single Spark application. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20684) expose createGlobalTempView in SparkR
[ https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005545#comment-16005545 ] Dongjoon Hyun commented on SPARK-20684: --- We need `dropGlobalTempView`, too. > expose createGlobalTempView in SparkR > - > > Key: SPARK-20684 > URL: https://issues.apache.org/jira/browse/SPARK-20684 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > > This is a useful API that is not exposed in SparkR. It will help with moving > data between languages in a single Spark application. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20684) expose createGlobalTempView in SparkR
[ https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005527#comment-16005527 ] Dongjoon Hyun commented on SPARK-20684: --- Hi, [~falaki]. I'll make a PR for this. > expose createGlobalTempView in SparkR > - > > Key: SPARK-20684 > URL: https://issues.apache.org/jira/browse/SPARK-20684 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > > This is a useful API that is not exposed in SparkR. It will help with moving > data between languages in a single Spark application. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20700) InferFiltersFromConstraints stackoverflows for query (v2)
[ https://issues.apache.org/jira/browse/SPARK-20700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-20700: --- Description: The following (complicated) query eventually fails with a stack overflow during optimization: {code} CREATE TEMPORARY VIEW table_5(varchar0002_col_1, smallint_col_2, float_col_3, int_col_4, string_col_5, timestamp_col_6, string_col_7) AS VALUES ('68', CAST(NULL AS SMALLINT), CAST(244.90413 AS FLOAT), -137, '571', TIMESTAMP('2015-01-14 00:00:00.0'), '947'), ('82', CAST(213 AS SMALLINT), CAST(53.184647 AS FLOAT), -724, '-278', TIMESTAMP('1999-08-15 00:00:00.0'), '437'), ('-7', CAST(-15 AS SMALLINT), CAST(NULL AS FLOAT), -890, '778', TIMESTAMP('1991-05-23 00:00:00.0'), '630'), ('22', CAST(676 AS SMALLINT), CAST(385.27386 AS FLOAT), CAST(NULL AS INT), '-10', TIMESTAMP('1996-09-29 00:00:00.0'), '641'), ('16', CAST(430 AS SMALLINT), CAST(187.23717 AS FLOAT), 989, CAST(NULL AS STRING), TIMESTAMP('2024-04-21 00:00:00.0'), '-234'), ('83', CAST(760 AS SMALLINT), CAST(-695.45386 AS FLOAT), -970, '330', CAST(NULL AS TIMESTAMP), '-740'), ('68', CAST(-930 AS SMALLINT), CAST(NULL AS FLOAT), -915, '-766', CAST(NULL AS TIMESTAMP), CAST(NULL AS STRING)), ('48', CAST(692 AS SMALLINT), CAST(-220.59615 AS FLOAT), 940, '-514', CAST(NULL AS TIMESTAMP), '181'), ('21', CAST(44 AS SMALLINT), CAST(NULL AS FLOAT), -175, '761', TIMESTAMP('2016-06-30 00:00:00.0'), '487'), ('50', CAST(953 AS SMALLINT), CAST(837.2948 AS FLOAT), 705, CAST(NULL AS STRING), CAST(NULL AS TIMESTAMP), '-62'); CREATE VIEW bools(a, b) as values (1, true), (1, true), (1, null); SELECT AVG(-13) OVER (ORDER BY COUNT(t1.smallint_col_2) DESC ROWS 27 PRECEDING ) AS float_col, COUNT(t1.smallint_col_2) AS int_col FROM table_5 t1 INNER JOIN ( SELECT (MIN(-83) OVER (PARTITION BY t2.a ORDER BY t2.a, (t1.int_col_4) * (t1.int_col_4) ROWS BETWEEN CURRENT ROW AND 15 FOLLOWING)) NOT IN (-222, 928) AS boolean_col, t2.a, (t1.int_col_4) * (t1.int_col_4) AS int_col FROM 
table_5 t1 LEFT JOIN bools t2 ON (t2.a) = (t1.int_col_4) WHERE (t1.smallint_col_2) > (t1.smallint_col_2) GROUP BY t2.a, (t1.int_col_4) * (t1.int_col_4) HAVING ((t1.int_col_4) * (t1.int_col_4)) IN ((t1.int_col_4) * (t1.int_col_4), SUM(t1.int_col_4)) ) t2 ON (((t2.int_col) = (t1.int_col_4)) AND ((t2.a) = (t1.int_col_4))) AND ((t2.a) = (t1.smallint_col_2)); {code} (I haven't tried to minimize this failing case yet). Based on sampled jstacks from the driver, it looks like the query might be repeatedly inferring filters from constraints and then pruning those filters. Here's part of the stack at the point where it stackoverflows: {code} [... repeats ...] at org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50) at org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50) at org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at scala.collection.immutable.List.flatMap(List.scala:344) at org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50) at org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50) at 
org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at scala.collection.immutable.List.flatMap(List.scala:344) at org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50) at org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50) at org.apache.spark.sq
[jira] [Updated] (SPARK-20700) InferFiltersFromConstraints stackoverflows for query (v2)
[ https://issues.apache.org/jira/browse/SPARK-20700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-20700: --- Summary: InferFiltersFromConstraints stackoverflows for query (v2) (was: Expression canonicalization hits stack overflow for query) > InferFiltersFromConstraints stackoverflows for query (v2) > - > > Key: SPARK-20700 > URL: https://issues.apache.org/jira/browse/SPARK-20700 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.2.0 >Reporter: Josh Rosen > > The following (complicated) query eventually fails with a stack overflow > during optimization: > {code} > CREATE TEMPORARY VIEW table_5(varchar0002_col_1, smallint_col_2, float_col_3, > int_col_4, string_col_5, timestamp_col_6, string_col_7) AS VALUES > ('68', CAST(NULL AS SMALLINT), CAST(244.90413 AS FLOAT), -137, '571', > TIMESTAMP('2015-01-14 00:00:00.0'), '947'), > ('82', CAST(213 AS SMALLINT), CAST(53.184647 AS FLOAT), -724, '-278', > TIMESTAMP('1999-08-15 00:00:00.0'), '437'), > ('-7', CAST(-15 AS SMALLINT), CAST(NULL AS FLOAT), -890, '778', > TIMESTAMP('1991-05-23 00:00:00.0'), '630'), > ('22', CAST(676 AS SMALLINT), CAST(385.27386 AS FLOAT), CAST(NULL AS INT), > '-10', TIMESTAMP('1996-09-29 00:00:00.0'), '641'), > ('16', CAST(430 AS SMALLINT), CAST(187.23717 AS FLOAT), 989, CAST(NULL AS > STRING), TIMESTAMP('2024-04-21 00:00:00.0'), '-234'), > ('83', CAST(760 AS SMALLINT), CAST(-695.45386 AS FLOAT), -970, '330', > CAST(NULL AS TIMESTAMP), '-740'), > ('68', CAST(-930 AS SMALLINT), CAST(NULL AS FLOAT), -915, '-766', CAST(NULL > AS TIMESTAMP), CAST(NULL AS STRING)), > ('48', CAST(692 AS SMALLINT), CAST(-220.59615 AS FLOAT), 940, '-514', > CAST(NULL AS TIMESTAMP), '181'), > ('21', CAST(44 AS SMALLINT), CAST(NULL AS FLOAT), -175, '761', > TIMESTAMP('2016-06-30 00:00:00.0'), '487'), > ('50', CAST(953 AS SMALLINT), CAST(837.2948 AS FLOAT), 705, CAST(NULL AS > STRING), CAST(NULL AS TIMESTAMP), '-62'); > CREATE VIEW bools(a, b) as 
values (1, true), (1, true), (1, null); > SELECT > AVG(-13) OVER (ORDER BY COUNT(t1.smallint_col_2) DESC ROWS 27 PRECEDING ) AS > float_col, > COUNT(t1.smallint_col_2) AS int_col > FROM table_5 t1 > INNER JOIN ( > SELECT > (MIN(-83) OVER (PARTITION BY t2.a ORDER BY t2.a, (t1.int_col_4) * > (t1.int_col_4) ROWS BETWEEN CURRENT ROW AND 15 FOLLOWING)) NOT IN (-222, 928) > AS boolean_col, > t2.a, > (t1.int_col_4) * (t1.int_col_4) AS int_col > FROM table_5 t1 > LEFT JOIN bools t2 ON (t2.a) = (t1.int_col_4) > WHERE > (t1.smallint_col_2) > (t1.smallint_col_2) > GROUP BY > t2.a, > (t1.int_col_4) * (t1.int_col_4) > HAVING > ((t1.int_col_4) * (t1.int_col_4)) IN ((t1.int_col_4) * (t1.int_col_4), > SUM(t1.int_col_4)) > ) t2 ON (((t2.int_col) = (t1.int_col_4)) AND ((t2.a) = (t1.int_col_4))) AND > ((t2.a) = (t1.smallint_col_2)); > {code} > (I haven't tried to minimize this failing case yet). > Based on sampled jstacks from the driver, it looks like the query might be > repeatedly inferring filters from constraints and then pruning those filters. > Here's part of the stack at the point where it stackoverflows: > {code} > [... repeats ...] 
> at > org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50) > at > org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50) > at > org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50) > at > org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50) > at > org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.
[jira] [Created] (SPARK-20700) Expression canonicalization hits stack overflow for query
Josh Rosen created SPARK-20700: -- Summary: Expression canonicalization hits stack overflow for query Key: SPARK-20700 URL: https://issues.apache.org/jira/browse/SPARK-20700 Project: Spark Issue Type: Bug Components: Optimizer, SQL Affects Versions: 2.2.0 Reporter: Josh Rosen The following (complicated) query eventually fails with a stack overflow during optimization: {code} CREATE TEMPORARY VIEW table_5(varchar0002_col_1, smallint_col_2, float_col_3, int_col_4, string_col_5, timestamp_col_6, string_col_7) AS VALUES ('68', CAST(NULL AS SMALLINT), CAST(244.90413 AS FLOAT), -137, '571', TIMESTAMP('2015-01-14 00:00:00.0'), '947'), ('82', CAST(213 AS SMALLINT), CAST(53.184647 AS FLOAT), -724, '-278', TIMESTAMP('1999-08-15 00:00:00.0'), '437'), ('-7', CAST(-15 AS SMALLINT), CAST(NULL AS FLOAT), -890, '778', TIMESTAMP('1991-05-23 00:00:00.0'), '630'), ('22', CAST(676 AS SMALLINT), CAST(385.27386 AS FLOAT), CAST(NULL AS INT), '-10', TIMESTAMP('1996-09-29 00:00:00.0'), '641'), ('16', CAST(430 AS SMALLINT), CAST(187.23717 AS FLOAT), 989, CAST(NULL AS STRING), TIMESTAMP('2024-04-21 00:00:00.0'), '-234'), ('83', CAST(760 AS SMALLINT), CAST(-695.45386 AS FLOAT), -970, '330', CAST(NULL AS TIMESTAMP), '-740'), ('68', CAST(-930 AS SMALLINT), CAST(NULL AS FLOAT), -915, '-766', CAST(NULL AS TIMESTAMP), CAST(NULL AS STRING)), ('48', CAST(692 AS SMALLINT), CAST(-220.59615 AS FLOAT), 940, '-514', CAST(NULL AS TIMESTAMP), '181'), ('21', CAST(44 AS SMALLINT), CAST(NULL AS FLOAT), -175, '761', TIMESTAMP('2016-06-30 00:00:00.0'), '487'), ('50', CAST(953 AS SMALLINT), CAST(837.2948 AS FLOAT), 705, CAST(NULL AS STRING), CAST(NULL AS TIMESTAMP), '-62'); CREATE VIEW bools(a, b) as values (1, true), (1, true), (1, null); SELECT AVG(-13) OVER (ORDER BY COUNT(t1.smallint_col_2) DESC ROWS 27 PRECEDING ) AS float_col, COUNT(t1.smallint_col_2) AS int_col FROM table_5 t1 INNER JOIN ( SELECT (MIN(-83) OVER (PARTITION BY t2.a ORDER BY t2.a, (t1.int_col_4) * (t1.int_col_4) ROWS BETWEEN CURRENT ROW AND 
15 FOLLOWING)) NOT IN (-222, 928) AS boolean_col, t2.a, (t1.int_col_4) * (t1.int_col_4) AS int_col FROM table_5 t1 LEFT JOIN bools t2 ON (t2.a) = (t1.int_col_4) WHERE (t1.smallint_col_2) > (t1.smallint_col_2) GROUP BY t2.a, (t1.int_col_4) * (t1.int_col_4) HAVING ((t1.int_col_4) * (t1.int_col_4)) IN ((t1.int_col_4) * (t1.int_col_4), SUM(t1.int_col_4)) ) t2 ON (((t2.int_col) = (t1.int_col_4)) AND ((t2.a) = (t1.int_col_4))) AND ((t2.a) = (t1.smallint_col_2)); {code} (I haven't tried to minimize this failing case yet). Based on sampled jstacks from the driver, it looks like the query might be repeatedly inferring filters from constraints and then pruning those filters. Here's part of the stack at the point where it stackoverflows: {code} [... repeats ...] at org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50) at org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50) at org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at scala.collection.immutable.List.flatMap(List.scala:344) at org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50) at org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50) at 
org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at scala.collection.immutable.List.flatMap(List.scala:344) at org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50) at org
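The repeating `gatherCommutative` frames in the traces above suggest that recursion depth grows with the nesting of commutative expressions produced during optimization. As a rough stand-alone illustration (plain Python, hypothetical names, not Spark's actual `Canonicalize` code), a naive recursive flattening of commutative operators exhausts the call stack once the generated expression tree is nested deeply enough:

```python
import sys

def gather_commutative(expr):
    """Recursively flatten nested ('+', left, right) nodes into a flat
    operand list; recursion depth grows with the expression's nesting."""
    if isinstance(expr, tuple) and expr[0] == "+":
        return gather_commutative(expr[1]) + gather_commutative(expr[2])
    return [expr]

# Build a deeply left-nested sum ((((1 + 1) + 1) + ...) + 1), nested
# well past the interpreter's recursion limit.
deep = 1
for _ in range(sys.getrecursionlimit() * 2):
    deep = ("+", deep, 1)

try:
    gather_commutative(deep)
    overflowed = False
except RecursionError:
    overflowed = True

print(overflowed)  # True: the tree exceeded the recursion limit
```

The sketch only shows the failure mode, not Spark's fix; for machine-generated queries like the one above, any depth-proportional recursion in the optimizer is fragile unless the traversal is made iterative or the inference is bounded.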
[jira] [Assigned] (SPARK-20504) ML 2.2 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-20504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-20504: - Assignee: Weichen Xu (was: Joseph K. Bradley) > ML 2.2 QA: API: Java compatibility, docs > > > Key: SPARK-20504 > URL: https://issues.apache.org/jira/browse/SPARK-20504 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Weichen Xu >Priority: Blocker > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). 
> * If needed for complex issues, create small Java unit tests which execute > each method. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are no great tools. In the past, this task has been done by: > ** Generating API docs > ** Building the JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so, so that we can make this > task easier in the future! -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20699) The end of Python stdout/stderr streams may be lost by PythonRunner
Nick Gates created SPARK-20699: -- Summary: The end of Python stdout/stderr streams may be lost by PythonRunner Key: SPARK-20699 URL: https://issues.apache.org/jira/browse/SPARK-20699 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.1.1 Reporter: Nick Gates Priority: Minor The RedirectThread that copies the Python stdout/stderr streams is never joined, so PythonRunner may throw an exception before all of the output has been copied.
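The failure mode described above — an un-joined copier thread dropping the tail of a stream — can be sketched in plain Java. This is an illustrative standalone sketch, not Spark's actual PythonRunner or RedirectThread code; the class and method names are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

public class RedirectJoinSketch {
    // Drains everything written to a pipe via a background copier thread,
    // joining the copier before returning -- the step the report says
    // PythonRunner skips, which can lose the end of the stream.
    static String drain() {
        try {
            PipedOutputStream src = new PipedOutputStream();
            PipedInputStream sink = new PipedInputStream(src);
            ByteArrayOutputStream captured = new ByteArrayOutputStream();

            // Copier thread, analogous in role to Spark's RedirectThread.
            Thread redirect = new Thread(() -> {
                try {
                    byte[] buf = new byte[1024];
                    int n;
                    while ((n = sink.read(buf)) != -1) {
                        captured.write(buf, 0, n);
                    }
                } catch (IOException ignored) {
                }
            });
            redirect.start();

            src.write("final line of python stderr\n".getBytes());
            src.close(); // writer closed: the copier will see EOF

            // Without this join(), the caller could observe `captured`
            // before the copier has drained the tail of the stream.
            redirect.join();
            return captured.toString().trim();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(drain());
    }
}
```

With the join() in place the output is deterministic; removing it turns the read into a race against process teardown, which is the kind of intermittent truncation the issue describes.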
[jira] [Resolved] (SPARK-20698) =, ==, > is not working as expected when used in sql query
[ https://issues.apache.org/jira/browse/SPARK-20698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20698. --- Resolution: Invalid Fix Version/s: (was: 1.6.2) This isn't a place to ask people to debug your code. You're better off posting a much narrowed down version to StackOverflow, with your data. > =, ==, > is not working as expected when used in sql query > -- > > Key: SPARK-20698 > URL: https://issues.apache.org/jira/browse/SPARK-20698 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 > Environment: windows >Reporter: someshwar kale >Priority: Critical > > I have written below spark program- its not working as expected > {code} > package computedBatch; > import org.apache.log4j.Level; > import org.apache.log4j.Logger; > import org.apache.spark.SparkConf; > import org.apache.spark.api.java.JavaRDD; > import org.apache.spark.api.java.JavaSparkContext; > import org.apache.spark.api.java.function.Function; > import org.apache.spark.sql.DataFrame; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.RowFactory; > import org.apache.spark.sql.SQLContext; > import org.apache.spark.sql.hive.HiveContext; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > import java.util.ArrayList; > import java.util.Arrays; > import java.util.List; > public class ArithmeticIssueTest { > private transient JavaSparkContext javaSparkContext; > private transient SQLContext sqlContext; > public ArithmeticIssueTest() { > Logger.getLogger("org").setLevel(Level.OFF); > Logger.getLogger("akka").setLevel(Level.OFF); > SparkConf conf = new > SparkConf().setAppName("ArithmeticIssueTest").setMaster("local[4]"); > javaSparkContext = new JavaSparkContext(conf); > sqlContext = new HiveContext(javaSparkContext); > } > public static void main(String[] args) { > ArithmeticIssueTest arithmeticIssueTest = new 
ArithmeticIssueTest(); > arithmeticIssueTest.execute(); > } > private void execute(){ > List data = Arrays.asList( > "a1,1494389759,99.8793003568,325.389705932", > "a1,1494389759,99.9472573803,325.27559502", > "a1,1494389759,99.7887233987,325.334374851", > "a1,1494389759,99.9547800925,325.371537062", > "a1,1494389759,99.8039111691,325.305285877", > "a1,1494389759,99.8342317379,325.24881354", > "a1,1494389759,99.9849449235,325.396678931", > "a1,1494389759,99.9396731311,325.336115345", > "a1,1494389759,99.9320915068,325.242622938", > "a1,1494389759,99.894669,325.320965146", > "a1,1494389759,99.7735359781,325.345168334", > "a1,1494389759,99.9698837734,325.352291407", > "a1,1494389759,99.8418330703,325.296539372", > "a1,1494389759,99.796315751,325.347570632", > "a1,1494389759,99.7811931613,325.351137315", > "a1,1494389759,99.9773765104,325.218131741", > "a1,1494389759,99.8189825201,325.288197381", > "a1,1494389759,99.8115005369,325.282327633", > "a1,1494389759,99.9924539722,325.24048614", > "a1,1494389759,99.9170191204,325.299431664"); > JavaRDD rawData = javaSparkContext.parallelize(data); > List fields = new ArrayList<>(); > fields.add(DataTypes.createStructField("ASSET_ID", > DataTypes.StringType, true)); > fields.add(DataTypes.createStructField("TIMESTAMP", > DataTypes.LongType, true)); > fields.add(DataTypes.createStructField("fuel", DataTypes.DoubleType, > true)); > fields.add(DataTypes.createStructField("temperature", > DataTypes.DoubleType, true)); > StructType schema = DataTypes.createStructType(fields); > JavaRDD rowRDD = rawData.map( > (Function) record -> { > String[] fields1 = record.split(","); > return RowFactory.create( > fields1[0].trim(), > Long.parseLong(fields1[1].trim()), > Double.parseDouble(fields1[2].trim()), > Double.parseDouble(fields1[3].trim())); > }); > DataFrame df = sqlContext.createDataFrame(rowRDD, schema); > df.show(false); > df.registerTempTable("x_linkx1087571272_filtered"); > sqlContext.sql("SELECT 
x_linkx1087571272_filtered.ASSET_ID, > count(case when x_linkx1087571272_filtered" + > ".temperature=325.0 then 1 else 0 end) AS
[jira] [Updated] (SPARK-20698) =, ==, > is not working as expected when used in sql query
[ https://issues.apache.org/jira/browse/SPARK-20698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] someshwar kale updated SPARK-20698: --- Description: I have written below spark program- its not working as expected {code} package computedBatch; import org.apache.log4j.Level; import org.apache.log4j.Logger; import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.api.java.function.Function; import org.apache.spark.sql.DataFrame; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.SQLContext; import org.apache.spark.sql.hive.HiveContext; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; import java.util.ArrayList; import java.util.Arrays; import java.util.List; public class ArithmeticIssueTest { private transient JavaSparkContext javaSparkContext; private transient SQLContext sqlContext; public ArithmeticIssueTest() { Logger.getLogger("org").setLevel(Level.OFF); Logger.getLogger("akka").setLevel(Level.OFF); SparkConf conf = new SparkConf().setAppName("ArithmeticIssueTest").setMaster("local[4]"); javaSparkContext = new JavaSparkContext(conf); sqlContext = new HiveContext(javaSparkContext); } public static void main(String[] args) { ArithmeticIssueTest arithmeticIssueTest = new ArithmeticIssueTest(); arithmeticIssueTest.execute(); } private void execute(){ List data = Arrays.asList( "a1,1494389759,99.8793003568,325.389705932", "a1,1494389759,99.9472573803,325.27559502", "a1,1494389759,99.7887233987,325.334374851", "a1,1494389759,99.9547800925,325.371537062", "a1,1494389759,99.8039111691,325.305285877", "a1,1494389759,99.8342317379,325.24881354", "a1,1494389759,99.9849449235,325.396678931", "a1,1494389759,99.9396731311,325.336115345", "a1,1494389759,99.9320915068,325.242622938", "a1,1494389759,99.894669,325.320965146", 
"a1,1494389759,99.7735359781,325.345168334", "a1,1494389759,99.9698837734,325.352291407", "a1,1494389759,99.8418330703,325.296539372", "a1,1494389759,99.796315751,325.347570632", "a1,1494389759,99.7811931613,325.351137315", "a1,1494389759,99.9773765104,325.218131741", "a1,1494389759,99.8189825201,325.288197381", "a1,1494389759,99.8115005369,325.282327633", "a1,1494389759,99.9924539722,325.24048614", "a1,1494389759,99.9170191204,325.299431664"); JavaRDD rawData = javaSparkContext.parallelize(data); List fields = new ArrayList<>(); fields.add(DataTypes.createStructField("ASSET_ID", DataTypes.StringType, true)); fields.add(DataTypes.createStructField("TIMESTAMP", DataTypes.LongType, true)); fields.add(DataTypes.createStructField("fuel", DataTypes.DoubleType, true)); fields.add(DataTypes.createStructField("temperature", DataTypes.DoubleType, true)); StructType schema = DataTypes.createStructType(fields); JavaRDD rowRDD = rawData.map( (Function) record -> { String[] fields1 = record.split(","); return RowFactory.create( fields1[0].trim(), Long.parseLong(fields1[1].trim()), Double.parseDouble(fields1[2].trim()), Double.parseDouble(fields1[3].trim())); }); DataFrame df = sqlContext.createDataFrame(rowRDD, schema); df.show(false); df.registerTempTable("x_linkx1087571272_filtered"); sqlContext.sql("SELECT x_linkx1087571272_filtered.ASSET_ID, count(case when x_linkx1087571272_filtered" + ".temperature=325.0 then 1 else 0 end) AS xsumptionx1582594572, max(x_linkx1087571272_filtered" + ".TIMESTAMP) AS eventTime FROM x_linkx1087571272_filtered GROUP BY x_linkx1087571272_filtered" + ".ASSET_ID").show(false); sqlContext.sql("SELECT x_linkx1087571272_filtered.ASSET_ID, count(case when x_linkx1087571272_filtered" + ".fuel>99.8 then 1 else 0 end) AS xnsumptionx352569416, max(x_linkx1087571272_filtered.TIMESTAMP) AS " + "eventTime FROM x_linkx1087571272_filtered GROUP BY x_linkx1087571272_filtered.ASSET_ID").show(false); //+ sqlContext.sql("SELECT x_linkx1087571272_filtered.ASSET_ID, 
count(case when x_linkx1087571272_filtered" + ".temperature==325.0 then 1 else 0 end) AS xsumptionx15825945
[jira] [Created] (SPARK-20698) =, ==, > is not working as expected when used in sql query
someshwar kale created SPARK-20698: -- Summary: =, ==, > is not working as expected when used in sql query Key: SPARK-20698 URL: https://issues.apache.org/jira/browse/SPARK-20698 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.2 Environment: windows Reporter: someshwar kale Priority: Critical Fix For: 1.6.2 I have written below spark program- its not working as expected package computedBatch; import org.apache.log4j.Level; import org.apache.log4j.Logger; import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.api.java.function.Function; import org.apache.spark.sql.DataFrame; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.SQLContext; import org.apache.spark.sql.hive.HiveContext; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; import java.util.ArrayList; import java.util.Arrays; import java.util.List; public class ArithmeticIssueTest { private transient JavaSparkContext javaSparkContext; private transient SQLContext sqlContext; public ArithmeticIssueTest() { Logger.getLogger("org").setLevel(Level.OFF); Logger.getLogger("akka").setLevel(Level.OFF); SparkConf conf = new SparkConf().setAppName("ArithmeticIssueTest").setMaster("local[4]"); javaSparkContext = new JavaSparkContext(conf); sqlContext = new HiveContext(javaSparkContext); } public static void main(String[] args) { ArithmeticIssueTest arithmeticIssueTest = new ArithmeticIssueTest(); arithmeticIssueTest.execute(); } private void execute(){ List data = Arrays.asList( "a1,1494389759,99.8793003568,325.389705932", "a1,1494389759,99.9472573803,325.27559502", "a1,1494389759,99.7887233987,325.334374851", "a1,1494389759,99.9547800925,325.371537062", "a1,1494389759,99.8039111691,325.305285877", "a1,1494389759,99.8342317379,325.24881354", 
"a1,1494389759,99.9849449235,325.396678931", "a1,1494389759,99.9396731311,325.336115345", "a1,1494389759,99.9320915068,325.242622938", "a1,1494389759,99.894669,325.320965146", "a1,1494389759,99.7735359781,325.345168334", "a1,1494389759,99.9698837734,325.352291407", "a1,1494389759,99.8418330703,325.296539372", "a1,1494389759,99.796315751,325.347570632", "a1,1494389759,99.7811931613,325.351137315", "a1,1494389759,99.9773765104,325.218131741", "a1,1494389759,99.8189825201,325.288197381", "a1,1494389759,99.8115005369,325.282327633", "a1,1494389759,99.9924539722,325.24048614", "a1,1494389759,99.9170191204,325.299431664"); JavaRDD rawData = javaSparkContext.parallelize(data); List fields = new ArrayList<>(); fields.add(DataTypes.createStructField("ASSET_ID", DataTypes.StringType, true)); fields.add(DataTypes.createStructField("TIMESTAMP", DataTypes.LongType, true)); fields.add(DataTypes.createStructField("fuel", DataTypes.DoubleType, true)); fields.add(DataTypes.createStructField("temperature", DataTypes.DoubleType, true)); StructType schema = DataTypes.createStructType(fields); JavaRDD rowRDD = rawData.map( (Function) record -> { String[] fields1 = record.split(","); return RowFactory.create( fields1[0].trim(), Long.parseLong(fields1[1].trim()), Double.parseDouble(fields1[2].trim()), Double.parseDouble(fields1[3].trim())); }); DataFrame df = sqlContext.createDataFrame(rowRDD, schema); df.show(false); df.registerTempTable("x_linkx1087571272_filtered"); sqlContext.sql("SELECT x_linkx1087571272_filtered.ASSET_ID, count(case when x_linkx1087571272_filtered" + ".temperature=325.0 then 1 else 0 end) AS xsumptionx1582594572, max(x_linkx1087571272_filtered" + ".TIMESTAMP) AS eventTime FROM x_linkx1087571272_filtered GROUP BY x_linkx1087571272_filtered" + ".ASSET_ID").show(false); sqlContext.sql("SELECT x_linkx1087571272_filtered.ASSET_ID, count(case when x_linkx1087571272_filtered" + ".fuel>99.8 then 1 else 0 end) AS xnsumptionx352569416, 
max(x_linkx1087571272_filtered.TIMESTAMP) AS " + "eventTime FROM x_li
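Leaving the `=` vs `==` question aside, a likely culprit in the queries above is the aggregate itself: SQL's COUNT counts non-NULL values, and `case when ... then 1 else 0 end` never yields NULL, so `count(...)` over it returns the total row count whether or not the predicate matches. `sum(...)` over the same CASE, or dropping the ELSE so non-matching rows become NULL, gives the intended tally. A plain-Java sketch of the two semantics (the list contents are illustrative, standing in for a CASE evaluated over five rows of which three match):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class CountVsSumSketch {
    // COUNT(expr) semantics: number of non-NULL values.
    static long countNonNull(List<Integer> vals) {
        return vals.stream().filter(Objects::nonNull).count();
    }

    // SUM(expr) semantics: arithmetic total of non-NULL values.
    static long sum(List<Integer> vals) {
        return vals.stream().filter(Objects::nonNull)
                   .mapToLong(Integer::longValue).sum();
    }

    public static void main(String[] args) {
        // case when cond then 1 else 0 end -> every row yields 1 or 0
        List<Integer> withElseZero = List.of(1, 0, 1, 0, 1);
        // case when cond then 1 end -> non-matching rows yield NULL
        List<Integer> withoutElse = Arrays.asList(1, null, 1, null, 1);

        System.out.println(countNonNull(withElseZero)); // 5: counts every row
        System.out.println(sum(withElseZero));          // 3: the intended tally
        System.out.println(countNonNull(withoutElse));  // 3: also correct
    }
}
```

This is why the reported counts look insensitive to the comparison operator: with an ELSE 0 branch, COUNT returns the group size regardless of how many rows satisfy the predicate.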
[jira] [Comment Edited] (SPARK-19354) Killed tasks are getting marked as FAILED
[ https://issues.apache.org/jira/browse/SPARK-19354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005189#comment-16005189 ] Thomas Graves edited comment on SPARK-19354 at 5/10/17 6:37 PM: [~irashid] wondering if you have seen the issue with blacklisting or perhaps there is another fix for that already? was (Author: tgraves): [~squito] wondering if you have seen the issue with blacklisting or perhaps there is another fix for that already? > Killed tasks are getting marked as FAILED > - > > Key: SPARK-19354 > URL: https://issues.apache.org/jira/browse/SPARK-19354 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Reporter: Devaraj K > > When we enable speculation, we can see there are multiple attempts running > for the same task when the first task progress is slow. If any of the task > attempt succeeds then the other attempts will be killed, during killing the > attempts those attempts are getting marked as failed due to the below error. > We need to handle this error and mark the attempt as KILLED instead of FAILED. 
> ||93 ||214 ||1 (speculative) ||FAILED||ANY ||1 / > xx.xx.xx.x2 > stdout > stderr||2017/01/24 10:30:44 ||0.2 s ||0.0 B / 0 ||8.0 KB / 400 > ||java.io.IOException: Failed on local exception: > java.nio.channels.ClosedByInterruptException; Host Details : local host is: > "node2/xx.xx.xx.x2"; destination host is: "node1":9000; > +details|| > {code:xml} > 17/01/23 23:54:32 INFO Executor: Executor is trying to kill task 93.1 in > stage 1.0 (TID 214) > 17/01/23 23:54:32 INFO FileOutputCommitter: File Output Committer Algorithm > version is 1 > 17/01/23 23:54:32 ERROR Executor: Exception in task 93.1 in stage 1.0 (TID > 214) > java.io.IOException: Failed on local exception: > java.nio.channels.ClosedByInterruptException; Host Details : local host is: > "stobdtserver3/10.224.54.70"; destination host is: "stobdtserver2":9000; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776) > at org.apache.hadoop.ipc.Client.call(Client.java:1479) > at org.apache.hadoop.ipc.Client.call(Client.java:1412) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) > at com.sun.proxy.$Proxy17.create(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy18.create(Unknown Source) > at > org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1648) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689) > at 
org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624) > at > org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448) > at > org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804) > at > org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123) > at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1133) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1124) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:88) > at org.apache.spark.scheduler.Task.run(Task.scala:114) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.
[jira] [Updated] (SPARK-19354) Killed tasks are getting marked as FAILED
[ https://issues.apache.org/jira/browse/SPARK-19354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-19354: -- Priority: Major (was: Minor) > Killed tasks are getting marked as FAILED > - > > Key: SPARK-19354 > URL: https://issues.apache.org/jira/browse/SPARK-19354 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Reporter: Devaraj K > > When we enable speculation, we can see there are multiple attempts running > for the same task when the first task progress is slow. If any of the task > attempt succeeds then the other attempts will be killed, during killing the > attempts those attempts are getting marked as failed due to the below error. > We need to handle this error and mark the attempt as KILLED instead of FAILED. > ||93 ||214 ||1 (speculative) ||FAILED||ANY ||1 / > xx.xx.xx.x2 > stdout > stderr||2017/01/24 10:30:44 ||0.2 s ||0.0 B / 0 ||8.0 KB / 400 > ||java.io.IOException: Failed on local exception: > java.nio.channels.ClosedByInterruptException; Host Details : local host is: > "node2/xx.xx.xx.x2"; destination host is: "node1":9000; > +details|| > {code:xml} > 17/01/23 23:54:32 INFO Executor: Executor is trying to kill task 93.1 in > stage 1.0 (TID 214) > 17/01/23 23:54:32 INFO FileOutputCommitter: File Output Committer Algorithm > version is 1 > 17/01/23 23:54:32 ERROR Executor: Exception in task 93.1 in stage 1.0 (TID > 214) > java.io.IOException: Failed on local exception: > java.nio.channels.ClosedByInterruptException; Host Details : local host is: > "stobdtserver3/10.224.54.70"; destination host is: "stobdtserver2":9000; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776) > at org.apache.hadoop.ipc.Client.call(Client.java:1479) > at org.apache.hadoop.ipc.Client.call(Client.java:1412) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) > at com.sun.proxy.$Proxy17.create(Unknown Source) > at > 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy18.create(Unknown Source) > at > org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1648) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624) > at > org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448) > at > org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804) > at > org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123) > at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1133) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1124) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:88) > at 
org.apache.spark.scheduler.Task.run(Task.scala:114) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.nio.channels.ClosedByInterruptException > at > java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202) > at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:659) > at > org.apache.hadoop.net.SocketIOWithTimeout.co
[jira] [Commented] (SPARK-19354) Killed tasks are getting marked as FAILED
[ https://issues.apache.org/jira/browse/SPARK-19354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005189#comment-16005189 ] Thomas Graves commented on SPARK-19354: --- [~squito] wondering if you have seen the issue with blacklisting or perhaps there is another fix for that already? > Killed tasks are getting marked as FAILED > - > > Key: SPARK-19354 > URL: https://issues.apache.org/jira/browse/SPARK-19354 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Reporter: Devaraj K > > When we enable speculation, we can see there are multiple attempts running > for the same task when the first task progress is slow. If any of the task > attempt succeeds then the other attempts will be killed, during killing the > attempts those attempts are getting marked as failed due to the below error. > We need to handle this error and mark the attempt as KILLED instead of FAILED. > ||93 ||214 ||1 (speculative) ||FAILED||ANY ||1 / > xx.xx.xx.x2 > stdout > stderr||2017/01/24 10:30:44 ||0.2 s ||0.0 B / 0 ||8.0 KB / 400 > ||java.io.IOException: Failed on local exception: > java.nio.channels.ClosedByInterruptException; Host Details : local host is: > "node2/xx.xx.xx.x2"; destination host is: "node1":9000; > +details|| > {code:xml} > 17/01/23 23:54:32 INFO Executor: Executor is trying to kill task 93.1 in > stage 1.0 (TID 214) > 17/01/23 23:54:32 INFO FileOutputCommitter: File Output Committer Algorithm > version is 1 > 17/01/23 23:54:32 ERROR Executor: Exception in task 93.1 in stage 1.0 (TID > 214) > java.io.IOException: Failed on local exception: > java.nio.channels.ClosedByInterruptException; Host Details : local host is: > "stobdtserver3/10.224.54.70"; destination host is: "stobdtserver2":9000; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776) > at org.apache.hadoop.ipc.Client.call(Client.java:1479) > at org.apache.hadoop.ipc.Client.call(Client.java:1412) > at > 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) > at com.sun.proxy.$Proxy17.create(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy18.create(Unknown Source) > at > org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1648) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624) > at > org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448) > at > org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804) > at > org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123) > at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1133) > at > 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1124) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:88) > at org.apache.spark.scheduler.Task.run(Task.scala:114) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.nio.channels.ClosedByInterruptException > at > java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java
[jira] [Updated] (SPARK-19354) Killed tasks are getting marked as FAILED
[ https://issues.apache.org/jira/browse/SPARK-19354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-19354: -- Issue Type: Bug (was: Improvement) > Killed tasks are getting marked as FAILED > - > > Key: SPARK-19354 > URL: https://issues.apache.org/jira/browse/SPARK-19354 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Reporter: Devaraj K >Priority: Minor > > When we enable speculation, we can see multiple attempts running > for the same task when the first attempt's progress is slow. If any of the task > attempts succeeds, the other attempts are killed; while being killed, those > attempts get marked as FAILED due to the error below. > We need to handle this error and mark the attempt as KILLED instead of FAILED. > ||93 ||214 ||1 (speculative) ||FAILED||ANY ||1 / > xx.xx.xx.x2 > stdout > stderr||2017/01/24 10:30:44 ||0.2 s ||0.0 B / 0 ||8.0 KB / 400 > ||java.io.IOException: Failed on local exception: > java.nio.channels.ClosedByInterruptException; Host Details : local host is: > "node2/xx.xx.xx.x2"; destination host is: > "node1":9000; > +details|| > {code:xml} > 17/01/23 23:54:32 INFO Executor: Executor is trying to kill task 93.1 in > stage 1.0 (TID 214) > 17/01/23 23:54:32 INFO FileOutputCommitter: File Output Committer Algorithm > version is 1 > 17/01/23 23:54:32 ERROR Executor: Exception in task 93.1 in stage 1.0 (TID > 214) > java.io.IOException: Failed on local exception: > java.nio.channels.ClosedByInterruptException; Host Details : local host is: > "stobdtserver3/10.224.54.70"; destination host is: "stobdtserver2":9000; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776) > at org.apache.hadoop.ipc.Client.call(Client.java:1479) > at org.apache.hadoop.ipc.Client.call(Client.java:1412) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) > at com.sun.proxy.$Proxy17.create(Unknown Source) > at > 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy18.create(Unknown Source) > at > org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1648) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624) > at > org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448) > at > org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804) > at > org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123) > at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1133) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1124) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:88) > at 
org.apache.spark.scheduler.Task.run(Task.scala:114) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.nio.channels.ClosedByInterruptException > at > java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202) > at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:659) > at > org.apach
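The change the issue asks for, reclassifying an interrupt seen while a kill is in flight as KILLED rather than FAILED, boils down to a state-classification check. Below is a minimal sketch of that logic in Python; the real change would live in Spark's Scala executor/scheduler code, and the function name and inputs here are illustrative stand-ins, not Spark's API.

```python
# Hedged sketch (not Spark's actual implementation): when a task throws an
# interruption-style exception after a kill was requested for it (e.g. a
# speculative duplicate already succeeded), report the attempt as KILLED so
# it does not count toward failure limits or blacklisting.

def classify_task_end(exception_name: str, kill_requested: bool) -> str:
    """Map a task's final exception to a task state."""
    interrupt_like = {
        "InterruptedException",
        "ClosedByInterruptException",  # the exception seen in the trace above
    }
    if kill_requested and exception_name in interrupt_like:
        return "KILLED"
    return "FAILED"

# A speculative attempt interrupted during HDFS output is KILLED, not FAILED:
print(classify_task_end("ClosedByInterruptException", kill_requested=True))   # KILLED
# The same exception without a pending kill is still a genuine failure:
print(classify_task_end("ClosedByInterruptException", kill_requested=False))  # FAILED
```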
[jira] [Commented] (SPARK-19354) Killed tasks are getting marked as FAILED
[ https://issues.apache.org/jira/browse/SPARK-19354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005182#comment-16005182 ] Thomas Graves commented on SPARK-19354: --- This is definitely causing issues with blacklisting. Speculative tasks can come back marked as FAILED, e.g.: 17/05/10 17:14:38 ERROR Executor: Exception in task 71476.0 in stage 52.0 (TID 317274) java.nio.channels.ClosedByInterruptException The executor then gets marked as blacklisted: https://github.com/apache/spark/blob/branch-2.1/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L793 > Killed tasks are getting marked as FAILED > - > > Key: SPARK-19354 > URL: https://issues.apache.org/jira/browse/SPARK-19354 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core >Reporter: Devaraj K >Priority: Minor > > When we enable speculation, we can see multiple attempts running > for the same task when the first attempt's progress is slow. If any of the task > attempts succeeds, the other attempts are killed; while being killed, those > attempts get marked as FAILED due to the error below. > We need to handle this error and mark the attempt as KILLED instead of FAILED. 
> ||93 ||214 ||1 (speculative) ||FAILED||ANY ||1 / > xx.xx.xx.x2 > stdout > stderr||2017/01/24 10:30:44 ||0.2 s ||0.0 B / 0 ||8.0 KB / 400 > ||java.io.IOException: Failed on local exception: > java.nio.channels.ClosedByInterruptException; Host Details : local host is: > "node2/xx.xx.xx.x2"; destination host is: "node1":9000; > +details|| > {code:xml} > 17/01/23 23:54:32 INFO Executor: Executor is trying to kill task 93.1 in > stage 1.0 (TID 214) > 17/01/23 23:54:32 INFO FileOutputCommitter: File Output Committer Algorithm > version is 1 > 17/01/23 23:54:32 ERROR Executor: Exception in task 93.1 in stage 1.0 (TID > 214) > java.io.IOException: Failed on local exception: > java.nio.channels.ClosedByInterruptException; Host Details : local host is: > "stobdtserver3/10.224.54.70"; destination host is: "stobdtserver2":9000; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776) > at org.apache.hadoop.ipc.Client.call(Client.java:1479) > at org.apache.hadoop.ipc.Client.call(Client.java:1412) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) > at com.sun.proxy.$Proxy17.create(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy18.create(Unknown Source) > at > org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1648) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689) > at 
org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624) > at > org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448) > at > org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804) > at > org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123) > at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1133) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1124) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:88) > at org.apache.spark.scheduler.Task.run(Task.scala:114) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadP
[jira] [Commented] (SPARK-20687) mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix
[ https://issues.apache.org/jira/browse/SPARK-20687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005179#comment-16005179 ] Ignacio Bermudez Corrales commented on SPARK-20687: --- Proposing a patch in PR https://github.com/apache/spark/pull/17940 > mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix > > > Key: SPARK-20687 > URL: https://issues.apache.org/jira/browse/SPARK-20687 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.1.1 >Reporter: Ignacio Bermudez Corrales >Priority: Minor > > Conversion of Breeze sparse matrices to Matrix is broken when matrices are > the product of certain operations. I think this problem is caused by the update > method in Breeze's CSCMatrix, which adds provisional zeros to the data for > efficiency. > This bug is serious and may affect at least BlockMatrix addition and > subtraction: > http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add/43883458#43883458 > The following code reproduces the bug (check test("breeze conversion bug")): > https://github.com/ghoto/spark/blob/test-bug/CSCMatrixBreeze/mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala > {code:title=MatricesSuite.scala|borderStyle=solid} > test("breeze conversion bug") { > // (2, 0, 0) > // (2, 0, 0) > val mat1Brz = Matrices.sparse(2, 3, Array(0, 2, 2, 2), Array(0, 1), > Array(2, 2)).asBreeze > // (2, 1E-15, 1E-15) > // (2, 1E-15, 1E-15) > val mat2Brz = Matrices.sparse(2, 3, Array(0, 2, 4, 6), Array(0, 0, 0, 1, > 1, 1), Array(2, 1E-15, 1E-15, 2, 1E-15, 1E-15)).asBreeze > // The following shouldn't break > val t01 = mat1Brz - mat1Brz > val t02 = mat2Brz - mat2Brz > val t02Brz = Matrices.fromBreeze(t02) > val t01Brz = Matrices.fromBreeze(t01) > val t1Brz = mat1Brz - mat2Brz > val t2Brz = mat2Brz - mat1Brz > // The following ones should break > val t1 = Matrices.fromBreeze(t1Brz) > val t2 = Matrices.fromBreeze(t2Brz) > } > {code} -- This message 
was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004607#comment-16004607 ] Zoltan Ivanfi edited comment on SPARK-12297 at 5/10/17 6:26 PM: bq. It'd be great to consider this more holistically and think about alternatives in fixing them As Ryan mentioned, the Parquet community discussed this timestamp incompatibility problem with the aim of avoiding similar problems in the future. It was decided that the specification needs to include two separate types with well-defined semantics: one for timezone-agnostic (aka. TIMESTAMP WITHOUT TIMEZONE) and one for UTC-normalized (aka. TIMESTAMP WITH TIMEZONE) timestamps. (Otherwise implementors would be tempted to misuse the single existing type for storing timestamps of different semantics, as has already happened with the int96 timestamp type.) Using these two types, SQL engines will be able to unambiguously store their timestamp type regardless of its semantics. However, the TIMESTAMP type should follow TIMESTAMP WITHOUT TIMEZONE semantics for consistency with other SQL engines. The TIMESTAMP WITH TIMEZONE semantics should be implemented as a new SQL type with a matching name. While this is a nice and clean long-term solution, a short-term fix is also desired until the new types become widely supported and/or to allow dealing with existing data. The commit in question is a part of this short-term fix and it allows getting correct values when reading int96 timestamps, even for data written by other components. bq. it completely changes the behavior of one of the most important data types. A very important aspect of this fix is that it does not change SparkSQL's behavior unless the user sets a table property, so it's a completely safe and non-breaking change. bq. One of the fundamental problem is that Spark treats timestamp as timestamp with timezone, whereas impala treats timestamp as timestamp without timezone. 
The parquet storage is only a small piece here. The fix indeed only addresses Parquet timestamps. This, however, is intentional and is not a limitation, nor an inconsistency, as the problem seems to be specific to Parquet. My understanding is that for other file formats, SparkSQL follows timezone-agnostic (TIMESTAMP WITHOUT TIMEZONE) semantics and my experiments with the CSV and Avro formats seem to confirm this. So using UTC-normalized (TIMESTAMP WITH TIMEZONE) semantics in Parquet is not only incompatible with Impala but is also inconsistent within SparkSQL itself. bq. Also this is not just a Parquet issue. The same issue could happen to all data formats. It is going to be really confusing to have something that only works for Parquet The current behavior of SparkSQL already seems to be different for Parquet than for other formats. The fix allows the user to choose a consistent and less confusing behaviour instead. It also makes Impala, Hive and SparkSQL compatible with each other regarding int96 timestamps. bq. It seems like the purpose of this patch can be accomplished by just setting the session local timezone to UTC? Unfortunately that would not suffice. The problem has to be addressed in all SQL engines. As of today, Hive and Impala already contain the changes that allow interoperability using the parquet.mr.int96.write.zone table property: * Hive: ** https://github.com/apache/hive/commit/84fdc1c7c8ff0922aa44f829dbfa9659935c503e ** https://github.com/apache/hive/commit/a1cbccb8dad1824f978205a1e93ec01e87ed8ed5 ** https://github.com/apache/hive/commit/2dfcea5a95b7d623484b8be50755b817fbc91ce0 ** https://github.com/apache/hive/commit/78e29fc70dacec498c35dc556dd7403e4c9f48fe * Impala: ** https://github.com/apache/incubator-impala/commit/5803a0b0744ddaee6830d4a1bc8dba8d3f2caa26 was (Author: zi): bq. 
It'd be great to consider this more holistically and think about alternatives in fixing them As Ryan mentioned, the Parquet community discussed this timestamp incompatibilty problem with the aim of avoiding similar problems in the future. It was decided that the specification needs to include two separate types with well-defined semantics: one for timezone-agnostic (aka. TIMESTAMP WITHOUT TIMEZONE) and one for UTC-normalized (aka. TIMESTAMP WITH TIMEZONE) timestamps. (Otherwise implementors would be tempted to misuse the single existing type for storing timestamps of different semantics, as it already happened with the int96 timestamp type). While this is a nice and clean long-term solution, a short-term fix is also desired until the new types become widely supported and/or to allow dealing with existing data. The commit in question is a part of this short-term fix and it allows getting correct values when reading int96 timestamps, even for data written by other components. bq. it completely changes the behavior of one of the most i
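The two timestamp semantics under discussion can be made concrete with plain Python datetimes. This is an illustration of the semantics only, not of Spark's or Parquet's implementation; the zones and values are invented for the example.

```python
from datetime import datetime, timezone, timedelta

# UTC-normalized ("WITH TIMEZONE") semantics: store the absolute instant and
# render it in the reader's zone. Timezone-agnostic ("WITHOUT TIMEZONE")
# semantics: store the wall-clock digits and render the same digits everywhere.

wall_clock = datetime(2017, 5, 10, 12, 0, 0)  # "2017-05-10 12:00:00" as written

# A writer in UTC+2 using UTC-normalized semantics stores the epoch instant:
writer_zone = timezone(timedelta(hours=2))
stored_instant = wall_clock.replace(tzinfo=writer_zone).astimezone(timezone.utc)

# A reader in UTC-5 then sees a *different* wall clock under WITH TIMEZONE:
reader_zone = timezone(timedelta(hours=-5))
print(stored_instant.astimezone(reader_zone).strftime("%H:%M"))  # 05:00

# Under WITHOUT TIMEZONE semantics the stored digits come back unchanged:
print(wall_clock.strftime("%H:%M"))  # 12:00
```

The int96 interoperability problem is exactly that different engines applied different semantics to the same stored bits, so the same file yielded different wall-clock values.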
[jira] [Assigned] (SPARK-20504) ML 2.2 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-20504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-20504: - Assignee: Joseph K. Bradley > ML 2.2 QA: API: Java compatibility, docs > > > Key: SPARK-20504 > URL: https://issues.apache.org/jira/browse/SPARK-20504 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Blocker > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). 
> * If needed for complex issues, create small Java unit tests which execute > each method. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are no great tools. In the past, this task has been done by: > ** Generating API docs > ** Building the JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so, so we can make this > task easier in the future!
[jira] [Updated] (SPARK-20697) MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables.
[ https://issues.apache.org/jira/browse/SPARK-20697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Madav updated SPARK-20697: --- Description: MSCK REPAIR TABLE, used to recover partitions for a partitioned+bucketed table, does not restore the bucketing information to the storage descriptor in the metastore. Steps to reproduce: 1) Create a partitioned+bucketed table in Hive: CREATE TABLE partbucket(a int) PARTITIONED BY (b int) CLUSTERED BY (a) INTO 10 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; 2) In Hive-CLI, issue a desc formatted for the table. # col_name data_type comment a int # Partition Information # col_name data_type comment b int # Detailed Table Information Database: sparkhivebucket Owner: devbld CreateTime: Wed May 10 10:31:07 PDT 2017 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://localhost:8020/user/hive/warehouse/partbucket Table Type: MANAGED_TABLE Table Parameters: transient_lastDdlTime 1494437467 # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.TextInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Compressed: No Num Buckets: 10 Bucket Columns: [a] Sort Columns: [] Storage Desc Params: field.delim , serialization.format, 3) In spark-shell: scala> spark.sql("MSCK REPAIR TABLE partbucket") 4) Back in Hive-CLI: desc formatted partbucket; # col_name data_type comment a int # Partition Information # col_name data_type comment b int # Detailed Table Information Database: sparkhivebucket Owner: devbld CreateTime: Wed May 10 10:31:07 PDT 2017 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://localhost:8020/user/hive/warehouse/sparkhivebucket.db/partbucket Table Type: MANAGED_TABLE Table Parameters: spark.sql.partitionProvider catalog transient_lastDdlTime 1494437647 # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe 
InputFormat:org.apache.hadoop.mapred.TextInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Compressed: No Num Buckets:-1 Bucket Columns: [] Sort Columns: [] Storage Desc Params: field.delim , serialization.format, Further inserts to this table cannot be made in bucketed fashion through Hive. was: MSCK REPAIR TABLE used to recover partitions for a partitioned+bucketed table does not restore the bucketing information to the storage descriptor in the metastore. Steps to reproduce: 1) Create a paritioned+bucketed table in hive: CREATE TABLE partbucket(a int) PARTITIONED BY (b int) CLUSTERED BY (a) INTO 10 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; 2) In Hive-CLI issue a desc formatted for the table. # col_name data_type comment a int # Partition Information # col_name data_type comment b int # Detailed Table Information Database: sparkhivebucket Owner:
[jira] [Created] (SPARK-20697) MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables.
Abhishek Madav created SPARK-20697: -- Summary: MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables. Key: SPARK-20697 URL: https://issues.apache.org/jira/browse/SPARK-20697 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Abhishek Madav MSCK REPAIR TABLE, used to recover partitions for a partitioned+bucketed table, does not restore the bucketing information to the storage descriptor in the metastore. Steps to reproduce: 1) Create a partitioned+bucketed table in Hive: CREATE TABLE partbucket(a int) PARTITIONED BY (b int) CLUSTERED BY (a) INTO 10 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; 2) In Hive-CLI, issue a desc formatted for the table. # col_name data_type comment a int # Partition Information # col_name data_type comment b int # Detailed Table Information Database: sparkhivebucket Owner: devbld CreateTime: Wed May 10 10:31:07 PDT 2017 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://localhost:8020/user/hive/warehouse/partbucket Table Type: MANAGED_TABLE Table Parameters: transient_lastDdlTime 1494437467 # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.TextInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Compressed: No Num Buckets: 10 Bucket Columns: [a] Sort Columns: [] Storage Desc Params: field.delim , serialization.format, 3) In spark-shell: scala> spark.sql("MSCK REPAIR TABLE partbucket") 4) Back in Hive-CLI: desc formatted partbucket; # col_name data_type comment a int # Partition Information # col_name data_type comment b int # Detailed Table Information Database: sparkhivebucket Owner: devbld CreateTime: Wed May 10 10:31:07 PDT 2017 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://localhost:8020/user/hive/warehouse/sparkhivebucket.db/partbucket Table Type: MANAGED_TABLE Table Parameters: spark.sql.partitionProvider catalog 
transient_lastDdlTime 1494437647 # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.TextInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Compressed: No Num Buckets: -1 Bucket Columns: [] Sort Columns: [] Storage Desc Params: field.delim , serialization.format,
[jira] [Closed] (SPARK-19213) FileSourceScanExec uses SparkSession from HadoopFsRelation creation time instead of the active session at execution time
[ https://issues.apache.org/jira/browse/SPARK-19213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kruszewski closed SPARK-19213. - Resolution: Won't Fix Not really necessary and can lead to confusing results > FileSourceScanExec uses SparkSession from HadoopFsRelation creation time > instead of the active session at execution time > > > Key: SPARK-19213 > URL: https://issues.apache.org/jira/browse/SPARK-19213 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Robert Kruszewski > > If you look at > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260 > you'll notice that the SparkSession used for execution is the one that was > captured from the logical plan. Whereas in other places you have > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154 > and SparkPlan captures the active session upon execution in > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52 > From my understanding of the code, it looks like we should be using the > SparkSession that is currently active, hence take the one from the SparkPlan. > However, if you want to share Datasets across SparkSessions, that is not > enough, since as soon as the Dataset is executed its QueryExecution will have > captured the SparkSession at that point. If we want to share Datasets across > users, we need to make configurations not fixed upon first execution. I > consider the 1st part (using the SparkSession from the logical plan) a bug, while the > second (using the SparkSession active at runtime) an enhancement so that sharing > across sessions is made easier. > For example: > {code} > val df = spark.read.parquet(...) 
> df.count() > val newSession = spark.newSession() > SparkSession.setActiveSession(newSession) > // (simplest one to try is disabling > vectorized reads) > val df2 = Dataset.ofRows(newSession, df.logicalPlan) // logical plan still > holds a reference to the original SparkSession and changes don't take effect > {code} > I suggest that it shouldn't be necessary to create a new Dataset for changes > to take effect. For most plans, doing Dataset.ofRows works, but this is > not the case for HadoopFsRelation.
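The capture-at-creation vs. resolve-at-execution distinction the reporter describes can be sketched outside Spark. The classes below are illustrative stand-ins, not Spark's API; they only show why the quoted example's `df2` ignores the new session's configuration.

```python
# Sketch of the two binding strategies. Session.active plays the role of
# SparkSession.getActiveSession; the plan classes are hypothetical.

class Session:
    active = None
    def __init__(self, conf):
        self.conf = dict(conf)

class CapturedPlan:
    """Captures the session when the plan is built (the reported behavior)."""
    def __init__(self):
        self.session = Session.active
    def conf(self):
        return self.session.conf

class LatePlan:
    """Resolves the active session at execution time (the proposed behavior)."""
    def conf(self):
        return Session.active.conf

Session.active = Session({"vectorized": True})
captured, late = CapturedPlan(), LatePlan()

# Switch the active session, as in the setActiveSession example above:
Session.active = Session({"vectorized": False})
print(captured.conf()["vectorized"])  # True  -- stale; the change has no effect
print(late.conf()["vectorized"])      # False -- picks up the new session
```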
[jira] [Commented] (SPARK-20687) mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix
[ https://issues.apache.org/jira/browse/SPARK-20687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004975#comment-16004975 ] Ignacio Bermudez Corrales commented on SPARK-20687: --- When you do operations like addition or subtraction between two mllib.distributed.BlockMatrices whose blocks are sparse matrices, the blocks are operated on using Breeze and then converted back to Matrices again. Sometimes this conversion back crashes, even though the resulting matrix is valid, because the Matrices.fromBreeze method doesn't correctly extract the data held in the Breeze CSC matrix. Unfortunately, I'm not able to show code with block matrices, but I can show a backtrace. I manually debugged the crashes and found the culprit, which is why I posted a much simpler snippet in the description that reproduces the error. The snippet that causes the crash is in BlockMatrix, lines 374-379: {code:title=BlockMatrix.scala (blockMap)} } else if (b.isEmpty) { new MatrixBlock((blockRowIndex, blockColIndex), a.head) } else { val result = binMap(a.head.asBreeze, b.head.asBreeze) new MatrixBlock((blockRowIndex, blockColIndex), Matrices.fromBreeze(result)) // <-- not able to get results } {code} The trace after the operation between two Spark block matrices: {code:text} Job aborted due to stage failure: Task 0 in stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 (TID 34, localhost, executor driver): java.lang.IllegalArgumentException: requirement failed: The last value of colPtrs must equal the number of elements. 
values.length: 28, colPtrs.last: 15 at scala.Predef$.require(Predef.scala:224) at org.apache.spark.mllib.linalg.SparseMatrix.(Matrices.scala:590) at org.apache.spark.mllib.linalg.SparseMatrix.(Matrices.scala:618) at org.apache.spark.mllib.linalg.Matrices$.fromBreeze(Matrices.scala:995) at org.apache.spark.mllib.linalg.distributed.BlockMatrix$$anonfun$10.apply(BlockMatrix.scala:378) at org.apache.spark.mllib.linalg.distributed.BlockMatrix$$anonfun$10.apply(BlockMatrix.scala:365) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) at scala.collection.TraversableOnce$class.fold(TraversableOnce.scala:212) at scala.collection.AbstractIterator.fold(Iterator.scala:1336) at org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$20.apply(RDD.scala:1087) at org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$20.apply(RDD.scala:1087) at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2119) at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2119) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} > mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix > > > Key: SPARK-20687 > URL: https://issues.apache.org/jira/browse/SPARK-20687 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.1.1 >Reporter: Ignacio Bermudez 
Corrales >Priority: Minor > > Conversion of Breeze sparse matrices to Matrix is broken when matrices are > the product of certain operations. I think this problem is caused by the update > method in Breeze's CSCMatrix, which adds provisional zeros to the data for > efficiency. > This bug is serious and may affect at least BlockMatrix addition and > subtraction: > http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add/43883458#43883458 > The following code reproduces the bug (check test("breeze conversion bug")): > https://github.com/ghoto/spark/blob/test-bug/CSCMatrixBreeze/mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala > {code:title=MatricesSuite.scala|borderStyle=solid} > test("breeze conversion bug") { > // (2, 0, 0) > // (2, 0, 0) > val mat1Brz = Matrices.sparse(2, 3, Array(0, 2, 2, 2), Array(0, 1), > Arra
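Assuming the reporter's diagnosis is right — after in-place updates, Breeze's CSCMatrix can keep over-allocated backing arrays, so the number of stored entries claimed by colPtrs.last is smaller than the length of the data array — the conversion can be repaired by trimming the arrays to colPtrs.last before constructing the Spark SparseMatrix. Below is a sketch of that idea with plain Python lists; the real fix is in Scala, and `from_csc` is a hypothetical stand-in for Matrices.fromBreeze, not its actual code.

```python
# Sketch of trimming an over-allocated CSC representation so that the
# invariant checked by SparseMatrix ("the last value of colPtrs must equal
# the number of elements") holds again.

def from_csc(num_rows, num_cols, col_ptrs, row_indices, data):
    active = col_ptrs[-1]                 # entries actually in use
    row_indices = row_indices[:active]    # drop the over-allocated tail
    data = data[:active]
    # This is the requirement whose failure appears in the trace above:
    assert col_ptrs[-1] == len(data)
    return (num_rows, num_cols, col_ptrs, row_indices, data)

# colPtrs says 2 entries are live, but the backing arrays hold 4 slots
# (two provisional zeros at the end), mimicking a post-subtraction CSCMatrix:
m = from_csc(2, 3, [0, 2, 2, 2], [0, 1, 0, 0], [2.0, 2.0, 0.0, 0.0])
print(len(m[4]))  # 2
```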
[jira] [Resolved] (SPARK-20689) python doctest leaking bucketed table
[ https://issues.apache.org/jira/browse/SPARK-20689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20689. - Resolution: Fixed Fix Version/s: 2.3.0 > python doctest leaking bucketed table > - > > Key: SPARK-20689 > URL: https://issues.apache.org/jira/browse/SPARK-20689 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: Felix Cheung > Fix For: 2.3.0 > > > When trying to address build test failure in SPARK-20661 we discovered some > tables are unexpectedly left behind causing R tests to fail. While we changed > the R tests to be more resilient, we investigated further to see what was > creating those tables. > It turns out pyspark doctest is calling saveAsTable without ever dropping > them. Since we have separate python tests for bucketed table, and we don't > check for result in doctest, there is really no need to run the doctest -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20680) Spark-sql do not support for void column datatype of view
[ https://issues.apache.org/jira/browse/SPARK-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004911#comment-16004911 ] Herman van Hovell commented on SPARK-20680: --- [~jiangxb] Do you have time to work on this? > Spark-sql do not support for void column datatype of view > - > > Key: SPARK-20680 > URL: https://issues.apache.org/jira/browse/SPARK-20680 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.1.1 >Reporter: Lantao Jin > > Create a HIVE view: > {quote} > hive> create table bad as select 1 x, null z from dual; > {quote} > Because there's no type, Hive gives it the VOID type: > {quote} > hive> describe bad; > OK > x int > z void > {quote} > In Spark2.0.x, the behaviour to read this view is normal: > {quote} > spark-sql> describe bad; > x int NULL > z voidNULL > Time taken: 4.431 seconds, Fetched 2 row(s) > {quote} > But in Spark2.1.x, it failed with SparkException: Cannot recognize hive type > string: void > {quote} > spark-sql> describe bad; > 17/05/09 03:12:08 INFO execution.SparkSqlParser: Parsing command: describe bad > 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: int > 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: void > 17/05/09 03:12:08 ERROR thriftserver.SparkSQLDriver: Failed in [describe bad] > org.apache.spark.SparkException: Cannot recognize hive type string: void > at > org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365) > > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365) > > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361) > Caused by: org.apache.spark.sql.catalyst.parser.ParseException: > DataType void() is not supported.(line 1, pos 0) > == SQL == > void > ^^^ > ... 61 more > org.apache.spark.SparkException: Cannot recognize hive type string: void > {quote} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
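A hedged sketch of the direction a fix could take: special-case the Hive type string "void" (Hive's type for untyped NULL columns) before handing it to the DataType parser, which rejects it. The names `HiveVoid` and `parseHiveType` are illustrative only, not Spark APIs; in Spark the fallthrough would call CatalystSqlParser.

```scala
// Illustrative type-string handling: "void" maps to a null-type marker
// (Spark's NullType would be the real target); everything else is parsed.
sealed trait ColType
case object HiveVoid extends ColType
final case class Parsed(name: String) extends ColType

def parseHiveType(s: String): ColType = s.trim.toLowerCase match {
  case "void" => HiveVoid          // avoid the "Cannot recognize hive type" path
  case other  => Parsed(other)     // stand-in for the real DataType parser
}

println(parseHiveType("void"))  // HiveVoid
println(parseHiveType("int"))   // Parsed(int)
```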
[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004778#comment-16004778 ] Yuechen Chen commented on SPARK-20608: -- I know what you mean, and that's exactly right. But since Spark provides the "yarn.spark.access.namenodes" config, Spark may recommend two ways to support saving data to remote HDFS: 1) as you said, by configuring the remote namespace mapping in hdfs-site.xml and just submitting to Spark without any SparkConf (may be partly recommended for HA); 2) by configuring yarn.spark.access.namenodes=remotehdfs (may not support HA well). For the second way, if standby namenodes are allowed to be included in yarn.spark.access.namenodes, this is an easier way to achieve HA, even though the Spark app may still fail if the namenode fails over during a job that saves to remote HDFS. > Standby namenodes should be allowed to included in > yarn.spark.access.namenodes to support HDFS HA > - > > Key: SPARK-20608 > URL: https://issues.apache.org/jira/browse/SPARK-20608 > Project: Spark > Issue Type: Improvement > Components: Spark Submit, YARN >Affects Versions: 2.0.1, 2.1.0 >Reporter: Yuechen Chen >Priority: Minor > Original Estimate: 672h > Remaining Estimate: 672h > > If one Spark Application needs to access remote namenodes, > yarn.spark.access.namenodes should only be configured in spark-submit > scripts, and the Spark Client (on YARN) would fetch HDFS credentials periodically. > If a hadoop cluster is configured for HA, there would be one active namenode > and at least one standby namenode. > However, if yarn.spark.access.namenodes includes both active and standby > namenodes, the Spark Application will fail because the standby > namenode cannot be accessed by Spark, due to org.apache.hadoop.ipc.StandbyException. > I think it won't cause any bad effect to configure standby namenodes in > yarn.spark.access.namenodes, and my Spark Application would be able to sustain > the failover of a Hadoop namenode. 
> HA Examples: > Spark-submit script: > yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02 > Spark Application Codes: > dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)
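The getActiveNameNode(...) call in the example above can be sketched as a try-in-order selection. This is plain Scala with a predicate standing in for the real health check (which in HDFS would mean catching org.apache.hadoop.ipc.StandbyException on an RPC); `firstActive` and `probe` are hypothetical names, not Spark or Hadoop APIs.

```scala
// Return the first namenode in the configured list that the probe reports
// as active; None if all are standby/unreachable.
def firstActive(namenodes: Seq[String], probe: String => Boolean): Option[String] =
  namenodes.find(probe)

val nns = Seq("hdfs://namenode01", "hdfs://namenode02")

// Pretend only namenode02 is currently active:
println(firstActive(nns, _.endsWith("02")))  // Some(hdfs://namenode02)
```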
[jira] [Resolved] (SPARK-20696) tf-idf document clustering with K-means in Apache Spark putting points into one cluster
[ https://issues.apache.org/jira/browse/SPARK-20696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20696. --- Resolution: Invalid This isn't a good place to ask, as it's almost surely a question about your data, not a problem in Spark. user@spark or stackoverflow is better. > tf-idf document clustering with K-means in Apache Spark putting points into > one cluster > --- > > Key: SPARK-20696 > URL: https://issues.apache.org/jira/browse/SPARK-20696 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Nassir > > I am trying to do the classic job of clustering text documents by > pre-processing, generating tf-idf matrix, and then applying K-means. However, > testing this workflow on the classic 20NewsGroup dataset results in most > documents being clustered into one cluster. (I have initially tried to > cluster all documents from 6 of the 20 groups - so expecting to cluster into > 6 clusters). > I am implementing this in Apache Spark as my purpose is to utilise this > technique on millions of documents. Here is the code written in Pyspark on > Databricks: > #declare path to folder containing 6 of 20 news group categories > path = "/mnt/%s/20news-bydate.tar/20new-bydate-train-lessFolders/*/*" % > MOUNT_NAME > #read all the text files from the 6 folders. Each entity is an entire > document. 
> text_files = sc.wholeTextFiles(path).cache() > #convert rdd to dataframe > df = text_files.toDF(["filePath", "document"]).cache() > from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer > #tokenize the document text > tokenizer = Tokenizer(inputCol="document", outputCol="tokens") > tokenized = tokenizer.transform(df).cache() > from pyspark.ml.feature import StopWordsRemover > remover = StopWordsRemover(inputCol="tokens", > outputCol="stopWordsRemovedTokens") > stopWordsRemoved_df = remover.transform(tokenized).cache() > hashingTF = HashingTF (inputCol="stopWordsRemovedTokens", > outputCol="rawFeatures", numFeatures=20) > tfVectors = hashingTF.transform(stopWordsRemoved_df).cache() > idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5) > idfModel = idf.fit(tfVectors) > tfIdfVectors = idfModel.transform(tfVectors).cache() > #note that I have also tried to use normalized data, but get the same result > from pyspark.ml.feature import Normalizer > from pyspark.ml.linalg import Vectors > normalizer = Normalizer(inputCol="features", outputCol="normFeatures") > l2NormData = normalizer.transform(tfIdfVectors) > from pyspark.ml.clustering import KMeans > # Trains a KMeans model. > kmeans = KMeans().setK(6).setMaxIter(20) > km_model = kmeans.fit(l2NormData) > clustersTable = km_model.transform(l2NormData) > [output showing most documents get clustered into cluster 0][1] > ID number_of_documents_in_cluster > 0 3024 > 3 5 > 1 3 > 5 2 > 2 2 > 4 1 > As you can see most of my data points get clustered into cluster 0, and I > cannot figure out what I am doing wrong as all the tutorials and code I have > come across online point to using this method. > In addition I have also tried normalizing the tf-idf matrix before K-means > but that also produces the same result. I know cosine distance is a better > measure to use, but I expected using standard K-means in Apache Spark would > provide meaningful results. 
> Can anyone help with regards to whether I have a bug in my code, or if > something is missing in my data clustering pipeline? > (Question also asked in Stackoverflow before: > http://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one) > Thank you in advance! 
[jira] [Created] (SPARK-20696) tf-idf document clustering with K-means in Apache Spark putting points into one cluster
Nassir created SPARK-20696: -- Summary: tf-idf document clustering with K-means in Apache Spark putting points into one cluster Key: SPARK-20696 URL: https://issues.apache.org/jira/browse/SPARK-20696 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.1.0 Reporter: Nassir I am trying to do the classic job of clustering text documents by pre-processing, generating tf-idf matrix, and then applying K-means. However, testing this workflow on the classic 20NewsGroup dataset results in most documents being clustered into one cluster. (I have initially tried to cluster all documents from 6 of the 20 groups - so expecting to cluster into 6 clusters). I am implementing this in Apache Spark as my purpose is to utilise this technique on millions of documents. Here is the code written in Pyspark on Databricks: #declare path to folder containing 6 of 20 news group categories path = "/mnt/%s/20news-bydate.tar/20new-bydate-train-lessFolders/*/*" % MOUNT_NAME #read all the text files from the 6 folders. Each entity is an entire document. 
text_files = sc.wholeTextFiles(path).cache() #convert rdd to dataframe df = text_files.toDF(["filePath", "document"]).cache() from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer #tokenize the document text tokenizer = Tokenizer(inputCol="document", outputCol="tokens") tokenized = tokenizer.transform(df).cache() from pyspark.ml.feature import StopWordsRemover remover = StopWordsRemover(inputCol="tokens", outputCol="stopWordsRemovedTokens") stopWordsRemoved_df = remover.transform(tokenized).cache() hashingTF = HashingTF (inputCol="stopWordsRemovedTokens", outputCol="rawFeatures", numFeatures=20) tfVectors = hashingTF.transform(stopWordsRemoved_df).cache() idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5) idfModel = idf.fit(tfVectors) tfIdfVectors = idfModel.transform(tfVectors).cache() #note that I have also tried to use normalized data, but get the same result from pyspark.ml.feature import Normalizer from pyspark.ml.linalg import Vectors normalizer = Normalizer(inputCol="features", outputCol="normFeatures") l2NormData = normalizer.transform(tfIdfVectors) from pyspark.ml.clustering import KMeans # Trains a KMeans model. kmeans = KMeans().setK(6).setMaxIter(20) km_model = kmeans.fit(l2NormData) clustersTable = km_model.transform(l2NormData) [output showing most documents get clustered into cluster 0][1] ID number_of_documents_in_cluster 0 3024 3 5 1 3 5 2 2 2 4 1 As you can see most of my data points get clustered into cluster 0, and I cannot figure out what I am doing wrong as all the tutorials and code I have come across online point to using this method. In addition I have also tried normalizing the tf-idf matrix before K-means but that also produces the same result. I know cosine distance is a better measure to use, but I expected using standard K-means in Apache Spark would provide meaningful results. 
Can anyone help with regards to whether I have a bug in my code, or if something is missing in my data clustering pipeline? (Question also asked in Stackoverflow before: http://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one) Thank you in advance! 
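One plausible culprit in the pipeline above is HashingTF(numFeatures=20): every token is hashed into only 20 buckets, so thousands of vocabulary terms collide and all documents end up with near-identical feature vectors, which would push K-means toward a single cluster. A stand-alone illustration of the collision effect in plain Scala (no Spark; the hashing here is an ordinary hashCode, not Spark's, and 2^18 is used because that is, to my knowledge, HashingTF's default):

```scala
// Hash a synthetic 5000-term vocabulary into n buckets and count how many
// distinct buckets actually get used.
val vocab = (1 to 5000).map(i => s"word$i")

def bucket(term: String, numFeatures: Int): Int =
  ((term.hashCode % numFeatures) + numFeatures) % numFeatures  // non-negative index

val used20     = vocab.map(bucket(_, 20)).distinct.size
val usedLarge  = vocab.map(bucket(_, 1 << 18)).distinct.size

println(s"20 buckets: $used20 distinct buckets used")       // at most 20
println(s"2^18 buckets: $usedLarge distinct buckets used")  // far more; few collisions
```

With 20 buckets the tf vectors can only live in a 20-dimensional space regardless of vocabulary size, so raising numFeatures is the first thing to try.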
[jira] [Commented] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)
[ https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004733#comment-16004733 ] Nick Hryhoriev commented on SPARK-5594: --- Hi, I have the same issue, but in Spark 2.1, and I can't remove the spark.cleaner.ttl configuration because of SPARK-7689; it's already removed. What's strange is that the issue appeared after three weeks of operation, and it reproduces even after restarting the job. Env: YARN - EMR 5.3, Spark 2.1. Checkpoint used. Stack trace {quote} 2017-05-10 13:50:55 ERROR TaskSetManager:70 - Task 1 in stage 2.0 failed 4 times; aborting job 2017-05-10 13:50:55 ERROR JobScheduler:91 - Error running job streaming job 149442305 ms.2 org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage 2.0 (TID 46, ip-10-191-116-244.eu-west-1.compute.internal, executor 1): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_141_piece0 of broadcast_141 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1276) at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206) at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) at com.playtech.bit.rtv.Converter$.toEventDetailsRow(HitsUpdater.scala:183) at com.playtech.bit.rtv.HitsUpdater$$anonfun$saveEventDetails$4$$anonfun$11.apply(HitsUpdater.scala:138) at com.playtech.bit.rtv.HitsUpdater$$anonfun$saveEventDetails$4$$anonfun$11.apply(HitsUpdater.scala:137) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at com.datastax.spark.connector.util.CountingIterator.next(CountingIterator.scala:16) at 
com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:106) at com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:31) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at com.datastax.spark.connector.writer.GroupingBatchBuilder.foreach(GroupingBatchBuilder.scala:31) at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:198) at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:175) at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:112) at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111) at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:145) at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:111) at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:175) at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:162) at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:149) at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36) at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.spark.SparkException: Failed to get broadcast_141_piece0 of broadcast_141 at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:178) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:150) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBloc
[jira] [Updated] (SPARK-20622) Parquet partition discovery for non key=value named directories
[ https://issues.apache.org/jira/browse/SPARK-20622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Noam Asor updated SPARK-20622: -- Description: h4. Why There are cases where traditional M/R jobs and RDD based Spark jobs writes out partitioned parquet in 'value only' named directories i.e. {{hdfs:///some/base/path/2017/05/06}} and not in 'key=value' named directories i.e. {{hdfs:///some/base/path/year=2017/month=05/day=06}} which prevents users from leveraging Spark SQL parquet partition discovery when reading the former back. h4. What This issue is a proposal for a solution which will allow Spark SQL to discover parquet partitions for 'value only' named directories. h4. How By introducing a new Spark SQL read option *partitionTemplate*. *partitionTemplate* is in a Path form and it should include base path followed by the missing 'key=' as a template for transforming 'value only' named dirs to 'key=value' named dirs. In the example above this will look like: {{hdfs:///some/base/path/year=/month=/day=/}}. To simplify the solution this option should be tied with *basePath* option, meaning that *partitionTemplate* option is valid only if *basePath* is set also. In the end for the above scenario, this will look something like: {code} spark.read .option("basePath", "hdfs:///some/base/path") .option("partitionTemplate", "hdfs:///some/base/path/year=/month=/day=/") .parquet(...) {code} which will allow Spark SQL to do parquet partition discovery on the following directory tree: {code} some |--base |--path |--2016 |--... |--2017 |--01 |--02 |--... |--15 |--... |--... {code} adding to the schema of the resulted DataFrame the columns year, month, day and their respective values as expected. was: h4. Why There are cases where traditional M/R jobs and RDD based Spark jobs writes out partitioned parquet in 'value only' named directories i.e. {{hdfs:///some/base/path/2017/05/06}} and not in 'key=value' named directories i.e. 
{{hdfs:///some/base/path/year=2017/month=05/day=06}} which prevents users from leveraging Spark SQL parquet partition discovery when reading the former back. h4. What This issue is a proposal for a solution which will allow Spark SQL to discover parquet partitions for 'value only' named directories. h4. how By introducing a new Spark SQL read option *partitionTemplate*. *partitionTemplate* is in a Path form and it should include base path followed by the missing 'key=' as a template for transforming 'value only' named dirs to 'key=value' named dirs. In the example above this will look like: {{hdfs:///some/base/path/year=/month=/day=/}}. To simplify the solution this option should be tied with *basePath* option, meaning that *partitionTemplate* option is valid only if *basePath* is set also. In the end for the above scenario, this will look something like: {code} spark.read .option("basePath", "hdfs:///some/base/path") .option("basePath", "hdfs:///some/base/path/year=/month=/day=/") .parquet(...) {code} which will allow Spark SQL to do parquet partition discovery on the following directory tree: {code} some |--base |--path |--2016 |--... |--2017 |--01 |--02 |--... |--15 |--... |--... {code} adding to the schema of the resulted DataFrame the columns year, month, day and their respective values as expected. > Parquet partition discovery for non key=value named directories > --- > > Key: SPARK-20622 > URL: https://issues.apache.org/jira/browse/SPARK-20622 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Noam Asor > > h4. Why > There are cases where traditional M/R jobs and RDD based Spark jobs writes > out partitioned parquet in 'value only' named directories i.e. > {{hdfs:///some/base/path/2017/05/06}} and not in 'key=value' named > directories i.e. {{hdfs:///some/base/path/year=2017/month=05/day=06}} which > prevents users from leveraging Spark SQL parquet partition discovery when > reading the former back. > h4. 
What > This issue is a proposal for a solution which will allow Spark SQL to > discover parquet partitions for 'value only' named directories. > h4. How > By introducing a new Spark SQL read option *partitionTemplate*. > *partitionTemplate* is in a Path form and it should include base path > followed by the missing 'key=' as a template for transforming 'value only' > named dirs to 'key=value' named dirs. In the example above this will look > lik
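The transformation the *partitionTemplate* proposal describes can be sketched as zipping the key names extracted from the template with the value-only path segments. This is illustrative code only: the option does not exist in Spark, and `partitionColumns` is a hypothetical helper name.

```scala
// Recover key=value partition columns from a value-only directory path,
// given a basePath and a template like ".../year=/month=/day=/".
def partitionColumns(basePath: String, template: String, path: String): List[(String, String)] = {
  val keys = template.stripPrefix(basePath).split("/")
    .filter(_.nonEmpty).map(_.stripSuffix("="))        // year, month, day
  val values = path.stripPrefix(basePath).split("/")
    .filter(_.nonEmpty)                                // 2017, 05, 06
  keys.zip(values).toList
}

val cols = partitionColumns(
  "hdfs:///some/base/path",
  "hdfs:///some/base/path/year=/month=/day=/",
  "hdfs:///some/base/path/2017/05/06")

println(cols)  // List((year,2017), (month,05), (day,06))
```

These pairs are exactly the columns (year, month, day) the proposal says should be added to the resulting DataFrame's schema.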
[jira] [Updated] (SPARK-20695) Running multiple TCP socket streams in Spark Shell causes driver error
[ https://issues.apache.org/jira/browse/SPARK-20695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Mead updated SPARK-20695: --- Do you guys ever read the issues? This is a simple Spark shell script (dse -u -p yyy spark) that works perfectly fine if I only have a single socket text stream BUT fails immediately as soon as I introduce a second socket text stream, even if I never reference it. As for registering classes, I have no idea what class 13994 is! Not very helpful! Original message From: "Sean Owen (JIRA)" Date: 10/05/2017 15:40 (GMT+01:00) To: pjm...@blueyonder.co.uk Subject: [jira] [Resolved] (SPARK-20695) Running multiple TCP socket streams in Spark Shell causes driver error [ https://issues.apache.org/jira/browse/SPARK-20695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20695. --- Resolution: Invalid I don't believe that's anything to do with TCP; you are enabling Kryo registration but didn't register some class you are serializing. This is a question about debugging your app and shouldn't be a Spark JIRA. You need to read http://spark.apache.org/contributing.html too; you would never set Blocker for example. -- This message was sent by Atlassian JIRA (v6.3.15#6346) > Running multiple TCP socket streams in Spark Shell causes driver error > -- > > Key: SPARK-20695 > URL: https://issues.apache.org/jira/browse/SPARK-20695 > Project: Spark > Issue Type: Bug > Components: DStreams, Spark Core, Spark Shell, Structured Streaming >Affects Versions: 2.0.2 > Environment: DataStax DSE apache 3 node cassandra running with > analytics on RHEL 7.3 on Hyper-V windows 10 laptop. >Reporter: Peter Mead >Priority: Blocker > > Whenever I include a second socket stream (lines02) the script errors if I am > not trying to process data. If I remove the lines02 the script runs fine!! 
> script: > val s_server01="192.168.1.10" > val s_port01 = 8801 > val s_port02 = 8802 > import org.apache.spark.streaming._, > org.apache.spark.streaming.StreamingContext._ > import scala.util.Random > import org.apache.spark._ > import org.apache.spark.storage._ > import org.apache.spark.streaming.receiver._ > import java.util.Date; > import java.text.SimpleDateFormat; > import java.util.Calendar; > import sys.process._ > import org.apache.spark.streaming.dstream.ConstantInputDStream > sc.setLogLevel("ERROR") > val df2 = new SimpleDateFormat("dd-MM- HH:mm:ss") > var processed:Long = 0 > var pdate="" > case class t_row (card_number: String, event_date: Int, event_time: Int, > processed: Long, transport_type: String, card_credit: java.lang.Float, > transport_location: String, journey_type: Int, journey_value: > java.lang.Float) > var type2tot = 0 > var type5tot = 0 > var numb=0 > var total_secs:Double = 0 > val red= "\033[0;31m" > val green = "\033[0;32m" > val cyan = "\033[0;36m" > val yellow = "\033[0;33m" > val nocolour = "\033[0;0m" > var color = "" > val t_int = 5 > var init = 0 > var tot_cnt:Long = 0 > val ssc = new StreamingContext(sc, Seconds(t_int)) > val lines01 = ssc.socketTextStream(s_server01, s_port01) > val lines02 = ssc.socketTextStream(s_server01, s_port02) > // val lines = lines01.union(lines02) > val line01 = lines01.foreachRDD( rdd => { > println("\nline 01") > if (init == 0) {"clear".!;init = 1} > val xx=rdd.map(bb => (bb.substring(0,bb.length).split(","))) > val processed = System.currentTimeMillis > val bb = xx.map(line => t_row(line(0), line(1).toInt, line(2).toInt, > System.currentTimeMillis, line(3), line(4).toFloat, line(5), > line(6).toInt, line(7).toFloat )) > val cnt:Long = bb.count > bb.saveToCassandra("transport", "card_data_input") > }) > //val line02 = lines02.foreachRDD( rdd => { > //println("line 02") > //if (init == 0) {"clear".!;init = 1} > //val xx=rdd.map(bb => (bb.substring(0,bb.length).split(","))) > 
//xx.collect.foreach(println) > //val processed = System.currentTimeMillis > //val bb = xx.map(line => t_row(line(0), line(1).toInt, line(2).toInt, > System.currentTimeMillis, line(3), line(4).toFloat, line(5), > //line(6).toInt, line(7).toFloat )) > //val cnt:Long = bb.count > //bb.saveToCassandra("transport", "card_data_input") > //}) > ERROR: > software.kryo.KryoException: Encountered unregistered class ID: 13994 > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org
[jira] [Resolved] (SPARK-20695) Running multiple TCP socket streams in Spark Shell causes driver error
[ https://issues.apache.org/jira/browse/SPARK-20695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20695. --- Resolution: Invalid I don't believe that's anything to do with TCP; you are enabling Kryo registration but didn't register some class you are serializing. This is a question about debugging your app and shouldn't be a Spark JIRA. You need to read http://spark.apache.org/contributing.html too; you would never set Blocker for example. > Running multiple TCP socket streams in Spark Shell causes driver error > -- > > Key: SPARK-20695 > URL: https://issues.apache.org/jira/browse/SPARK-20695 > Project: Spark > Issue Type: Bug > Components: DStreams, Spark Core, Spark Shell, Structured Streaming >Affects Versions: 2.0.2 > Environment: DataStax DSE apache 3 node cassandra running with > analytics on RHEL 7.3 on Hyper-V windows 10 laptop. >Reporter: Peter Mead >Priority: Blocker > > Whenever I include a second socket stream (lines02) the script errors if I am > not trying to process data. If I remove the lines02 the script runs fine!! 
> script: > val s_server01="192.168.1.10" > val s_port01 = 8801 > val s_port02 = 8802 > import org.apache.spark.streaming._, > org.apache.spark.streaming.StreamingContext._ > import scala.util.Random > import org.apache.spark._ > import org.apache.spark.storage._ > import org.apache.spark.streaming.receiver._ > import java.util.Date; > import java.text.SimpleDateFormat; > import java.util.Calendar; > import sys.process._ > import org.apache.spark.streaming.dstream.ConstantInputDStream > sc.setLogLevel("ERROR") > val df2 = new SimpleDateFormat("dd-MM- HH:mm:ss") > var processed:Long = 0 > var pdate="" > case class t_row (card_number: String, event_date: Int, event_time: Int, > processed: Long, transport_type: String, card_credit: java.lang.Float, > transport_location: String, journey_type: Int, journey_value: > java.lang.Float) > var type2tot = 0 > var type5tot = 0 > var numb=0 > var total_secs:Double = 0 > val red= "\033[0;31m" > val green = "\033[0;32m" > val cyan = "\033[0;36m" > val yellow = "\033[0;33m" > val nocolour = "\033[0;0m" > var color = "" > val t_int = 5 > var init = 0 > var tot_cnt:Long = 0 > val ssc = new StreamingContext(sc, Seconds(t_int)) > val lines01 = ssc.socketTextStream(s_server01, s_port01) > val lines02 = ssc.socketTextStream(s_server01, s_port02) > // val lines = lines01.union(lines02) > val line01 = lines01.foreachRDD( rdd => { > println("\nline 01") > if (init == 0) {"clear".!;init = 1} > val xx=rdd.map(bb => (bb.substring(0,bb.length).split(","))) > val processed = System.currentTimeMillis > val bb = xx.map(line => t_row(line(0), line(1).toInt, line(2).toInt, > System.currentTimeMillis, line(3), line(4).toFloat, line(5), > line(6).toInt, line(7).toFloat )) > val cnt:Long = bb.count > bb.saveToCassandra("transport", "card_data_input") > }) > //val line02 = lines02.foreachRDD( rdd => { > //println("line 02") > //if (init == 0) {"clear".!;init = 1} > //val xx=rdd.map(bb => (bb.substring(0,bb.length).split(","))) > 
//xx.collect.foreach(println) > //val processed = System.currentTimeMillis > //val bb = xx.map(line => t_row(line(0), line(1).toInt, line(2).toInt, > System.currentTimeMillis, line(3), line(4).toFloat, line(5), > //line(6).toInt, line(7).toFloat )) > //val cnt:Long = bb.count > //bb.saveToCassandra("transport", "card_data_input") > //}) > ERROR: > software.kryo.KryoException: Encountered unregistered class ID: 13994 > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)...etc -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
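For completeness, a hedged sketch of the direction Sean's resolution points at: if Kryo registration is required, the case class that crosses the wire has to be registered. This is a configuration fragment, not runnable on its own (it needs Spark on the classpath); registerKryoClasses is a real SparkConf method, and t_row is the case class from the script above.

```scala
// Register the user-defined class so Kryo does not hit an unregistered
// class ID when deserializing it on the executors.
val conf = new org.apache.spark.SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[t_row]))
```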
[jira] [Closed] (SPARK-20695) Running multiple TCP socket streams in Spark Shell causes driver error
[ https://issues.apache.org/jira/browse/SPARK-20695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen closed SPARK-20695. - > Running multiple TCP socket streams in Spark Shell causes driver error > -- > > Key: SPARK-20695 > URL: https://issues.apache.org/jira/browse/SPARK-20695 > Project: Spark > Issue Type: Bug > Components: DStreams, Spark Core, Spark Shell, Structured Streaming >Affects Versions: 2.0.2 > Environment: DataStax DSE apache 3 node cassandra running with > analytics on RHEL 7.3 on Hyper-V windows 10 laptop. >Reporter: Peter Mead >Priority: Blocker > > Whenever I include a second socket stream (lines02) the script errors if I am > not trying to process data. If I remove the lines02 the script runs fine!! > script: > val s_server01="192.168.1.10" > val s_port01 = 8801 > val s_port02 = 8802 > import org.apache.spark.streaming._, > org.apache.spark.streaming.StreamingContext._ > import scala.util.Random > import org.apache.spark._ > import org.apache.spark.storage._ > import org.apache.spark.streaming.receiver._ > import java.util.Date; > import java.text.SimpleDateFormat; > import java.util.Calendar; > import sys.process._ > import org.apache.spark.streaming.dstream.ConstantInputDStream > sc.setLogLevel("ERROR") > val df2 = new SimpleDateFormat("dd-MM- HH:mm:ss") > var processed:Long = 0 > var pdate="" > case class t_row (card_number: String, event_date: Int, event_time: Int, > processed: Long, transport_type: String, card_credit: java.lang.Float, > transport_location: String, journey_type: Int, journey_value: > java.lang.Float) > var type2tot = 0 > var type5tot = 0 > var numb=0 > var total_secs:Double = 0 > val red= "\033[0;31m" > val green = "\033[0;32m" > val cyan = "\033[0;36m" > val yellow = "\033[0;33m" > val nocolour = "\033[0;0m" > var color = "" > val t_int = 5 > var init = 0 > var tot_cnt:Long = 0 > val ssc = new StreamingContext(sc, Seconds(t_int)) > val lines01 = ssc.socketTextStream(s_server01, s_port01) > val 
lines02 = ssc.socketTextStream(s_server01, s_port02) > // val lines = lines01.union(lines02) > val line01 = lines01.foreachRDD( rdd => { > println("\nline 01") > if (init == 0) {"clear".!;init = 1} > val xx=rdd.map(bb => (bb.substring(0,bb.length).split(","))) > val processed = System.currentTimeMillis > val bb = xx.map(line => t_row(line(0), line(1).toInt, line(2).toInt, > System.currentTimeMillis, line(3), line(4).toFloat, line(5), > line(6).toInt, line(7).toFloat )) > val cnt:Long = bb.count > bb.saveToCassandra("transport", "card_data_input") > }) > //val line02 = lines02.foreachRDD( rdd => { > //println("line 02") > //if (init == 0) {"clear".!;init = 1} > //val xx=rdd.map(bb => (bb.substring(0,bb.length).split(","))) > //xx.collect.foreach(println) > //val processed = System.currentTimeMillis > //val bb = xx.map(line => t_row(line(0), line(1).toInt, line(2).toInt, > System.currentTimeMillis, line(3), line(4).toFloat, line(5), > //line(6).toInt, line(7).toFloat )) > //val cnt:Long = bb.count > //bb.saveToCassandra("transport", "card_data_input") > //}) > ERROR: > software.kryo.KryoException: Encountered unregistered class ID: 13994 > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)...etc -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20542) Add an API into Bucketizer that can bin a lot of columns all at once
[ https://issues.apache.org/jira/browse/SPARK-20542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004688#comment-16004688 ] Barry Becker commented on SPARK-20542: -- @viirya, your implementation of MultipleBucketizer relies on a withColumns method on DataFrame. That method is not in 2.1.1 or 2.2. In which release will it be available? > Add an API into Bucketizer that can bin a lot of columns all at once > > > Key: SPARK-20542 > URL: https://issues.apache.org/jira/browse/SPARK-20542 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh > > Current ML's Bucketizer can only bin a column of continuous features. If a > dataset has thousands of continuous columns that need to be binned, we end up > with thousands of ML stages. That is very inefficient for query planning > and execution. > We should have a type of bucketizer that can bin a lot of columns all at > once. It would need to accept a list of arrays of split points corresponding > to the columns to bin, but it might make things more efficient by replacing > thousands of stages with just one. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
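The core operation the issue asks to apply to many columns at once is the one Bucketizer already performs per column: map a value to a bucket index given an array of split points. A minimal sketch of that logic (an illustration only, not Spark ML's actual implementation; `binValue` and `binRow` are hypothetical helper names):

```scala
// Sketch of per-column bucket assignment: splits of length n+1 define n
// buckets [s0, s1), [s1, s2), ..., with the last bucket closed on the right,
// matching Spark ML Bucketizer's documented semantics. Values outside the
// splits are the caller's concern here; Spark's handleInvalid is out of scope.
def binValue(value: Double, splits: Array[Double]): Int = {
  require(splits.length >= 3, "need at least two buckets")
  if (value == splits.last) splits.length - 2 // last split point is inclusive
  else splits.indexWhere(value < _) - 1       // first split strictly greater
}

// Binning many columns "all at once" then reduces to applying one function
// per column in a single pass over each row, instead of chaining thousands
// of single-column ML stages.
def binRow(row: Array[Double], allSplits: Array[Array[Double]]): Array[Int] =
  row.zip(allSplits).map { case (v, s) => binValue(v, s) }
```

A multi-column transformer built this way replaces N planning-time stages with one, which is the efficiency gain the issue describes.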
[jira] [Created] (SPARK-20695) Running multiple TCP socket streams in Spark Shell causes driver error
Peter Mead created SPARK-20695: -- Summary: Running multiple TCP socket streams in Spark Shell causes driver error Key: SPARK-20695 URL: https://issues.apache.org/jira/browse/SPARK-20695 Project: Spark Issue Type: Bug Components: DStreams, Spark Core, Spark Shell, Structured Streaming Affects Versions: 2.0.2 Environment: DataStax DSE apache 3 node cassandra running with analytics on RHEL 7.3 on Hyper-V windows 10 laptop. Reporter: Peter Mead Priority: Blocker Whenever I include a second socket stream (lines02) the script errors if I am not trying to process data. If I remove the lines02 the script runs fine!! script: val s_server01="192.168.1.10" val s_port01 = 8801 val s_port02 = 8802 import org.apache.spark.streaming._, org.apache.spark.streaming.StreamingContext._ import scala.util.Random import org.apache.spark._ import org.apache.spark.storage._ import org.apache.spark.streaming.receiver._ import java.util.Date; import java.text.SimpleDateFormat; import java.util.Calendar; import sys.process._ import org.apache.spark.streaming.dstream.ConstantInputDStream sc.setLogLevel("ERROR") val df2 = new SimpleDateFormat("dd-MM- HH:mm:ss") var processed:Long = 0 var pdate="" case class t_row (card_number: String, event_date: Int, event_time: Int, processed: Long, transport_type: String, card_credit: java.lang.Float, transport_location: String, journey_type: Int, journey_value: java.lang.Float) var type2tot = 0 var type5tot = 0 var numb=0 var total_secs:Double = 0 val red= "\033[0;31m" val green = "\033[0;32m" val cyan = "\033[0;36m" val yellow = "\033[0;33m" val nocolour = "\033[0;0m" var color = "" val t_int = 5 var init = 0 var tot_cnt:Long = 0 val ssc = new StreamingContext(sc, Seconds(t_int)) val lines01 = ssc.socketTextStream(s_server01, s_port01) val lines02 = ssc.socketTextStream(s_server01, s_port02) // val lines = lines01.union(lines02) val line01 = lines01.foreachRDD( rdd => { println("\nline 01") if (init == 0) {"clear".!;init = 1} val xx=rdd.map(bb => 
(bb.substring(0,bb.length).split(","))) val processed = System.currentTimeMillis val bb = xx.map(line => t_row(line(0), line(1).toInt, line(2).toInt, System.currentTimeMillis, line(3), line(4).toFloat, line(5), line(6).toInt, line(7).toFloat )) val cnt:Long = bb.count bb.saveToCassandra("transport", "card_data_input") }) //val line02 = lines02.foreachRDD( rdd => { //println("line 02") //if (init == 0) {"clear".!;init = 1} //val xx=rdd.map(bb => (bb.substring(0,bb.length).split(","))) //xx.collect.foreach(println) //val processed = System.currentTimeMillis //val bb = xx.map(line => t_row(line(0), line(1).toInt, line(2).toInt, System.currentTimeMillis, line(3), line(4).toFloat, line(5), //line(6).toInt, line(7).toFloat )) //val cnt:Long = bb.count //bb.saveToCassandra("transport", "card_data_input") //}) ERROR: software.kryo.KryoException: Encountered unregistered class ID: 13994 at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)...etc -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19447) Fix input metrics for range operator
[ https://issues.apache.org/jira/browse/SPARK-19447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004631#comment-16004631 ] Apache Spark commented on SPARK-19447: -- User 'ala' has created a pull request for this issue: https://github.com/apache/spark/pull/17939 > Fix input metrics for range operator > > > Key: SPARK-19447 > URL: https://issues.apache.org/jira/browse/SPARK-19447 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Reynold Xin >Assignee: Ala Luszczak > Fix For: 2.2.0 > > > Range operator currently does not output any input metrics, and as a result > in the SQL UI the number of rows shown is always 0. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004607#comment-16004607 ] Zoltan Ivanfi commented on SPARK-12297: --- bq. It'd be great to consider this more holistically and think about alternatives in fixing them As Ryan mentioned, the Parquet community discussed this timestamp incompatibility problem with the aim of avoiding similar problems in the future. It was decided that the specification needs to include two separate types with well-defined semantics: one for timezone-agnostic (aka. TIMESTAMP WITHOUT TIMEZONE) and one for UTC-normalized (aka. TIMESTAMP WITH TIMEZONE) timestamps. (Otherwise implementors would be tempted to misuse the single existing type for storing timestamps of different semantics, as already happened with the int96 timestamp type.) While this is a nice and clean long-term solution, a short-term fix is also desired until the new types become widely supported, and to allow dealing with existing data. The commit in question is part of this short-term fix: it allows getting correct values when reading int96 timestamps, even for data written by other components. bq. it completely changes the behavior of one of the most important data types. A very important aspect of this fix is that it does not change SparkSQL's behavior unless the user sets a table property, so it's a completely safe and non-breaking change. bq. One of the fundamental problem is that Spark treats timestamp as timestamp with timezone, whereas impala treats timestamp as timestamp without timezone. The parquet storage is only a small piece here. Indeed, the fix only addresses Parquet timestamps. This, however, is intentional and is neither a limitation nor an inconsistency. The problem is in fact specific to Parquet. For other file formats (for example CSV or Avro), SparkSQL follows timezone-agnostic (TIMESTAMP WITHOUT TIMEZONE) semantics.
So using UTC-normalized (TIMESTAMP WITH TIMEZONE) semantics in Parquet is not only incompatible with Impala but is also inconsistent within SparkSQL itself. bq. Also this is not just a Parquet issue. The same issue could happen to all data formats. It is going to be really confusing to have something that only works for Parquet In fact the current behavior of SparkSQL is different for Parquet than for other formats. The fix allows the user to choose a consistent and less confusing behaviour instead. It also makes Impala, Hive and SparkSQL compatible with each other regarding int96 timestamps. bq. It seems like the purpose of this patch can be accomplished by just setting the session local timezone to UTC? Unfortunately that would not suffice. The problem has to be addressed in all SQL engines. As of today, Hive and Impala already contain the changes that allow interoperability using the parquet.mr.int96.write.zone table property: * Hive: ** https://github.com/apache/hive/commit/84fdc1c7c8ff0922aa44f829dbfa9659935c503e ** https://github.com/apache/hive/commit/a1cbccb8dad1824f978205a1e93ec01e87ed8ed5 ** https://github.com/apache/hive/commit/2dfcea5a95b7d623484b8be50755b817fbc91ce0 ** https://github.com/apache/hive/commit/78e29fc70dacec498c35dc556dd7403e4c9f48fe * Impala: ** https://github.com/apache/incubator-impala/commit/5803a0b0744ddaee6830d4a1bc8dba8d3f2caa26 > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone; they are a "floating time".
> The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { forma
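The quoted reproduction is truncated, but the timezone sensitivity it demonstrates can be sketched without Spark at all. The snippet below is an illustration of the two semantics under discussion, not Spark code: a UTC-normalized instant (TIMESTAMP WITH TIMEZONE semantics) renders differently depending on the reader's zone, whereas a timezone-agnostic "floating time" would preserve the literal wall-clock fields.

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

// One UTC-normalized instant, as an int96 timestamp written by Spark behaved.
val instant = Instant.parse("2016-01-01T00:39:59Z")
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// The same stored value reads back with different wall-clock fields in
// different session timezones:
val asUtc = fmt.withZone(ZoneId.of("UTC")).format(instant)
val asLa  = fmt.withZone(ZoneId.of("America/Los_Angeles")).format(instant)

println(asUtc) // 2016-01-01 00:39:59
println(asLa)  // 2015-12-31 16:39:59 -- same instant, shifted wall clock
```

A timezone-agnostic reader (Impala's interpretation, and SparkSQL's for CSV or Avro) would instead return "2016-01-01 00:39:59" everywhere, which is exactly the mismatch the table property works around.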
[jira] [Assigned] (SPARK-20678) Ndv for columns not in filter condition should also be updated
[ https://issues.apache.org/jira/browse/SPARK-20678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-20678: --- Assignee: Zhenhua Wang > Ndv for columns not in filter condition should also be updated > -- > > Key: SPARK-20678 > URL: https://issues.apache.org/jira/browse/SPARK-20678 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang > Fix For: 2.2.1, 2.3.0 > > > In filter estimation, we update column stats for the columns in the filter > condition. However, if the number of rows decreases after the filter (i.e. > the overall selectivity is less than 1), we need to update (scale down) the > number of distinct values (NDV) for all columns, no matter whether they are > in filter conditions or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
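The scale-down the issue describes can be sketched as a tiny function. This is only an illustration of the idea, not Spark's actual FilterEstimation code; the ceiling and the lower bound of 1 are assumptions of the sketch:

```scala
// Sketch: after a filter with overall selectivity in (0, 1], scale down the
// number of distinct values (NDV) of every column, not only the columns
// referenced by the filter condition. Illustrative only.
def scaleNdv(ndv: BigInt, selectivity: Double): BigInt = {
  require(selectivity > 0.0 && selectivity <= 1.0, "selectivity must be in (0, 1]")
  val scaled = BigInt(math.ceil(ndv.toDouble * selectivity).toLong)
  scaled.max(BigInt(1)) // a column of surviving rows still has >= 1 value
}
```

Applying this uniformly is what keeps estimates for columns outside the filter consistent with the reduced row count.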
[jira] [Assigned] (SPARK-20694) Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide
[ https://issues.apache.org/jira/browse/SPARK-20694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20694: Assignee: Apache Spark > Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide > -- > > Key: SPARK-20694 > URL: https://issues.apache.org/jira/browse/SPARK-20694 > Project: Spark > Issue Type: Documentation > Components: Documentation, Examples, SQL >Affects Versions: 2.2.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark > > - Spark SQL, DataFrames and Datasets Guide should contain a section about > partitioned, sorted and bucketed writes. > - Bucketing should be removed from Unsupported Hive Functionality. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20678) Ndv for columns not in filter condition should also be updated
[ https://issues.apache.org/jira/browse/SPARK-20678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20678. - Resolution: Fixed Fix Version/s: 2.3.0 2.2.1 Issue resolved by pull request 17918 [https://github.com/apache/spark/pull/17918] > Ndv for columns not in filter condition should also be updated > -- > > Key: SPARK-20678 > URL: https://issues.apache.org/jira/browse/SPARK-20678 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Zhenhua Wang > Fix For: 2.2.1, 2.3.0 > > > In filter estimation, we update column stats for the columns in the filter > condition. However, if the number of rows decreases after the filter (i.e. > the overall selectivity is less than 1), we need to update (scale down) the > number of distinct values (NDV) for all columns, no matter whether they are > in filter conditions or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20694) Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide
[ https://issues.apache.org/jira/browse/SPARK-20694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20694: Assignee: (was: Apache Spark) > Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide > -- > > Key: SPARK-20694 > URL: https://issues.apache.org/jira/browse/SPARK-20694 > Project: Spark > Issue Type: Documentation > Components: Documentation, Examples, SQL >Affects Versions: 2.2.0 >Reporter: Maciej Szymkiewicz > > - Spark SQL, DataFrames and Datasets Guide should contain a section about > partitioned, sorted and bucketed writes. > - Bucketing should be removed from Unsupported Hive Functionality. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20694) Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide
[ https://issues.apache.org/jira/browse/SPARK-20694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004523#comment-16004523 ] Apache Spark commented on SPARK-20694: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/17938 > Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide > -- > > Key: SPARK-20694 > URL: https://issues.apache.org/jira/browse/SPARK-20694 > Project: Spark > Issue Type: Documentation > Components: Documentation, Examples, SQL >Affects Versions: 2.2.0 >Reporter: Maciej Szymkiewicz > > - Spark SQL, DataFrames and Datasets Guide should contain a section about > partitioned, sorted and bucketed writes. > - Bucketing should be removed from Unsupported Hive Functionality. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20694) Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide
Maciej Szymkiewicz created SPARK-20694: -- Summary: Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide Key: SPARK-20694 URL: https://issues.apache.org/jira/browse/SPARK-20694 Project: Spark Issue Type: Documentation Components: Documentation, Examples, SQL Affects Versions: 2.2.0 Reporter: Maciej Szymkiewicz - Spark SQL, DataFrames and Datasets Guide should contain a section about partitioned, sorted and bucketed writes. - Bucketing should be removed from Unsupported Hive Functionality. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20688) correctly check analysis for scalar sub-queries
[ https://issues.apache.org/jira/browse/SPARK-20688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20688. - Resolution: Fixed Fix Version/s: 2.1.2 2.3.0 2.2.1 Issue resolved by pull request 17930 [https://github.com/apache/spark/pull/17930] > correctly check analysis for scalar sub-queries > --- > > Key: SPARK-20688 > URL: https://issues.apache.org/jira/browse/SPARK-20688 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.2.1, 2.3.0, 2.1.2 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20693) Kafka+SSL: path for security related files needs to be different for driver and executors
Daniel Lanza García created SPARK-20693: --- Summary: Kafka+SSL: path for security related files needs to be different for driver and executors Key: SPARK-20693 URL: https://issues.apache.org/jira/browse/SPARK-20693 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.1.1 Reporter: Daniel Lanza García Priority: Critical When consuming from or producing to Kafka with security enabled (SSL), you need to refer to the security-related files (keystore and truststore) in the configuration of the KafkaDirectStream. In the YARN-client scenario you would need to distribute these files; this can be achieved with the --files argument. Now, what is the path to these files, taking into account that both the driver and the executors interact with Kafka? When these files are accessed from the driver, you need to provide the local path to them. When they are accessed from the executors, you need to provide the name of the file that has been distributed with --files. The problem is that you can only configure one value for the path to these files. The configuration proposed here: http://www.opencore.com/blog/2017/1/spark-2-0-streaming-from-ssl-kafka-with-hdp-2-4/ works because both paths are the same (./truststore.jks). But if they differ, I do not think there is a way to configure Kafka+SSL -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
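The workaround the linked post relies on can be sketched as follows: ship the stores with spark-submit's --files and point everything at the same working-directory-relative name. The SSL property names are standard Kafka client configuration; the file names and passwords below are placeholders, and the sketch assumes the driver is also launched from a directory containing files with exactly these names, which is the limitation being reported.

```scala
// Hypothetical sketch, not a confirmed fix. Submit with:
//   spark-submit --files /local/path/truststore.jks,/local/path/keystore.jks ...
// so executors receive the stores in their working directories, and use a
// relative path that resolves the same way on driver and executors:
val kafkaParams: Map[String, Object] = Map(
  "security.protocol"       -> "SSL",
  "ssl.truststore.location" -> "./truststore.jks", // relative on both sides
  "ssl.truststore.password" -> "changeit",         // placeholder
  "ssl.keystore.location"   -> "./keystore.jks",
  "ssl.keystore.password"   -> "changeit"          // placeholder
)
```

If the driver's working directory cannot contain the files under those names, the single-valued location property is exactly what breaks, as the issue describes.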
[jira] [Resolved] (SPARK-20692) unknowing delay in event timeline
[ https://issues.apache.org/jira/browse/SPARK-20692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20692. --- Resolution: Invalid This isn't appropriate for a JIRA. Questions should go to u...@spark.apache.org. There are lots of reasons for delays between jobs; many things could be busy on the driver. It's not actionable. > unknowing delay in event timeline > - > > Key: SPARK-20692 > URL: https://issues.apache.org/jira/browse/SPARK-20692 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.2 > Environment: Spark 1.6.1 + kafka 0.8.2 >Reporter: Zhiwen Sun > Attachments: screenshot-1.png > > > Spark streaming job with 1s interval. > Process time of a micro batch suddenly became 4s while it is usually 0.4s. > When we check where the time was spent, we find an unknown delay in the job. > There is no executor computing or shuffle reading: about a 4s blank in the > event timeline. > An event timeline snapshot is in the attachment. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20393) Strengthen Spark to prevent XSS vulnerabilities
[ https://issues.apache.org/jira/browse/SPARK-20393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20393. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 17686 [https://github.com/apache/spark/pull/17686] > Strengthen Spark to prevent XSS vulnerabilities > --- > > Key: SPARK-20393 > URL: https://issues.apache.org/jira/browse/SPARK-20393 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.2, 2.0.2, 2.1.0 >Reporter: Nicholas Marion >Assignee: Nicholas Marion >Priority: Minor > Labels: security > Fix For: 2.3.0 > > > Using IBM Security AppScan Standard, we discovered several easy-to-recreate > MHTML cross-site scripting vulnerabilities in the Apache Spark Web GUI > application, and these vulnerabilities were found to exist in Spark versions > 1.5.2 and 2.0.2, the two levels we initially tested. A cross-site scripting > attack is not really an attack on the Spark server as much as an attack on > the end user, taking advantage of their trust in the Spark server to get them > to click on a URL like the ones in the examples below. So whether the user > could or could not change lots of stuff on the Spark server is not the key > point. It is an attack on the user themselves. If they click the link, the > script could run in their browser and compromise their device. Once the > browser is compromised it could submit Spark requests, but it also might not.
> https://blogs.technet.microsoft.com/srd/2011/01/28/more-information-about-the-mhtml-script-injection-vulnerability/ > {quote} > Request: GET > /app/?appId=Content-Type:%20multipart/related;%20boundary=_AppScan%0d%0a-- > _AppScan%0d%0aContent-Location:foo%0d%0aContent-Transfer- > Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a > HTTP/1.1 > Excerpt from response: No running application with ID > Content-Type: multipart/related; > boundary=_AppScan > --_AppScan > Content-Location:foo > Content-Transfer-Encoding:base64 > PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+ > > Result: In the above payload the BASE64 data decodes as: > alert("XSS") > Request: GET > /history/app-20161012202114-0038/stages/stage?id=1&attempt=0&task.sort=Content- > Type:%20multipart/related;%20boundary=_AppScan%0d%0a--_AppScan%0d%0aContent- > Location:foo%0d%0aContent-Transfer- > Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a&tas > k.pageSize=100 HTTP/1.1 > Excerpt from response: Content-Type: multipart/related; > boundary=_AppScan > --_AppScan > Content-Location:foo > Content-Transfer-Encoding:base64 > PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+ > Result: In the above payload the BASE64 data decodes as: > alert("XSS") > Request: GET /log?appId=app-20170113131903-&executorId=0&logType=Content- > Type:%20multipart/related;%20boundary=_AppScan%0d%0a--_AppScan%0d%0aContent- > Location:foo%0d%0aContent-Transfer- > Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a&byt > eLength=0 HTTP/1.1 > Excerpt from response: Bytes 0-0 of 0 of > /u/nmarion/Spark_2.0.2.0/Spark-DK/work/app-20170113131903-/0/Content- > Type: multipart/related; boundary=_AppScan > --_AppScan > Content-Location:foo > Content-Transfer-Encoding:base64 > PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+ > Result: In the above payload the BASE64 data decodes as: > 
alert("XSS") > {quote} > security@apache was notified and recommended a PR. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
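The class of fix such a PR typically applies is escaping user-controlled request parameters before echoing them into a response, so the injected payloads above render as inert text. A minimal sketch of that general technique (this is illustrative only, not the actual SPARK-20393 patch):

```scala
// Minimal output-escaping sketch against reflected XSS: replace HTML
// metacharacters in any user-controlled string with entities before it is
// written into a page, so markup like <script> renders as text instead of
// executing. Not the actual Spark patch; just the technique it represents.
def escapeHtml(s: String): String = s.flatMap {
  case '&'  => "&amp;"   // must come conceptually first: '&' starts entities
  case '<'  => "&lt;"
  case '>'  => "&gt;"
  case '"'  => "&quot;"
  case '\'' => "&#39;"
  case c    => c.toString
}
```

Applied to the decoded payload from the report, `<html><script>alert("XSS")</script></html>` becomes harmless escaped text rather than an executable script block.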
[jira] [Assigned] (SPARK-20393) Strengthen Spark to prevent XSS vulnerabilities
[ https://issues.apache.org/jira/browse/SPARK-20393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20393: - Assignee: Nicholas Marion Labels: security (was: newbie security) Priority: Minor (was: Major) > Strengthen Spark to prevent XSS vulnerabilities > --- > > Key: SPARK-20393 > URL: https://issues.apache.org/jira/browse/SPARK-20393 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.2, 2.0.2, 2.1.0 >Reporter: Nicholas Marion >Assignee: Nicholas Marion >Priority: Minor > Labels: security > Fix For: 2.3.0 > > > Using IBM Security AppScan Standard, we discovered several easy-to-recreate > MHTML cross-site scripting vulnerabilities in the Apache Spark Web GUI > application, and these vulnerabilities were found to exist in Spark versions > 1.5.2 and 2.0.2, the two levels we initially tested. A cross-site scripting > attack is not really an attack on the Spark server as much as an attack on > the end user, taking advantage of their trust in the Spark server to get them > to click on a URL like the ones in the examples below. So whether the user > could or could not change lots of stuff on the Spark server is not the key > point. It is an attack on the user themselves. If they click the link, the > script could run in their browser and compromise their device. Once the > browser is compromised it could submit Spark requests, but it also might not.
> https://blogs.technet.microsoft.com/srd/2011/01/28/more-information-about-the-mhtml-script-injection-vulnerability/ > {quote} > Request: GET > /app/?appId=Content-Type:%20multipart/related;%20boundary=_AppScan%0d%0a-- > _AppScan%0d%0aContent-Location:foo%0d%0aContent-Transfer- > Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a > HTTP/1.1 > Excerpt from response: No running application with ID > Content-Type: multipart/related; > boundary=_AppScan > --_AppScan > Content-Location:foo > Content-Transfer-Encoding:base64 > PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+ > > Result: In the above payload the BASE64 data decodes as: > alert("XSS") > Request: GET > /history/app-20161012202114-0038/stages/stage?id=1&attempt=0&task.sort=Content- > Type:%20multipart/related;%20boundary=_AppScan%0d%0a--_AppScan%0d%0aContent- > Location:foo%0d%0aContent-Transfer- > Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a&tas > k.pageSize=100 HTTP/1.1 > Excerpt from response: Content-Type: multipart/related; > boundary=_AppScan > --_AppScan > Content-Location:foo > Content-Transfer-Encoding:base64 > PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+ > Result: In the above payload the BASE64 data decodes as: > alert("XSS") > Request: GET /log?appId=app-20170113131903-&executorId=0&logType=Content- > Type:%20multipart/related;%20boundary=_AppScan%0d%0a--_AppScan%0d%0aContent- > Location:foo%0d%0aContent-Transfer- > Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a&byt > eLength=0 HTTP/1.1 > Excerpt from response: Bytes 0-0 of 0 of > /u/nmarion/Spark_2.0.2.0/Spark-DK/work/app-20170113131903-/0/Content- > Type: multipart/related; boundary=_AppScan > --_AppScan > Content-Location:foo > Content-Transfer-Encoding:base64 > PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+ > Result: In the above payload the BASE64 data decodes as: > 
<html><script>alert("XSS")</script></html> > {quote} > security@apache was notified and recommended a PR. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
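For reference, the base64 blob in the requests above can be decoded with a few lines of standalone Scala. The {{escapeHtml}} helper below is a hypothetical minimal sketch of the kind of output escaping that neutralizes this class of injection; it is not the actual patch applied to Spark's UI:

```scala
import java.util.Base64

object XssEscapeDemo {
  // Minimal HTML escaping for untrusted strings echoed into a page.
  // Hypothetical sketch only -- not the fix that was applied to Spark.
  def escapeHtml(s: String): String = s.flatMap {
    case '&' => "&amp;"
    case '<' => "&lt;"
    case '>' => "&gt;"
    case '"' => "&quot;"
    case c   => c.toString
  }

  def main(args: Array[String]): Unit = {
    // The base64 blob from the requests above
    val payload = "PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+"
    val decoded = new String(Base64.getDecoder.decode(payload), "UTF-8")
    println(decoded)             // <html><script>alert("XSS")</script></html>
    println(escapeHtml(decoded)) // the same text with <, >, " rendered inert
  }
}
```

Escaping at output time keeps a reflected parameter from being interpreted as markup by the browser, which is why the reflected {{Content-Type}} trick above stops working once the response body is encoded.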
[jira] [Updated] (SPARK-20692) unknowing delay in event timeline
[ https://issues.apache.org/jira/browse/SPARK-20692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhiwen Sun updated SPARK-20692: --- Description: Spark streaming job with a 1s interval. Process time of a micro batch suddenly became 4s while it is usually 0.4s. When we checked where the time was spent, we found an unknown delay in the job. There is no executor computing or shuffle reading; there is an approximately 4s blank in the event timeline. An event timeline snapshot is in the attachment. was: Spark streaming job with 1s interval. Process time of micro batch suddenly became to 4s while is is usually 0.4s . when we check where time spent, we find a unknown delay in job. there is no executor computing or shuffle reading. About 4s blank in event timeline, event timeline snapshot is in attachment. > unknowing delay in event timeline > - > > Key: SPARK-20692 > URL: https://issues.apache.org/jira/browse/SPARK-20692 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.2 > Environment: Spark 1.6.1 + kafka 0.8.2 >Reporter: Zhiwen Sun > Attachments: screenshot-1.png > > > Spark streaming job with a 1s interval. > Process time of a micro batch suddenly became 4s while it is usually 0.4s. > When we checked where the time was spent, we found an unknown delay in the job. > There is no executor computing or shuffle reading; there is an approximately 4s blank in the > event timeline. > An event timeline snapshot is in the attachment. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20692) unknowing delay in event timeline
[ https://issues.apache.org/jira/browse/SPARK-20692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhiwen Sun updated SPARK-20692: --- Description: Spark streaming job with a 1s interval. Process time of a micro batch suddenly became 4s while it is usually 0.4s. When we checked where the time was spent, we found an unknown delay in the job. There is no executor computing or shuffle reading; there is an approximately 4s blank in the event timeline. An event timeline snapshot is in the attachment. was: Spark streaming job with 1s interval. Process time of micro batch suddenly became to 4s while is is usually 0.4s . when we check where time spent, we find a unknown delay in job. there is no executor computing or shuffle reading. About 4s blank in event timeline, > unknowing delay in event timeline > - > > Key: SPARK-20692 > URL: https://issues.apache.org/jira/browse/SPARK-20692 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.2 > Environment: Spark 1.6.1 + kafka 0.8.2 >Reporter: Zhiwen Sun > Attachments: screenshot-1.png > > > Spark streaming job with a 1s interval. > Process time of a micro batch suddenly became 4s while it is usually 0.4s. > When we checked where the time was spent, we found an unknown delay in the job. There is no > executor computing or shuffle reading; there is an approximately 4s blank in the event timeline. > An event timeline snapshot is in the attachment. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20692) unknowing delay in event timeline
[ https://issues.apache.org/jira/browse/SPARK-20692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhiwen Sun updated SPARK-20692: --- Attachment: screenshot-1.png > unknowing delay in event timeline > - > > Key: SPARK-20692 > URL: https://issues.apache.org/jira/browse/SPARK-20692 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.2 > Environment: Spark 1.6.1 + kafka 0.8.2 >Reporter: Zhiwen Sun > Attachments: screenshot-1.png > > > Spark streaming job with a 1s interval. > Process time of a micro batch suddenly became 4s while it is usually 0.4s. > When we checked where the time was spent, we found an unknown delay in the job. There is no > executor computing or shuffle reading; there is an approximately 4s blank in the event timeline. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20692) unknowing delay in event timeline
Zhiwen Sun created SPARK-20692: -- Summary: unknowing delay in event timeline Key: SPARK-20692 URL: https://issues.apache.org/jira/browse/SPARK-20692 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 1.6.2 Environment: Spark 1.6.1 + kafka 0.8.2 Reporter: Zhiwen Sun Spark streaming job with a 1s interval. Process time of a micro batch suddenly became 4s while it is usually 0.4s. When we checked where the time was spent, we found an unknown delay in the job. There is no executor computing or shuffle reading; there is an approximately 4s blank in the event timeline. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20433) Security issue with jackson-databind
[ https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004417#comment-16004417 ] David Hodeffi commented on SPARK-20433: --- Did you upgrade json4s, since 3.2.1 is not compatible with 3.5? > Security issue with jackson-databind > > > Key: SPARK-20433 > URL: https://issues.apache.org/jira/browse/SPARK-20433 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Andrew Ash > Labels: security > > There was a security vulnerability recently reported to the upstream > jackson-databind project at > https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix > released. > From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the > first fixed versions in their respective 2.X branches, and versions in the > 2.6.X line and earlier remain vulnerable. > Right now Spark master branch is on 2.6.5: > https://github.com/apache/spark/blob/master/pom.xml#L164 > and Hadoop branch-2.7 is on 2.2.3: > https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71 > and Hadoop branch-3.0.0-alpha2 is on 2.7.8: > https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74 > We should try to find a way to get on a patched version of > jackson-databind for the Spark 2.2.0 release. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20691) Difference between Storage Memory as seen internally and in web UI
[ https://issues.apache.org/jira/browse/SPARK-20691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004413#comment-16004413 ] Sean Owen commented on SPARK-20691: --- I think that we have, unfortunately, not consistently differentiated between megabytes (MB, 10^6 = 1000000 bytes) and mebibytes (MiB, 2^20 = 1048576 bytes). The JavaScript function is actually correctly computing MB, while the rest of Spark, in {{Utils.bytesToString}}, is not. Where the user supplies a value like "700m", it's interpreted as mebibytes. That's fine, and at least unambiguous and not wrong. I don't think we want to change behavior, but we can make display and log output more consistent and correct. The inconsistency is bad, of course. The simple change is to change the JavaScript, and at least update its strings to say "MiB" etc. correctly. I think {{Utils.bytesToString}} should be changed too. If you want to be a hero, you might look for anywhere "MB" or "KB" occurs in the code and see whether the value is being computed correctly. > Difference between Storage Memory as seen internally and in web UI > -- > > Key: SPARK-20691 > URL: https://issues.apache.org/jira/browse/SPARK-20691 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski > > I set Major priority as it's visible to a user. > There's a difference between the size of Storage Memory as managed internally > and as displayed to a user in the web UI. > I found it while answering [How does web UI calculate Storage Memory (in > Executors tab)?|http://stackoverflow.com/q/43801062/1305344] on StackOverflow. > In short (quoting the main parts), when you start a Spark app (say > spark-shell) you see 912.3 MB RAM for Storage Memory: > {code} > $ ./bin/spark-shell --conf spark.driver.memory=2g > ... 
> 17/05/07 15:20:50 INFO BlockManagerMasterEndpoint: Registering block manager > 192.168.1.8:57177 with 912.3 MB RAM, BlockManagerId(driver, 192.168.1.8, > 57177, None) > {code} > but in the web UI you'll see 956.6 MB due to the way the custom JavaScript > function {{formatBytes}} in > [utils.js|https://github.com/apache/spark/blob/master/core/src/main/resources/org/apache/spark/ui/static/utils.js#L40-L48] > calculates the value. That translates to the following Scala code: > {code} > def formatBytes(bytes: Double) = { > val k = 1000 > val i = math.floor(math.log(bytes) / math.log(k)) > val maxMemoryWebUI = bytes / math.pow(k, i) > f"$maxMemoryWebUI%1.1f" > } > scala> println(formatBytes(maxMemory)) > 956.6 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
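The MB-vs-MiB discrepancy can be contrasted in a standalone sketch. The helper names below are assumptions for illustration; {{formatDecimal}} mirrors the base-1000 math of the web UI's {{formatBytes}} shown above, while {{formatBinary}} uses base-1024 scaling, which is what a "MiB" label would require (this is not Spark's actual {{Utils.bytesToString}}):

```scala
object BytesDemo {
  // Pin the locale so the decimal point is rendered as "." everywhere.
  java.util.Locale.setDefault(java.util.Locale.US)

  private val decimalUnits = Seq("B", "KB", "MB", "GB", "TB")
  private val binaryUnits  = Seq("B", "KiB", "MiB", "GiB", "TiB")

  // Base-1000 scaling, as in the web UI's formatBytes
  def formatDecimal(bytes: Double): String = {
    val k = 1000.0
    val i = math.floor(math.log(bytes) / math.log(k)).toInt
    f"${bytes / math.pow(k, i)}%1.1f ${decimalUnits(i)}"
  }

  // Base-1024 scaling, matching the binary "MiB" convention
  def formatBinary(bytes: Double): String = {
    val k = 1024.0
    val i = math.floor(math.log(bytes) / math.log(k)).toInt
    f"${bytes / math.pow(k, i)}%1.1f ${binaryUnits(i)}"
  }

  def main(args: Array[String]): Unit = {
    // 912.3 MiB expressed in bytes, as in the log excerpt above
    val maxMemory = (912.3 * 1024 * 1024).toLong.toDouble
    println(formatDecimal(maxMemory)) // 956.6 MB  -- what the web UI shows
    println(formatBinary(maxMemory))  // 912.3 MiB -- what the log labels "912.3 MB"
  }
}
```

The same byte count thus prints as 956.6 or 912.3 depending only on whether the divisor is 1000 or 1024, which is exactly the 912.3-vs-956.6 gap described in the issue.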
[jira] [Resolved] (SPARK-20433) Security issue with jackson-databind
[ https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20433. --- Resolution: Not A Problem OK, I don't see evidence that this isn't just an instance of the problem that was already patched. Updating past Jackson 2.6 is a separate and more complex issue; otherwise we'd just update to be safe. > Security issue with jackson-databind > > > Key: SPARK-20433 > URL: https://issues.apache.org/jira/browse/SPARK-20433 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Andrew Ash > Labels: security > > There was a security vulnerability recently reported to the upstream > jackson-databind project at > https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix > released. > From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the > first fixed versions in their respective 2.X branches, and versions in the > 2.6.X line and earlier remain vulnerable. > Right now Spark master branch is on 2.6.5: > https://github.com/apache/spark/blob/master/pom.xml#L164 > and Hadoop branch-2.7 is on 2.2.3: > https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71 > and Hadoop branch-3.0.0-alpha2 is on 2.7.8: > https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74 > We should try to find a way to get on a patched version of > jackson-databind for the Spark 2.2.0 release. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20687) mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix
[ https://issues.apache.org/jira/browse/SPARK-20687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20687: -- Priority: Minor (was: Critical) > mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix > > > Key: SPARK-20687 > URL: https://issues.apache.org/jira/browse/SPARK-20687 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.1.1 >Reporter: Ignacio Bermudez Corrales >Priority: Minor > > Conversion of Breeze sparse matrices to Matrix is broken when matrices are > the product of certain operations. I think this problem is caused by the update > method in Breeze's CSCMatrix, which adds provisional zeros to the data for > efficiency. > This bug is serious and may affect at least BlockMatrix addition and > subtraction > http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add/43883458#43883458 > The following code reproduces the bug (see test("breeze conversion bug")) > https://github.com/ghoto/spark/blob/test-bug/CSCMatrixBreeze/mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala > {code:title=MatricesSuite.scala|borderStyle=solid} > test("breeze conversion bug") { > // (2, 0, 0) > // (2, 0, 0) > val mat1Brz = Matrices.sparse(2, 3, Array(0, 2, 2, 2), Array(0, 1), > Array(2, 2)).asBreeze > // (2, 1E-15, 1E-15) > // (2, 1E-15, 1E-15) > val mat2Brz = Matrices.sparse(2, 3, Array(0, 2, 4, 6), Array(0, 0, 0, 1, > 1, 1), Array(2, 1E-15, 1E-15, 2, 1E-15, 1E-15)).asBreeze > // The following shouldn't break > val t01 = mat1Brz - mat1Brz > val t02 = mat2Brz - mat2Brz > val t02Brz = Matrices.fromBreeze(t02) > val t01Brz = Matrices.fromBreeze(t01) > val t1Brz = mat1Brz - mat2Brz > val t2Brz = mat2Brz - mat1Brz > // The following ones should break > val t1 = Matrices.fromBreeze(t1Brz) > val t2 = Matrices.fromBreeze(t2Brz) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org 
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20687) mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix
[ https://issues.apache.org/jira/browse/SPARK-20687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004372#comment-16004372 ] Sean Owen commented on SPARK-20687: --- This doesn't say what the problem is. What goes wrong? > mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix > > > Key: SPARK-20687 > URL: https://issues.apache.org/jira/browse/SPARK-20687 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.1.1 >Reporter: Ignacio Bermudez Corrales >Priority: Critical > > Conversion of Breeze sparse matrices to Matrix is broken when matrices are > the product of certain operations. I think this problem is caused by the update > method in Breeze's CSCMatrix, which adds provisional zeros to the data for > efficiency. > This bug is serious and may affect at least BlockMatrix addition and > subtraction > http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add/43883458#43883458 > The following code reproduces the bug (see test("breeze conversion bug")) > https://github.com/ghoto/spark/blob/test-bug/CSCMatrixBreeze/mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala > {code:title=MatricesSuite.scala|borderStyle=solid} > test("breeze conversion bug") { > // (2, 0, 0) > // (2, 0, 0) > val mat1Brz = Matrices.sparse(2, 3, Array(0, 2, 2, 2), Array(0, 1), > Array(2, 2)).asBreeze > // (2, 1E-15, 1E-15) > // (2, 1E-15, 1E-15) > val mat2Brz = Matrices.sparse(2, 3, Array(0, 2, 4, 6), Array(0, 0, 0, 1, > 1, 1), Array(2, 1E-15, 1E-15, 2, 1E-15, 1E-15)).asBreeze > // The following shouldn't break > val t01 = mat1Brz - mat1Brz > val t02 = mat2Brz - mat2Brz > val t02Brz = Matrices.fromBreeze(t02) > val t01Brz = Matrices.fromBreeze(t01) > val t1Brz = mat1Brz - mat2Brz > val t2Brz = mat2Brz - mat1Brz > // The following ones should break > val t1 = Matrices.fromBreeze(t1Brz) > val t2 = Matrices.fromBreeze(t2Brz) > } > {code} -- This message was sent by Atlassian JIRA 
(v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
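If the reporter's analysis is right, the provisional zeros that Breeze stores after an update leave the CSC arrays larger than the number of genuine non-zeros, which is what trips up the conversion. One workaround idea is to compact the raw CSC triple (colPtrs, rowIndices, values), dropping explicitly stored zeros, before handing it over. The following is a hypothetical standalone sketch with no Breeze or Spark dependency; the helper name and array layout are assumptions, not the fix that was eventually applied:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper: drop explicitly stored zeros from a CSC triple so
// that values.length == rowIndices.length == colPtrs.last again, which is
// the invariant a CSC-to-dense/sparse conversion typically relies on.
object CscCompact {
  def compact(numCols: Int,
              colPtrs: Array[Int],
              rowIndices: Array[Int],
              values: Array[Double]): (Array[Int], Array[Int], Array[Double]) = {
    val newPtrs = new Array[Int](numCols + 1)
    val rows    = ArrayBuffer.empty[Int]
    val vals    = ArrayBuffer.empty[Double]
    for (j <- 0 until numCols) {
      newPtrs(j) = rows.length
      // Copy only the genuinely non-zero entries of column j
      for (k <- colPtrs(j) until colPtrs(j + 1) if values(k) != 0.0) {
        rows += rowIndices(k)
        vals += values(k)
      }
    }
    newPtrs(numCols) = rows.length
    (newPtrs, rows.toArray, vals.toArray)
  }
}
```

For example, compacting a 2x3 matrix whose storage holds four explicit zeros (colPtrs = (0, 2, 4, 6) but only two real non-zeros in column 0) yields the tight triple colPtrs = (0, 2, 2, 2), rowIndices = (0, 1), values = (2, 2), matching the layout of {{mat1Brz}} above.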
[jira] [Assigned] (SPARK-20637) MappedRDD, FilteredRDD, etc. are still referenced in code comments
[ https://issues.apache.org/jira/browse/SPARK-20637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20637: - Assignee: Michael Mior > MappedRDD, FilteredRDD, etc. are still referenced in code comments > -- > > Key: SPARK-20637 > URL: https://issues.apache.org/jira/browse/SPARK-20637 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Michael Mior >Assignee: Michael Mior >Priority: Trivial > Fix For: 2.2.1 > > > There are only a couple instances of this, but it would be helpful to have > things updated to current references. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org