[jira] [Assigned] (SPARK-21880) [spark UI]In the SQL table page, modify jobs trace information
[ https://issues.apache.org/jira/browse/SPARK-21880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21880: Assignee: Apache Spark > [spark UI]In the SQL table page, modify jobs trace information > -- > > Key: SPARK-21880 > URL: https://issues.apache.org/jira/browse/SPARK-21880 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.0 >Reporter: he.qiao >Assignee: Apache Spark >Priority: Minor > > I think it makes sense for "jobs" to change to "job id" in the SQL table > page. Because when job 5 fails, it's easy to misunderstand that five jobs > have failed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21880) [spark UI]In the SQL table page, modify jobs trace information
[ https://issues.apache.org/jira/browse/SPARK-21880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21880: Assignee: (was: Apache Spark) > [spark UI]In the SQL table page, modify jobs trace information > -- > > Key: SPARK-21880 > URL: https://issues.apache.org/jira/browse/SPARK-21880 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.0 >Reporter: he.qiao >Priority: Minor > > I think it makes sense for "jobs" to change to "job id" in the SQL table > page. Because when job 5 fails, it's easy to misunderstand that five jobs > have failed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21880) [spark UI]In the SQL table page, modify jobs trace information
[ https://issues.apache.org/jira/browse/SPARK-21880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148521#comment-16148521 ] Apache Spark commented on SPARK-21880: -- User 'Geek-He' has created a pull request for this issue: https://github.com/apache/spark/pull/19093 > [spark UI]In the SQL table page, modify jobs trace information > -- > > Key: SPARK-21880 > URL: https://issues.apache.org/jira/browse/SPARK-21880 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.0 >Reporter: he.qiao >Priority: Minor > > I think it makes sense for "jobs" to change to "job id" in the SQL table > page. Because when job 5 fails, it's easy to misunderstand that five jobs > have failed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21880) [spark UI]In the SQL table page, modify jobs trace information
[ https://issues.apache.org/jira/browse/SPARK-21880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] he.qiao updated SPARK-21880: Summary: [spark UI]In the SQL table page, modify jobs trace information (was: [spark UI]In the SQL table page, ) > [spark UI]In the SQL table page, modify jobs trace information > -- > > Key: SPARK-21880 > URL: https://issues.apache.org/jira/browse/SPARK-21880 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.0 >Reporter: he.qiao >Priority: Minor > > I think it makes sense for "jobs" to change to "job id" in the SQL table > page. Because when job 5 fails, it's easy to misunderstand that five jobs > have failed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21880) [spark UI]In the SQL table page,
he.qiao created SPARK-21880: --- Summary: [spark UI]In the SQL table page, Key: SPARK-21880 URL: https://issues.apache.org/jira/browse/SPARK-21880 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.2.0 Reporter: he.qiao Priority: Minor I think it makes sense for "jobs" to change to "job id" in the SQL table page. Because when job 5 fails, it's easy to misunderstand that five jobs have failed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21879) Should Scalers handle NaN values?
zhengruifeng created SPARK-21879: Summary: Should Scalers handle NaN values? Key: SPARK-21879 URL: https://issues.apache.org/jira/browse/SPARK-21879 Project: Spark Issue Type: Question Components: ML Affects Versions: 2.3.0 Reporter: zhengruifeng The way {{ML.Scalers}} handle {{NaN}} is somewhat unexpected. The current implementations of {{MinMaxScaler}}/{{MaxAbsScaler}}/{{StandardScaler}} all support {{fit}} and {{transform}} on a dataset containing {{NaN}}. Note that the values in the second column of the following dataframe are all {{NaN}}, and the coefficients of {{min/max}} in {{MinMaxScalerModel}} and {{maxAbs}} in {{MaxAbsScaler}} are wrong. {code} import org.apache.spark.ml.feature._ import org.apache.spark.ml.linalg.{Vector, Vectors} scala> val data = Array( | Vectors.dense(1, Double.NaN, Double.NaN, 2.0), | Vectors.dense(2, Double.NaN, 0.0, 3.0), | Vectors.dense(3, Double.NaN, 0.0, 1.0), | Vectors.dense(6, Double.NaN, 2.0, Double.NaN)).zipWithIndex data: Array[(org.apache.spark.ml.linalg.Vector, Int)] = Array(([1.0,NaN,NaN,2.0],0), ([2.0,NaN,0.0,3.0],1), ([3.0,NaN,0.0,1.0],2), ([6.0,NaN,2.0,NaN],3)) scala> val df = data.toSeq.toDF("features", "id") df: org.apache.spark.sql.DataFrame = [features: vector, id: int] scala> val scaler = new MinMaxScaler().setInputCol("features").setOutputCol("scaled") scaler: org.apache.spark.ml.feature.MinMaxScaler = minMaxScal_7634802f5c81 scala> val model = scaler.fit(df) model: org.apache.spark.ml.feature.MinMaxScalerModel = minMaxScal_7634802f5c81 scala> model.originalMax res1: org.apache.spark.ml.linalg.Vector = [6.0,-1.7976931348623157E308,2.0,3.0] scala> model.originalMin res2: org.apache.spark.ml.linalg.Vector = [1.0,1.7976931348623157E308,0.0,1.0] scala> model.transform(df).select("scaled").collect res3: Array[org.apache.spark.sql.Row] = Array([[0.0,NaN,NaN,0.5]], [[0.2,NaN,0.0,1.0]], [[0.4,NaN,0.0,0.0]], [[1.0,NaN,1.0,NaN]]) scala> val scaler2 = new MaxAbsScaler().setInputCol("features").setOutputCol("scaled") scaler2: org.apache.spark.ml.feature.MaxAbsScaler = maxAbsScal_5d34fa818229 scala> val model2 = scaler2.fit(df) model2: org.apache.spark.ml.feature.MaxAbsScalerModel = maxAbsScal_5d34fa818229 scala> model2.maxAbs res4: org.apache.spark.ml.linalg.Vector = [6.0,1.7976931348623157E308,2.0,3.0] scala> model2.transform(df).select("scaled").collect res5: Array[org.apache.spark.sql.Row] = Array([[0.1,NaN,NaN,0.]], [[0.,NaN,0.0,1.0]], [[0.5,NaN,0.0,0.]], [[1.0,NaN,1.0,NaN]]) scala> val scaler3 = new StandardScaler().setInputCol("features").setOutputCol("scaled") scaler3: org.apache.spark.ml.feature.StandardScaler = stdScal_d8509095e860 scala> val model3 = scaler3.fit(df) model3: org.apache.spark.ml.feature.StandardScalerModel = stdScal_d8509095e860 scala> model3.std res11: org.apache.spark.ml.linalg.Vector = [2.160246899469287,NaN,NaN,NaN] scala> model3.mean res12: org.apache.spark.ml.linalg.Vector = [3.0,NaN,NaN,NaN] scala> model3.transform(df).select("scaled").collect res14: Array[org.apache.spark.sql.Row] = Array([[0.4629100498862757,NaN,NaN,NaN]], [[0.9258200997725514,NaN,NaN,NaN]], [[1.3887301496588271,NaN,NaN,NaN]], [[2.7774602993176543,NaN,NaN,NaN]]) {code} I then tested the scalers in scikit-learn, and they all throw exceptions in both {{fit}} and {{transform}}.
{code} import numpy as np from sklearn.preprocessing import * data = np.array([[-1, 2], [-0.5, 6], [0, np.nan], [1, 1.8]]) data2 = np.array([[-1, 2], [-0.5, 6], [0, 2.0], [1, 1.8]]) for scaler in [StandardScaler(), MinMaxScaler(), MaxAbsScaler(), RobustScaler()]: try: scaler.fit(data) except: print('{0}.fit fails'.format(scaler)) model = scaler.fit(data2) try: model.transform(data) except: print('{0}.transform fails'.format(scaler)) StandardScaler(copy=True, with_mean=True, with_std=True).fit fails StandardScaler(copy=True, with_mean=True, with_std=True).transform fails MinMaxScaler(copy=True, feature_range=(0, 1)).fit fails MinMaxScaler(copy=True, feature_range=(0, 1)).transform fails MaxAbsScaler(copy=True).fit fails MaxAbsScaler(copy=True).transform fails RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True, with_scaling=True).fit fails RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True, with_scaling=True).transform fails {code} I think the behavior of handling {{NaN}} should be kept in line with the implementation in scikit-learn: the scaled data are likely to be fed into classification/regression/clustering or other algorithms, and it will be dangerous if users are unaware of the {{NaN}} values in the scaled data. There may be two choices if we decide to change the behavior: 1, add validation for input data and throw exception
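The first option above, validating the input and throwing, can already be approximated on the user side while the built-in behavior is being discussed. Below is a minimal Scala sketch that reuses the {{df}} from the example above; the helper name {{requireNoNaN}} is hypothetical and not a Spark API.
{code}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.udf

// Hypothetical user-side guard: fail fast if any feature vector contains NaN,
// so the fitted statistics (min/max/maxAbs/std) are never polluted.
def requireNoNaN(ds: Dataset[_], featuresCol: String): Unit = {
  val hasNaN = udf { v: Vector => v.toArray.exists(_.isNaN) }
  val bad = ds.filter(hasNaN(ds(featuresCol))).count()
  require(bad == 0, s"$bad row(s) contain NaN in column '$featuresCol'")
}

requireNoNaN(df, "features")   // throws here instead of silently fitting on NaN
{code}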
[jira] [Assigned] (SPARK-21878) Create SQLMetricsTestUtils
[ https://issues.apache.org/jira/browse/SPARK-21878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21878: Assignee: Xiao Li (was: Apache Spark) > Create SQLMetricsTestUtils > -- > > Key: SPARK-21878 > URL: https://issues.apache.org/jira/browse/SPARK-21878 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Creates `SQLMetricsTestUtils` for the utility functions of both Hive-specific > and the other SQLMetrics test cases. > Also, move two SQLMetrics test cases from sql/hive to sql/core. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21878) Create SQLMetricsTestUtils
[ https://issues.apache.org/jira/browse/SPARK-21878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21878: Assignee: Apache Spark (was: Xiao Li) > Create SQLMetricsTestUtils > -- > > Key: SPARK-21878 > URL: https://issues.apache.org/jira/browse/SPARK-21878 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark > > Creates `SQLMetricsTestUtils` for the utility functions of both Hive-specific > and the other SQLMetrics test cases. > Also, move two SQLMetrics test cases from sql/hive to sql/core. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21878) Create SQLMetricsTestUtils
[ https://issues.apache.org/jira/browse/SPARK-21878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148494#comment-16148494 ] Apache Spark commented on SPARK-21878: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/19092 > Create SQLMetricsTestUtils > -- > > Key: SPARK-21878 > URL: https://issues.apache.org/jira/browse/SPARK-21878 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Creates `SQLMetricsTestUtils` for the utility functions of both Hive-specific > and the other SQLMetrics test cases. > Also, move two SQLMetrics test cases from sql/hive to sql/core. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21878) Create SQLMetricsTestUtils
Xiao Li created SPARK-21878: --- Summary: Create SQLMetricsTestUtils Key: SPARK-21878 URL: https://issues.apache.org/jira/browse/SPARK-21878 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Xiao Li Assignee: Xiao Li Creates `SQLMetricsTestUtils` for the utility functions of both Hive-specific and the other SQLMetrics test cases. Also, move two SQLMetrics test cases from sql/hive to sql/core. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20313) Possible lack of join optimization when partitions are in the join condition
[ https://issues.apache.org/jira/browse/SPARK-20313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148463#comment-16148463 ] Tejas Patil commented on SPARK-20313: - I tried to replicate what you shared in the jira but dont see anything wrong with what planner is doing. Comparing both the approaches, `SortMergeJoin` is always being picked. The second approach does joins over individual partitions one by one and then unions the results. Depending on your data size + configs, it might be possible that for your case a hash based join was used which would explain why the later approach is faster. Approach #1 {noformat} val df1 = hc.sql("SELECT * FROM bucketed_partitioned_1").filter(functions.col("ds").between("1", "5")) val df2 = hc.sql("SELECT * FROM bucketed_partitioned_2").filter(functions.col("ds").between("1", "5")) val df3 = df1.join(df2, Seq("ds", "user_id")).explain(true) == Physical Plan == *Project [ds#38, user_id#36, name#37, name#45] +- *SortMergeJoin [ds#38, user_id#36], [ds#46, user_id#44], [ds#38, user_id#36], [ds#46, user_id#44], Inner :- *Sort [ds#38 ASC NULLS FIRST, user_id#36 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(ds#38, user_id#36, 200) : +- *Filter isnotnull(user_id#36) :+- HiveTableScan [user_id#36, name#37, ds#38], HiveTableRelation `default`.`bucketed_partitioned_1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [user_id#36, name#37], [ds#38], [isnotnull(ds#38), (ds#38 >= 1), (ds#38 <= 5)] +- *Sort [ds#46 ASC NULLS FIRST, user_id#44 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(ds#46, user_id#44, 200) +- *Filter isnotnull(user_id#44) +- HiveTableScan [user_id#44, name#45, ds#46], HiveTableRelation `default`.`bucketed_partitioned_2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [user_id#44, name#45], [ds#46], [isnotnull(ds#46), (ds#46 >= 1), (ds#46 <= 5)] {noformat} Approach #2 {noformat} val df1 = hc.sql("SELECT * FROM bucketed_partitioned_1") val df2 = hc.sql("SELECT * FROM bucketed_partitioned_2") val dsValues = Seq("-11-11", "-44-44") val df3 = dsValues.map(dsValue => { val df1filtered = df1.filter(functions.col("ds") === dsValue) val df2filtered = df2.filter(functions.col("ds") === dsValue) df1filtered.join(df2filtered, Seq("user_id")) // part1 removed from join }).reduce(_ union _) == Physical Plan == Union :- *Project [user_id#63, name#64, ds#65, name#71, ds#72] : +- *SortMergeJoin [user_id#63], [user_id#70], [user_id#63], [user_id#70], Inner : :- *Sort [user_id#63 ASC NULLS FIRST], false, 0 : : +- Exchange hashpartitioning(user_id#63, 200) : : +- *Filter isnotnull(user_id#63) : :+- HiveTableScan [user_id#63, name#64, ds#65], HiveTableRelation `default`.`bucketed_partitioned_1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [user_id#63, name#64], [ds#65], [isnotnull(ds#65), (ds#65 = -11-11)] : +- *Sort [user_id#70 ASC NULLS FIRST], false, 0 :+- Exchange hashpartitioning(user_id#70, 200) : +- *Filter isnotnull(user_id#70) : +- HiveTableScan [user_id#70, name#71, ds#72], HiveTableRelation `default`.`bucketed_partitioned_2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [user_id#70, name#71], [ds#72], [isnotnull(ds#72), (ds#72 = -11-11)] +- *Project [user_id#63, name#64, ds#65, name#71, ds#72] +- *SortMergeJoin [user_id#63], [user_id#70], [user_id#63], [user_id#70], Inner :- *Sort [user_id#63 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(user_id#63, 200) : +- *Filter isnotnull(user_id#63) :+- HiveTableScan [user_id#63, name#64, ds#65], HiveTableRelation 
`default`.`bucketed_partitioned_1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [user_id#63, name#64], [ds#65], [isnotnull(ds#65), (ds#65 = -44-44)] +- *Sort [user_id#70 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(user_id#70, 200) +- *Filter isnotnull(user_id#70) +- HiveTableScan [user_id#70, name#71, ds#72], HiveTableRelation `default`.`bucketed_partitioned_2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [user_id#70, name#71], [ds#72], [isnotnull(ds#72), (ds#72 = -44-44)] {noformat} > Possible lack of join optimization when partitions are in the join condition > > > Key: SPARK-20313 > URL: https://issues.apache.org/jira/browse/SPARK-20313 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Albert Meltzer > > Given two tables T1 and T2, partitioned on column part1, the following have > vastly different execution performance: > // initial, slow > {noformat} > val df1 = // load data from
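One quick way to probe the hash-join hypothesis above is to force a broadcast hash join on a single per-partition slice and compare its plan and runtime against the {{SortMergeJoin}} plans shown. This is a rough sketch only; it reuses {{df1}}/{{df2}} from approach #2 and assumes a {{SparkSession}} named {{spark}} (the original snippets use {{hc}}).
{code}
import org.apache.spark.sql.functions
import org.apache.spark.sql.functions.broadcast

// Force a BroadcastHashJoin on one slice and inspect the resulting plan.
val d1 = df1.filter(functions.col("ds") === "-11-11")
val d2 = df2.filter(functions.col("ds") === "-11-11")
d1.join(broadcast(d2), Seq("user_id")).explain(true)

// Or raise the planner's auto-broadcast threshold (bytes) so it can pick a
// broadcast hash join on its own when one side is small enough.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)
{code}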
[jira] [Resolved] (SPARK-21583) Create a ColumnarBatch with ArrowColumnVectors for row based iteration
[ https://issues.apache.org/jira/browse/SPARK-21583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-21583. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18787 [https://github.com/apache/spark/pull/18787] > Create a ColumnarBatch with ArrowColumnVectors for row based iteration > -- > > Key: SPARK-21583 > URL: https://issues.apache.org/jira/browse/SPARK-21583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler > Fix For: 2.3.0 > > > The existing {{ArrowColumnVector}} creates a read-only vector of Arrow data. > It would be useful to be able to create a {{ColumnarBatch}} to allow row > based iteration over multiple {{ArrowColumnVectors}}. This would avoid extra > copying to translate column elements into rows and be more efficient memory > usage while increasing performance. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21583) Create a ColumnarBatch with ArrowColumnVectors for row based iteration
[ https://issues.apache.org/jira/browse/SPARK-21583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin reassigned SPARK-21583: - Assignee: Bryan Cutler > Create a ColumnarBatch with ArrowColumnVectors for row based iteration > -- > > Key: SPARK-21583 > URL: https://issues.apache.org/jira/browse/SPARK-21583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler > > The existing {{ArrowColumnVector}} creates a read-only vector of Arrow data. > It would be useful to be able to create a {{ColumnarBatch}} to allow row > based iteration over multiple {{ArrowColumnVectors}}. This would avoid extra > copying to translate column elements into rows and be more efficient memory > usage while increasing performance. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21534) PickleException when creating dataframe from python row with empty bytearray
[ https://issues.apache.org/jira/browse/SPARK-21534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-21534: Assignee: Liang-Chi Hsieh > PickleException when creating dataframe from python row with empty bytearray > > > Key: SPARK-21534 > URL: https://issues.apache.org/jira/browse/SPARK-21534 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Maciej Bryński >Assignee: Liang-Chi Hsieh > Fix For: 2.3.0 > > > {code} > spark.createDataFrame(spark.sql("select unhex('') as xx").rdd.map(lambda x: > {"abc": x.xx})).show() > {code} > This code creates exception. It looks like corner-case. > {code} > net.razorvine.pickle.PickleException: invalid pickle data for bytearray; > expected 1 or 2 args, got 0 > at > net.razorvine.pickle.objects.ByteArrayConstructor.construct(ByteArrayConstructor.java:20) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:99) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:152) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:151) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at 
scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit
[jira] [Resolved] (SPARK-21534) PickleException when creating dataframe from python row with empty bytearray
[ https://issues.apache.org/jira/browse/SPARK-21534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21534. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19085 [https://github.com/apache/spark/pull/19085] > PickleException when creating dataframe from python row with empty bytearray > > > Key: SPARK-21534 > URL: https://issues.apache.org/jira/browse/SPARK-21534 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Maciej Bryński > Fix For: 2.3.0 > > > {code} > spark.createDataFrame(spark.sql("select unhex('') as xx").rdd.map(lambda x: > {"abc": x.xx})).show() > {code} > This code creates exception. It looks like corner-case. > {code} > net.razorvine.pickle.PickleException: invalid pickle data for bytearray; > expected 1 or 2 args, got 0 > at > net.razorvine.pickle.objects.ByteArrayConstructor.construct(ByteArrayConstructor.java:20) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:99) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:152) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:151) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at > 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336) > at > org.apach
[jira] [Commented] (SPARK-16854) mapWithState Support for Python
[ https://issues.apache.org/jira/browse/SPARK-16854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148343#comment-16148343 ] Radek Ostrowski commented on SPARK-16854: - +1 > mapWithState Support for Python > --- > > Key: SPARK-16854 > URL: https://issues.apache.org/jira/browse/SPARK-16854 > Project: Spark > Issue Type: Task > Components: PySpark >Affects Versions: 1.6.2, 2.0.0 >Reporter: Boaz > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11574) Spark should support StatsD sink out of box
[ https://issues.apache.org/jira/browse/SPARK-11574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148333#comment-16148333 ] Saisai Shao commented on SPARK-11574: - Maybe it is a permission issue on my side; I will ask the PMC to handle it. > Spark should support StatsD sink out of box > --- > > Key: SPARK-11574 > URL: https://issues.apache.org/jira/browse/SPARK-11574 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 1.6.1 >Reporter: Xiaofeng Lin > Fix For: 2.3.0 > > > In order to run spark in production, monitoring is essential. StatsD is such > a common metric reporting mechanism that it should be supported out of the > box. This will enable publishing metrics to monitoring services like > datadog, etc. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11574) Spark should support StatsD sink out of box
[ https://issues.apache.org/jira/browse/SPARK-11574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148328#comment-16148328 ] Xiaofeng Lin commented on SPARK-11574: -- [~jerryshao], yes my JIRA username is still active. It's "xflin". > Spark should support StatsD sink out of box > --- > > Key: SPARK-11574 > URL: https://issues.apache.org/jira/browse/SPARK-11574 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 1.6.1 >Reporter: Xiaofeng Lin > Fix For: 2.3.0 > > > In order to run spark in production, monitoring is essential. StatsD is such > a common metric reporting mechanism that it should be supported out of the > box. This will enable publishing metrics to monitoring services like > datadog, etc. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17321) YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
[ https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao reassigned SPARK-17321: --- Assignee: Saisai Shao > YARN shuffle service should use good disk from yarn.nodemanager.local-dirs > -- > > Key: SPARK-17321 > URL: https://issues.apache.org/jira/browse/SPARK-17321 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.2, 2.0.0, 2.1.1 >Reporter: yunjiong zhao >Assignee: Saisai Shao > Fix For: 2.3.0 > > > We run spark on yarn, after enabled spark dynamic allocation, we notice some > spark application failed randomly due to YarnShuffleService. > From log I found > {quote} > 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: > Error while initializing Netty pipeline > java.lang.NullPointerException > at > org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77) > at > org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159) > at > org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135) > at > org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123) > at > org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116) > at > io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378) > at > io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {quote} > Which caused by the first disk in yarn.nodemanager.local-dirs was broken. > If we enabled spark.yarn.shuffle.stopOnFailure(SPARK-16505) we might lost > hundred nodes which is unacceptable. > We have 12 disks in yarn.nodemanager.local-dirs, so why not use other good > disks if the first one is broken? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11574) Spark should support StatsD sink out of box
[ https://issues.apache.org/jira/browse/SPARK-11574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148319#comment-16148319 ] Hyukjin Kwon commented on SPARK-11574: -- I can't assign the ID either, and I met this issue too - https://issues.apache.org/jira/browse/SPARK-21658?focusedCommentId=16125906&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16125906 - but I just ended up leaving it. > Spark should support StatsD sink out of box > --- > > Key: SPARK-11574 > URL: https://issues.apache.org/jira/browse/SPARK-11574 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 1.6.1 >Reporter: Xiaofeng Lin > Fix For: 2.3.0 > > > In order to run spark in production, monitoring is essential. StatsD is such > a common metric reporting mechanism that it should be supported out of the > box. This will enable publishing metrics to monitoring services like > datadog, etc. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17321) YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
[ https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-17321. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19032 [https://github.com/apache/spark/pull/19032] > YARN shuffle service should use good disk from yarn.nodemanager.local-dirs > -- > > Key: SPARK-17321 > URL: https://issues.apache.org/jira/browse/SPARK-17321 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.2, 2.0.0, 2.1.1 >Reporter: yunjiong zhao > Fix For: 2.3.0 > > > We run spark on yarn, after enabled spark dynamic allocation, we notice some > spark application failed randomly due to YarnShuffleService. > From log I found > {quote} > 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: > Error while initializing Netty pipeline > java.lang.NullPointerException > at > org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77) > at > org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159) > at > org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135) > at > org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123) > at > org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116) > at > io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378) > at > io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {quote} > Which caused by the first disk in yarn.nodemanager.local-dirs was broken. > If we enabled spark.yarn.shuffle.stopOnFailure(SPARK-16505) we might lost > hundred nodes which is unacceptable. > We have 12 disks in yarn.nodemanager.local-dirs, so why not use other good > disks if the first one is broken? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11574) Spark should support StatsD sink out of box
[ https://issues.apache.org/jira/browse/SPARK-11574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148292#comment-16148292 ] Saisai Shao commented on SPARK-11574: - Hi Xiaofeng, is your JIRA username still available? I cannot assign the JIRA to you, since I cannot find your name. [~srowen] [~hyukjin.kwon] do you know how to handle this situation? > Spark should support StatsD sink out of box > --- > > Key: SPARK-11574 > URL: https://issues.apache.org/jira/browse/SPARK-11574 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 1.6.1 >Reporter: Xiaofeng Lin > Fix For: 2.3.0 > > > In order to run spark in production, monitoring is essential. StatsD is such > a common metric reporting mechanism that it should be supported out of the > box. This will enable publishing metrics to monitoring services like > datadog, etc. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11574) Spark should support StatsD sink out of box
[ https://issues.apache.org/jira/browse/SPARK-11574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-11574. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 9518 [https://github.com/apache/spark/pull/9518] > Spark should support StatsD sink out of box > --- > > Key: SPARK-11574 > URL: https://issues.apache.org/jira/browse/SPARK-11574 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 1.6.1 >Reporter: Xiaofeng Lin > Fix For: 2.3.0 > > > In order to run spark in production, monitoring is essential. StatsD is such > a common metric reporting mechanism that it should be supported out of the > box. This will enable publishing metrics to monitoring services like > datadog, etc. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
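For readers wondering what "out of the box" support looks like once this ships (Fix For: 2.3.0 above), the sink is enabled through Spark's standard metrics configuration file. The sketch below is hedged: the {{StatsdSink}} class name follows the merged patch, but the individual host/port/prefix keys are assumptions patterned on how the existing Graphite sink is configured, so check the 2.3.0 documentation for the exact names.
{code}
# metrics.properties sketch (key names other than the class are assumptions)
*.sink.statsd.class=org.apache.spark.metrics.sink.StatsdSink
*.sink.statsd.host=127.0.0.1
*.sink.statsd.port=8125
*.sink.statsd.prefix=myapp
{code}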
[jira] [Assigned] (SPARK-21877) Windows command script can not handle quotes in parameter
[ https://issues.apache.org/jira/browse/SPARK-21877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21877: Assignee: Apache Spark > Windows command script can not handle quotes in parameter > - > > Key: SPARK-21877 > URL: https://issues.apache.org/jira/browse/SPARK-21877 > Project: Spark > Issue Type: Bug > Components: Deploy, Windows >Affects Versions: 2.2.0 > Environment: Spark version: spark-2.2.0-bin-hadoop2.7 > Windows version: Windows 10 >Reporter: Xiaokai Zhao >Assignee: Apache Spark > > All the Windows command scripts cannot handle quotes in parameters. > Running a Windows command script with a parameter that contains quotes reproduces the > bug: > {quote} > C:\Users\meng\software\spark-2.2.0-bin-hadoop2.7> bin\spark-shell > --driver-java-options " -Dfile.encoding=utf-8 " > 'C:\Users\meng\software\spark-2.2.0-bin-hadoop2.7\bin\spark-shell2.cmd" > --driver-java-options "' is not recognized as an internal or external command, > operable program or batch file. > {quote} > Windows recognizes "--driver-java-options" as part of the command. > All the Windows command scripts that contain the following code have the bug. > {quote} > cmd /V /E /C "" %* > {quote} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21877) Windows command script can not handle quotes in parameter
[ https://issues.apache.org/jira/browse/SPARK-21877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21877: Assignee: (was: Apache Spark) > Windows command script can not handle quotes in parameter > - > > Key: SPARK-21877 > URL: https://issues.apache.org/jira/browse/SPARK-21877 > Project: Spark > Issue Type: Bug > Components: Deploy, Windows >Affects Versions: 2.2.0 > Environment: Spark version: spark-2.2.0-bin-hadoop2.7 > Windows version: Windows 10 >Reporter: Xiaokai Zhao > > All the Windows command scripts cannot handle quotes in parameters. > Running a Windows command script with a parameter that contains quotes reproduces the > bug: > {quote} > C:\Users\meng\software\spark-2.2.0-bin-hadoop2.7> bin\spark-shell > --driver-java-options " -Dfile.encoding=utf-8 " > 'C:\Users\meng\software\spark-2.2.0-bin-hadoop2.7\bin\spark-shell2.cmd" > --driver-java-options "' is not recognized as an internal or external command, > operable program or batch file. > {quote} > Windows recognizes "--driver-java-options" as part of the command. > All the Windows command scripts that contain the following code have the bug. > {quote} > cmd /V /E /C "" %* > {quote} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21877) Windows command script can not handle quotes in parameter
[ https://issues.apache.org/jira/browse/SPARK-21877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148278#comment-16148278 ] Apache Spark commented on SPARK-21877: -- User 'minixalpha' has created a pull request for this issue: https://github.com/apache/spark/pull/19090 > Windows command script can not handle quotes in parameter > - > > Key: SPARK-21877 > URL: https://issues.apache.org/jira/browse/SPARK-21877 > Project: Spark > Issue Type: Bug > Components: Deploy, Windows >Affects Versions: 2.2.0 > Environment: Spark version: spark-2.2.0-bin-hadoop2.7 > Windows version: Windows 10 >Reporter: Xiaokai Zhao > > All the Windows command scripts cannot handle quotes in parameters. > Running a Windows command script with a parameter that contains quotes reproduces the > bug: > {quote} > C:\Users\meng\software\spark-2.2.0-bin-hadoop2.7> bin\spark-shell > --driver-java-options " -Dfile.encoding=utf-8 " > 'C:\Users\meng\software\spark-2.2.0-bin-hadoop2.7\bin\spark-shell2.cmd" > --driver-java-options "' is not recognized as an internal or external command, > operable program or batch file. > {quote} > Windows recognizes "--driver-java-options" as part of the command. > All the Windows command scripts that contain the following code have the bug. > {quote} > cmd /V /E /C "" %* > {quote} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21877) Windows command script can not handle quotes in parameter
Xiaokai Zhao created SPARK-21877: Summary: Windows command script can not handle quotes in parameter Key: SPARK-21877 URL: https://issues.apache.org/jira/browse/SPARK-21877 Project: Spark Issue Type: Bug Components: Deploy, Windows Affects Versions: 2.2.0 Environment: Spark version: spark-2.2.0-bin-hadoop2.7 Windows version: Windows 10 Reporter: Xiaokai Zhao All the Windows command scripts cannot handle quotes in parameters. Running a Windows command script with a parameter that contains quotes reproduces the bug: {quote} C:\Users\meng\software\spark-2.2.0-bin-hadoop2.7> bin\spark-shell --driver-java-options " -Dfile.encoding=utf-8 " 'C:\Users\meng\software\spark-2.2.0-bin-hadoop2.7\bin\spark-shell2.cmd" --driver-java-options "' is not recognized as an internal or external command, operable program or batch file. {quote} Windows recognizes "--driver-java-options" as part of the command. All the Windows command scripts that contain the following code have the bug. {quote} cmd /V /E /C "" %* {quote} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148268#comment-16148268 ] Joseph K. Bradley commented on SPARK-21866: --- It's a valid question, but overall, I'd support this effort. My thoughts: Summary: Image processing use cases have become increasingly important, especially because of the rise of interest in deep learning. It's valuable to standardize around a common format, partly for users and partly for developers. Q: Are images a common data type? I.e., if we were talking about adding support for storing text in Spark DataFrames, there would be no question that Spark must be able to handle text since it is such a common data format. Are images common enough to merit inclusion in Spark? A: I'd argue yes, partly because of the rise in requests around it. But also, if it makes sense for a general purpose language like Java to contain image formats, then it likewise makes sense for a general purpose data processing library like Spark to contain image formats. This does not duplicate functionality from java.awt (or other libraries) since the key elements being added here are Spark-specific: a Spark DataFrame schema and a Spark Data Source. Q: Will leaving this functionality in a package, rather than putting it in Spark, be sufficient? A: I worry that this will limit adoption, as well as community oversight of such a core piece of functionality. Tooling built upon image formats, including image processing algorithms, could live outside of Spark, but basic image loading and saving should IMO live in Spark. Q: Will users really benefit? A: My main reason to support this is confusion I've heard about the right way to handle images in Spark. They are sometimes handled outside of Spark's data model (often giving up proper resilience guarantees), are handled by falling back to the RDD API, etc. I hope that standardization will simplify life for users (clarifying and standardizing APIs) and library developers (facilitating collaboration on image ETL). > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark.pdf > > > h2. Background and motivation > As Apache Spark is being used more and more in the industry, some new use > cases are emerging for different data formats beyond the traditional SQL > types or the numerical types (vectors and matrices). Deep Learning > applications commonly deal with image processing. A number of projects add > some Deep Learning capabilities to Spark (see list below), but they struggle > to communicate with each other or with MLlib pipelines because there is no > standard way to represent an image in Spark DataFrames. We propose to > federate efforts for representing images in Spark by defining a > representation that caters to the most common needs of users and library > developers. > This SPIP proposes a specification to represent images in Spark DataFrames > and Datasets (based on existing industrial standards), and an interface for > loading sources of images. It is not meant to be a full-fledged image > processing library, but rather the core description that other libraries and > users can rely on. 
Several packages already offer various processing > facilities for transforming images or doing more complex operations, and each > has various design tradeoffs that make them better as standalone solutions. > This project is a joint collaboration between Microsoft and Databricks, which > have been testing this design in two open source packages: MMLSpark and Deep > Learning Pipelines. > The proposed image format is an in-memory, decompressed representation that > targets low-level applications. It is significantly more liberal in memory > usage than compressed image representations such as JPEG, PNG, etc., but it > allows easy communication with popular image processing libraries and has no > decoding overhead. > h2. Targets users and personas: > Data scientists, data engineers, library developers. > The following libraries define primitives for loading and representing > images, and will gain from a common interchange format (in alphabetical > order): > * BigDL > * DeepLearning4J > * Deep Learning Pipelines > * MMLSpark > * TensorFlow (Spark connector) > * TensorFlowOnSpark > * TensorFrames > * Thunder > h2. Goals: > * Simple representation of images in Spark DataFrames, based on pre-existing > industrial standards (OpenCV) > * This format should eventually allow the development of high-performance > integration points with ima
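To make the proposal more concrete, the "Spark DataFrame schema" under discussion is essentially a struct column describing one decompressed image per row. The sketch below is illustrative only; the field names and types are assumptions loosely following OpenCV conventions, and the authoritative definition is the attached SPIP document rather than this snippet.
{code}
import org.apache.spark.sql.types._

// One possible shape for an image struct column (not the SPIP's final schema).
val imageStruct = StructType(Seq(
  StructField("origin", StringType, nullable = true),     // where the image was loaded from
  StructField("height", IntegerType, nullable = false),
  StructField("width", IntegerType, nullable = false),
  StructField("nChannels", IntegerType, nullable = false),
  StructField("mode", IntegerType, nullable = false),     // OpenCV-style pixel type code
  StructField("data", BinaryType, nullable = false)       // decompressed pixel bytes
))
{code}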
[jira] [Assigned] (SPARK-21875) Jenkins passes Java code that violates ./dev/lint-java
[ https://issues.apache.org/jira/browse/SPARK-21875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-21875: Assignee: Andrew Ash > Jenkins passes Java code that violates ./dev/lint-java > -- > > Key: SPARK-21875 > URL: https://issues.apache.org/jira/browse/SPARK-21875 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Assignee: Andrew Ash >Priority: Trivial > Fix For: 2.3.0 > > > Two recent PRs merged which caused lint-java errors: > {noformat} > > Running Java style checks > > Using `mvn` from path: /home/ubuntu/spark/build/apache-maven-3.5.0/bin/mvn > Checkstyle checks failed at following occurrences: > [ERROR] src/main/java/org/apache/spark/memory/TaskMemoryManager.java:[77] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[1340] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [error] running /home/ubuntu/spark/dev/lint-java ; received return code 1 > {noformat} > The first error is from https://github.com/apache/spark/pull/19025 and the > second is from https://github.com/apache/spark/pull/18488 > Should we be expecting Jenkins to enforce Java code style pre-commit? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21875) Jenkins passes Java code that violates ./dev/lint-java
[ https://issues.apache.org/jira/browse/SPARK-21875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21875. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19088 [https://github.com/apache/spark/pull/19088] > Jenkins passes Java code that violates ./dev/lint-java > -- > > Key: SPARK-21875 > URL: https://issues.apache.org/jira/browse/SPARK-21875 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Priority: Trivial > Fix For: 2.3.0 > > > Two recent PRs merged which caused lint-java errors: > {noformat} > > Running Java style checks > > Using `mvn` from path: /home/ubuntu/spark/build/apache-maven-3.5.0/bin/mvn > Checkstyle checks failed at following occurrences: > [ERROR] src/main/java/org/apache/spark/memory/TaskMemoryManager.java:[77] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[1340] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [error] running /home/ubuntu/spark/dev/lint-java ; received return code 1 > {noformat} > The first error is from https://github.com/apache/spark/pull/19025 and the > second is from https://github.com/apache/spark/pull/18488 > Should we be expecting Jenkins to enforce Java code style pre-commit? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20928) Continuous Processing Mode for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148227#comment-16148227 ] Michael Armbrust edited comment on SPARK-20928 at 8/30/17 11:52 PM: Hey everyone, thanks for your interest in this feature! I'm still targeting Spark 2.3, but unfortunately have been busy with other things since the summit. Will post more details on the design as soon as we have them! The Spark summit demo just showed a hacked-together prototype, but we need to do more to figure out how to best integrate it into Spark. was (Author: marmbrus): Hey everyone, thanks for your interest in this feature! I'm still targeting Spark 2.3, but unfortunately have been busy with other things since the summit. Will post more details on the design as soon as we have them! > Continuous Processing Mode for Structured Streaming > --- > > Key: SPARK-20928 > URL: https://issues.apache.org/jira/browse/SPARK-20928 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Michael Armbrust > > Given the current Source API, the minimum possible latency for any record is > bounded by the amount of time that it takes to launch a task. This > limitation is a result of the fact that {{getBatch}} requires us to know both > the starting and the ending offset, before any tasks are launched. In the > worst case, the end-to-end latency is actually closer to the average batch > time + task launching time. > For applications where latency is more important than exactly-once output > however, it would be useful if processing could happen continuously. This > would allow us to achieve fully pipelined reading and writing from sources > such as Kafka. This kind of architecture would make it possible to process > records with end-to-end latencies on the order of 1 ms, rather than the > 10-100ms that is possible today. > One possible architecture here would be to change the Source API to look like > the following rough sketch: > {code} > trait Epoch { > def data: DataFrame > /** The exclusive starting position for `data`. */ > def startOffset: Offset > /** The inclusive ending position for `data`. Incrementally updated > during processing, but not complete until execution of the query plan in > `data` is finished. */ > def endOffset: Offset > } > def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], > limits: Limits): Epoch > {code} > The above would allow us to build an alternative implementation of > {{StreamExecution}} that processes continuously with much lower latency and > only stops processing when needing to reconfigure the stream (either due to a > failure or a user requested change in parallelism. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20928) Continuous Processing Mode for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148227#comment-16148227 ] Michael Armbrust commented on SPARK-20928: -- Hey everyone, thanks for your interest in this feature! I'm still targeting Spark 2.3, but unfortunately have been busy with other things since the summit. Will post more details on the design as soon as we have them! > Continuous Processing Mode for Structured Streaming > --- > > Key: SPARK-20928 > URL: https://issues.apache.org/jira/browse/SPARK-20928 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Michael Armbrust > > Given the current Source API, the minimum possible latency for any record is > bounded by the amount of time that it takes to launch a task. This > limitation is a result of the fact that {{getBatch}} requires us to know both > the starting and the ending offset, before any tasks are launched. In the > worst case, the end-to-end latency is actually closer to the average batch > time + task launching time. > For applications where latency is more important than exactly-once output > however, it would be useful if processing could happen continuously. This > would allow us to achieve fully pipelined reading and writing from sources > such as Kafka. This kind of architecture would make it possible to process > records with end-to-end latencies on the order of 1 ms, rather than the > 10-100ms that is possible today. > One possible architecture here would be to change the Source API to look like > the following rough sketch: > {code} > trait Epoch { > def data: DataFrame > /** The exclusive starting position for `data`. */ > def startOffset: Offset > /** The inclusive ending position for `data`. Incrementally updated > during processing, but not complete until execution of the query plan in > `data` is finished. */ > def endOffset: Offset > } > def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], > limits: Limits): Epoch > {code} > The above would allow us to build an alternative implementation of > {{StreamExecution}} that processes continuously with much lower latency and > only stops processing when needing to reconfigure the stream (either due to a > failure or a user requested change in parallelism. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
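Since the Source API change above is only a rough sketch, the following Scala fragment is a purely illustrative sketch of how a continuous execution loop might drive it; {{Offset}}, {{Epoch}}, {{Limits}}, {{getBatch}} and the rest are placeholders built around the proposal text and do not correspond to any existing Spark API.
{code}
import org.apache.spark.sql.DataFrame

// Placeholder types mirroring the proposal above; none of these exist in Spark as written.
trait Offset
case class Limits(maxRecords: Long)   // the proposal leaves Limits unspecified

trait Epoch {
  def data: DataFrame
  def startOffset: Offset   // exclusive start, known up front
  def endOffset: Offset     // inclusive end, only final once the epoch finishes
}

trait ContinuousSource {
  // An open-ended read: endOffset = None would mean "keep reading until reconfigured".
  def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], limits: Limits): Epoch
}

// Minimal sketch of an alternative StreamExecution-style loop: each epoch is processed
// as it streams in, and the loop only stops when the stream must be reconfigured
// (a failure or a user-requested change in parallelism).
def runContinuously(
    source: ContinuousSource,
    process: DataFrame => Unit,
    needsReconfiguration: () => Boolean): Unit = {
  var lastCommitted: Option[Offset] = None
  while (!needsReconfiguration()) {
    val epoch = source.getBatch(lastCommitted, None, Limits(maxRecords = 1000000L))
    process(epoch.data)                    // no per-batch task launch on the hot path
    lastCommitted = Some(epoch.endOffset)  // endOffset is final once processing completes
  }
}
{code}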
[jira] [Resolved] (SPARK-21839) Support SQL config for ORC compression
[ https://issues.apache.org/jira/browse/SPARK-21839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21839. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19055 [https://github.com/apache/spark/pull/19055] > Support SQL config for ORC compression > --- > > Key: SPARK-21839 > URL: https://issues.apache.org/jira/browse/SPARK-21839 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun > Fix For: 2.3.0 > > > This issue aims to provide `spark.sql.orc.compression.codec` like > `spark.sql.parquet.compression.codec`. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
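A minimal usage sketch of the proposed option, assuming the final config key and codec names match the ones referenced in the issue ({{spark.sql.orc.compression.codec}}, mirroring {{spark.sql.parquet.compression.codec}}); the data and output paths are placeholders.
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-compression-demo").getOrCreate()

// Session-wide default codec for ORC output, analogous to the Parquet config.
spark.conf.set("spark.sql.orc.compression.codec", "zlib")

val df = spark.range(0, 1000).toDF("id")   // placeholder data
df.write.format("orc").save("/tmp/orc_zlib_demo")

// A per-write option, if given, would be expected to override the session default.
df.write.format("orc").option("compression", "snappy").save("/tmp/orc_snappy_demo")
{code}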
[jira] [Assigned] (SPARK-21839) Support SQL config for ORC compression
[ https://issues.apache.org/jira/browse/SPARK-21839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-21839: Assignee: Dongjoon Hyun > Support SQL config for ORC compression > --- > > Key: SPARK-21839 > URL: https://issues.apache.org/jira/browse/SPARK-21839 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun > Fix For: 2.3.0 > > > This issue aims to provide `spark.sql.orc.compression.codec` like > `spark.sql.parquet.compression.codec`. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148093#comment-16148093 ] Marcelo Vanzin commented on SPARK-18085: [~jincheng] do you have code that can reproduce that? The code in the exception hasn't really changed, so this is probably an artifact of how the new code is recording data from the application. For your description (when we clicking stages with no tasks successful) I haven't been able to reproduce this; a stage that has only failed tasks still renders fine with the code in my branch. > SPIP: Better History Server scalability for many / large applications > - > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Labels: SPIP > Attachments: spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21867) Support async spilling in UnsafeShuffleWriter
[ https://issues.apache.org/jira/browse/SPARK-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148090#comment-16148090 ] Reynold Xin commented on SPARK-21867: - This makes sense. The devil is in the details though (e.g. how complicated it is). Can you create a PR for your prototype code just to illustrate the implementation more? > Support async spilling in UnsafeShuffleWriter > - > > Key: SPARK-21867 > URL: https://issues.apache.org/jira/browse/SPARK-21867 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Priority: Minor > > Currently, Spark tasks are single-threaded. But we see that it could greatly > improve the performance of jobs if we can multi-thread some parts of them. > For example, profiling our map tasks, which read large amounts of data from > HDFS and spill to disk, we see that we are blocked on HDFS reads and spilling > the majority of the time. Since both of these operations are IO-intensive, the > average CPU consumption during the map phase is significantly low. In theory, > both HDFS read and spilling can be done in parallel if we had additional > memory to store data read from HDFS while we are spilling the last batch read. > Let's say we have 1G of shuffle memory available per task. Currently, in the case > of a map task, it reads from HDFS and the records are stored in the available > memory buffer. Once we hit the memory limit and there is no more space to > store the records, we sort and spill the content to disk. While we are > spilling to disk, since we do not have any available memory, we cannot read > from HDFS concurrently. > Here we propose supporting async spilling for UnsafeShuffleWriter, so that we > can keep reading from HDFS while the sort and spill happen > asynchronously. Let's say the total 1G of shuffle memory can be split into > two regions - an active region and a spilling region - each of size 500 MB. We > start with reading from HDFS and filling the active region. Once we hit the > limit of the active region, we issue an asynchronous spill, flipping the > active region and the spilling region. While the spill is happening > asynchronously, we still have 500 MB of memory available to read data > from HDFS. This way we can amortize the high disk/network IO cost during > spilling. > We made a prototype hack to implement this feature and we saw our map > tasks run as much as 40% faster. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
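The double-buffering idea above is only a proposal backed by a prototype hack, so here is a purely illustrative Scala sketch of it (two fixed-size regions that are flipped when the active one fills, with the spill running on a background thread); none of these names correspond to Spark's actual UnsafeShuffleWriter internals.
{code}
import java.util.concurrent.{Executors, Future}
import scala.collection.mutable.ArrayBuffer

// Toy stand-in for one of the two 500 MB regions described in the proposal.
final class Region(val capacityBytes: Long) {
  val records = new ArrayBuffer[Array[Byte]]()
  private var used = 0L
  def add(r: Array[Byte]): Unit = { records += r; used += r.length }
  def isFull: Boolean = used >= capacityBytes
  def clear(): Unit = { records.clear(); used = 0L }
}

// Illustrative double-buffered writer: ingest fills the active region while the
// previously filled region is sorted and spilled asynchronously.
final class DoubleBufferedWriter(regionBytes: Long) {
  private val spillPool = Executors.newSingleThreadExecutor()
  private var active = new Region(regionBytes)
  private var spilling = new Region(regionBytes)
  private var pendingSpill: Option[Future[_]] = None

  def insert(record: Array[Byte]): Unit = {
    if (active.isFull) {
      // Wait for the previous async spill (if any) so its region can be reused.
      pendingSpill.foreach(_.get())
      // Flip the regions: the full one becomes the spilling region.
      val tmp = spilling; spilling = active; active = tmp
      val toSpill = spilling
      pendingSpill = Some(spillPool.submit(new Runnable {
        override def run(): Unit = {
          // Placeholder for the expensive sort + disk write we want off the ingest path.
          val sorted = toSpill.records.sortBy(_.length)
          // ... write `sorted` to disk here ...
          toSpill.clear()
        }
      }))
    }
    active.add(record)   // ingest (e.g. the HDFS read) continues while the spill runs
  }

  def close(): Unit = { pendingSpill.foreach(_.get()); spillPool.shutdown() }
}
{code}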
[jira] [Updated] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-21866: -- Target Version/s: (was: 2.3.0) > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark.pdf > > > h2. Background and motivation > As Apache Spark is being used more and more in the industry, some new use > cases are emerging for different data formats beyond the traditional SQL > types or the numerical types (vectors and matrices). Deep Learning > applications commonly deal with image processing. A number of projects add > some Deep Learning capabilities to Spark (see list below), but they struggle > to communicate with each other or with MLlib pipelines because there is no > standard way to represent an image in Spark DataFrames. We propose to > federate efforts for representing images in Spark by defining a > representation that caters to the most common needs of users and library > developers. > This SPIP proposes a specification to represent images in Spark DataFrames > and Datasets (based on existing industrial standards), and an interface for > loading sources of images. It is not meant to be a full-fledged image > processing library, but rather the core description that other libraries and > users can rely on. Several packages already offer various processing > facilities for transforming images or doing more complex operations, and each > has various design tradeoffs that make them better as standalone solutions. > This project is a joint collaboration between Microsoft and Databricks, which > have been testing this design in two open source packages: MMLSpark and Deep > Learning Pipelines. > The proposed image format is an in-memory, decompressed representation that > targets low-level applications. It is significantly more liberal in memory > usage than compressed image representations such as JPEG, PNG, etc., but it > allows easy communication with popular image processing libraries and has no > decoding overhead. > h2. Targets users and personas: > Data scientists, data engineers, library developers. > The following libraries define primitives for loading and representing > images, and will gain from a common interchange format (in alphabetical > order): > * BigDL > * DeepLearning4J > * Deep Learning Pipelines > * MMLSpark > * TensorFlow (Spark connector) > * TensorFlowOnSpark > * TensorFrames > * Thunder > h2. Goals: > * Simple representation of images in Spark DataFrames, based on pre-existing > industrial standards (OpenCV) > * This format should eventually allow the development of high-performance > integration points with image processing libraries such as libOpenCV, Google > TensorFlow, CNTK, and other C libraries. > * The reader should be able to read popular formats of images from > distributed sources. > h2. Non-Goals: > Images are a versatile medium and encompass a very wide range of formats and > representations. 
This SPIP explicitly aims at the most common use case in the > industry currently: multi-channel matrices of binary, int32, int64, float or > double data that can fit comfortably in the heap of the JVM: > * the total size of an image should be restricted to less than 2GB (roughly) > * the meaning of color channels is application-specific and is not mandated > by the standard (in line with the OpenCV standard) > * specialized formats used in meteorology, the medical field, etc. are not > supported > * this format is specialized to images and does not attempt to solve the more > general problem of representing n-dimensional tensors in Spark > h2. Proposed API changes > We propose to add a new package in the package structure, under the MLlib > project: > {{org.apache.spark.image}} > h3. Data format > We propose to add the following structure: > imageSchema = StructType([ > * StructField("mode", StringType(), False), > ** The exact representation of the data. > ** The values are described in the following OpenCV convention. Basically, > the type has both "depth" and "number of channels" info: in particular, type > "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 > (value 32 in the table) with the channel order specified by convention. > ** The exact channel ordering and meaning of each channel is dictated by > convention. By default, the order is RGB (3 channels) and BGRA (4 channels). > If the image failed to load, the value is the empty string "". > * StructField("origin", StringType(), True), > ** Some information about the origin of the image. The
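The description above is cut off after the {{origin}} field, so the following Scala fragment is only a sketch of what such a schema could look like; only {{mode}} and {{origin}} are spelled out in the SPIP text here, and the remaining fields (dimensions, channel count, raw bytes) are assumptions added for illustration rather than the proposal's final layout.
{code}
import org.apache.spark.sql.types._

// Sketch of an OpenCV-style image schema. Only `mode` and `origin` come from the
// SPIP text above; the other fields are assumed for illustration.
val imageSchema: StructType = StructType(Seq(
  StructField("mode", StringType, nullable = false),       // e.g. "CV_8UC3" = 3-channel unsigned bytes
  StructField("origin", StringType, nullable = true),      // where the image was loaded from
  StructField("height", IntegerType, nullable = false),    // assumed
  StructField("width", IntegerType, nullable = false),     // assumed
  StructField("nChannels", IntegerType, nullable = false), // assumed
  StructField("data", BinaryType, nullable = false)        // assumed: decompressed pixel bytes
))
{code}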
[jira] [Resolved] (SPARK-21834) Incorrect executor request in case of dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-21834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-21834. Resolution: Fixed Assignee: Sital Kedia Fix Version/s: 2.3.0 2.2.1 2.1.2 > Incorrect executor request in case of dynamic allocation > > > Key: SPARK-21834 > URL: https://issues.apache.org/jira/browse/SPARK-21834 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Assignee: Sital Kedia > Fix For: 2.1.2, 2.2.1, 2.3.0 > > > The killExecutor API currently does not allow killing an executor without > updating the total number of executors needed. When dynamic allocation > is turned on and the allocator tries to kill an executor, the scheduler > reduces the total number of executors needed (see > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L635) > which is incorrect because the allocator already takes care of setting the > required number of executors itself. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
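To make the bookkeeping problem concrete, here is a toy Scala sketch; these are not Spark's scheduler classes, and the {{adjustTargetNumExecutors}} flag is a hypothetical illustration of the kind of separation described above. If the backend also decrements the target when the allocation manager has already lowered its own target, the cluster ends up requesting one executor too few.
{code}
// Toy model of the target-executor bookkeeping; none of these are real Spark classes.
class ToySchedulerBackend(var requestedTotal: Int) {
  def killExecutor(id: String, adjustTargetNumExecutors: Boolean): Unit = {
    if (adjustTargetNumExecutors) {
      requestedTotal -= 1   // shrink the cluster along with the kill
    }
    println(s"killed $id, backend now requests $requestedTotal executors")
  }
}

class ToyAllocationManager(backend: ToySchedulerBackend) {
  private var target = backend.requestedTotal

  def removeIdleExecutor(id: String): Unit = {
    target -= 1                       // the allocator already accounts for the removal...
    backend.requestedTotal = target   // ...and pushes its own target to the backend,
    // ...so the backend must not decrement the total a second time.
    backend.killExecutor(id, adjustTargetNumExecutors = false)
  }
}

object KillExecutorDemo extends App {
  val backend = new ToySchedulerBackend(requestedTotal = 10)
  new ToyAllocationManager(backend).removeIdleExecutor("exec-3")
  assert(backend.requestedTotal == 9)  // with the double decrement it would wrongly be 8
}
{code}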
[jira] [Created] (SPARK-21876) Idling Executors that never handled any tasks are not cleared from BlockManager after being removed
Julie Zhang created SPARK-21876: --- Summary: Idling Executors that never handled any tasks are not cleared from BlockManager after being removed Key: SPARK-21876 URL: https://issues.apache.org/jira/browse/SPARK-21876 Project: Spark Issue Type: Bug Components: Scheduler, Spark Core Affects Versions: 2.2.0, 1.6.3 Reporter: Julie Zhang This happens when 'spark.dynamicAllocation.enabled' is set to 'true'. We use Yarn as our resource manager. 1) Executor A is launched, but no task has been submitted to it; 2) After 'spark.dynamicAllocation.executorIdleTimeout' seconds, executor A will be removed. (ExecutorAllocationManager.scala schedule(): 294); 3) The scheduler gets notified that executor A has been lost; (in our case, YarnSchedulerBackend.scala: 209). In the TaskSchedulerImpl.scala method executorLost(executorId: String, reason: ExecutorLossReason), the assumption in the None case (TaskSchedulerImpl.scala: 548) that the executor has already been removed is not always valid. As a result, the DAGScheduler and BlockManagerMaster are never notified about the loss of executor A. When GC eventually happens, the ContextCleaner will try to clean up un-referenced objects. Because executor A was not removed from the blockManagerIdByExecutor map, BlockManagerMasterEndpoint will send out requests to clean the references to the non-existent executor, producing a lot of error messages like this in the driver log: ERROR [2017-08-08 00:00:23,596] org.apache.spark.network.client.TransportClient: Failed to send RPC xxx to xxx/xxx:x: java.nio.channels.ClosedChannelException -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
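A minimal Scala configuration sketch for the preconditions described in this report: dynamic allocation enabled with a short idle timeout, so that an executor that never receives a task is removed quickly. The specific values and app name are placeholders.
{code}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Dynamic allocation with a short idle timeout; on YARN this also requires the
// external shuffle service.
val conf = new SparkConf()
  .setAppName("idle-executor-repro")                          // placeholder
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")  // placeholder timeout
  .set("spark.shuffle.service.enabled", "true")

val spark = SparkSession.builder().config(conf).getOrCreate()
{code}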
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147987#comment-16147987 ] Apache Spark commented on SPARK-21728: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/19089 > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > Attachments: logging.patch, sparksubmit.patch > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21728: Assignee: Apache Spark (was: Marcelo Vanzin) > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > Fix For: 2.3.0 > > Attachments: logging.patch, sparksubmit.patch > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21728: Assignee: Marcelo Vanzin (was: Apache Spark) > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > Attachments: logging.patch, sparksubmit.patch > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21875) Jenkins passes Java code that violates ./dev/lint-java
[ https://issues.apache.org/jira/browse/SPARK-21875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21875: Assignee: (was: Apache Spark) > Jenkins passes Java code that violates ./dev/lint-java > -- > > Key: SPARK-21875 > URL: https://issues.apache.org/jira/browse/SPARK-21875 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Priority: Trivial > > Two recent PRs merged which caused lint-java errors: > {noformat} > > Running Java style checks > > Using `mvn` from path: /home/ubuntu/spark/build/apache-maven-3.5.0/bin/mvn > Checkstyle checks failed at following occurrences: > [ERROR] src/main/java/org/apache/spark/memory/TaskMemoryManager.java:[77] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[1340] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [error] running /home/ubuntu/spark/dev/lint-java ; received return code 1 > {noformat} > The first error is from https://github.com/apache/spark/pull/19025 and the > second is from https://github.com/apache/spark/pull/18488 > Should we be expecting Jenkins to enforce Java code style pre-commit? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21875) Jenkins passes Java code that violates ./dev/lint-java
[ https://issues.apache.org/jira/browse/SPARK-21875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147976#comment-16147976 ] Andrew Ash commented on SPARK-21875: I'd be interested in more details on why it can't be run in the PR builder -- I have the full `./dev/run-tests` running in CI and it catches things like this occasionally. PR at https://github.com/apache/spark/pull/19088 > Jenkins passes Java code that violates ./dev/lint-java > -- > > Key: SPARK-21875 > URL: https://issues.apache.org/jira/browse/SPARK-21875 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Priority: Trivial > > Two recent PRs merged which caused lint-java errors: > {noformat} > > Running Java style checks > > Using `mvn` from path: /home/ubuntu/spark/build/apache-maven-3.5.0/bin/mvn > Checkstyle checks failed at following occurrences: > [ERROR] src/main/java/org/apache/spark/memory/TaskMemoryManager.java:[77] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[1340] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [error] running /home/ubuntu/spark/dev/lint-java ; received return code 1 > {noformat} > The first error is from https://github.com/apache/spark/pull/19025 and the > second is from https://github.com/apache/spark/pull/18488 > Should we be expecting Jenkins to enforce Java code style pre-commit? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21875) Jenkins passes Java code that violates ./dev/lint-java
[ https://issues.apache.org/jira/browse/SPARK-21875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21875: Assignee: Apache Spark > Jenkins passes Java code that violates ./dev/lint-java > -- > > Key: SPARK-21875 > URL: https://issues.apache.org/jira/browse/SPARK-21875 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Assignee: Apache Spark >Priority: Trivial > > Two recent PRs merged which caused lint-java errors: > {noformat} > > Running Java style checks > > Using `mvn` from path: /home/ubuntu/spark/build/apache-maven-3.5.0/bin/mvn > Checkstyle checks failed at following occurrences: > [ERROR] src/main/java/org/apache/spark/memory/TaskMemoryManager.java:[77] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[1340] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [error] running /home/ubuntu/spark/dev/lint-java ; received return code 1 > {noformat} > The first error is from https://github.com/apache/spark/pull/19025 and the > second is from https://github.com/apache/spark/pull/18488 > Should we be expecting Jenkins to enforce Java code style pre-commit? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21875) Jenkins passes Java code that violates ./dev/lint-java
[ https://issues.apache.org/jira/browse/SPARK-21875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147975#comment-16147975 ] Apache Spark commented on SPARK-21875: -- User 'ash211' has created a pull request for this issue: https://github.com/apache/spark/pull/19088 > Jenkins passes Java code that violates ./dev/lint-java > -- > > Key: SPARK-21875 > URL: https://issues.apache.org/jira/browse/SPARK-21875 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Priority: Trivial > > Two recent PRs merged which caused lint-java errors: > {noformat} > > Running Java style checks > > Using `mvn` from path: /home/ubuntu/spark/build/apache-maven-3.5.0/bin/mvn > Checkstyle checks failed at following occurrences: > [ERROR] src/main/java/org/apache/spark/memory/TaskMemoryManager.java:[77] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[1340] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [error] running /home/ubuntu/spark/dev/lint-java ; received return code 1 > {noformat} > The first error is from https://github.com/apache/spark/pull/19025 and the > second is from https://github.com/apache/spark/pull/18488 > Should we be expecting Jenkins to enforce Java code style pre-commit? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-21728: Attachment: logging.patch sparksubmit.patch > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > Attachments: logging.patch, sparksubmit.patch > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147957#comment-16147957 ] Jacek Laskowski commented on SPARK-21728: - After I modified your change, I could see the logs again. No idea whether my changes make sense or not, but I see the logs and that counts :) I'm attaching my changes in case they help you somehow. > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reopened SPARK-21728: > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147944#comment-16147944 ] Marcelo Vanzin commented on SPARK-21728: Ok, with your file and the streaming example from Spark docs I see it's picking up the wrong log level. I also ran into a separate issue with my patch; let me re-open this and fix these issues. > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21714) SparkSubmit in Yarn Client mode downloads remote files and then reuploads them again
[ https://issues.apache.org/jira/browse/SPARK-21714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-21714: --- Fix Version/s: 2.2.1 > SparkSubmit in Yarn Client mode downloads remote files and then reuploads > them again > > > Key: SPARK-21714 > URL: https://issues.apache.org/jira/browse/SPARK-21714 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.2.0 >Reporter: Thomas Graves >Assignee: Saisai Shao >Priority: Critical > Fix For: 2.2.1, 2.3.0 > > > SPARK-10643 added the ability for spark-submit to download remote file in > client mode. > However in yarn mode this introduced a bug where it downloads them for the > client but then yarn client just reuploads them to HDFS and uses them again. > This should not happen when the remote file is HDFS. This is wasting > resources and its defeating the distributed cache because if the original > object was public it would have been shared by many users. By us downloading > and reuploading, it becomes private. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21875) Jenkins passes Java code that violates ./dev/lint-java
[ https://issues.apache.org/jira/browse/SPARK-21875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21875: -- Priority: Trivial (was: Major) Issue Type: Improvement (was: Bug) (Certainly not a Major Bug) Yes, for Historical Reasons I can't find a link to, we can't run it in the PR builder. We just fix these periodically. Go ahead. > Jenkins passes Java code that violates ./dev/lint-java > -- > > Key: SPARK-21875 > URL: https://issues.apache.org/jira/browse/SPARK-21875 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Priority: Trivial > > Two recent PRs merged which caused lint-java errors: > {noformat} > > Running Java style checks > > Using `mvn` from path: /home/ubuntu/spark/build/apache-maven-3.5.0/bin/mvn > Checkstyle checks failed at following occurrences: > [ERROR] src/main/java/org/apache/spark/memory/TaskMemoryManager.java:[77] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[1340] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [error] running /home/ubuntu/spark/dev/lint-java ; received return code 1 > {noformat} > The first error is from https://github.com/apache/spark/pull/19025 and the > second is from https://github.com/apache/spark/pull/18488 > Should we be expecting Jenkins to enforce Java code style pre-commit? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21875) Jenkins passes Java code that violates ./dev/lint-java
Andrew Ash created SPARK-21875: -- Summary: Jenkins passes Java code that violates ./dev/lint-java Key: SPARK-21875 URL: https://issues.apache.org/jira/browse/SPARK-21875 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.2.0 Reporter: Andrew Ash Two recent PRs merged which caused lint-java errors: {noformat} Running Java style checks Using `mvn` from path: /home/ubuntu/spark/build/apache-maven-3.5.0/bin/mvn Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/memory/TaskMemoryManager.java:[77] (sizes) LineLength: Line is longer than 100 characters (found 106). [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[1340] (sizes) LineLength: Line is longer than 100 characters (found 106). [error] running /home/ubuntu/spark/dev/lint-java ; received return code 1 {noformat} The first error is from https://github.com/apache/spark/pull/19025 and the second is from https://github.com/apache/spark/pull/18488 Should we be expecting Jenkins to enforce Java code style pre-commit? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21841) Spark SQL doesn't pick up column added in hive when table created with saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-21841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147873#comment-16147873 ] Marcelo Vanzin commented on SPARK-21841: Good to know there's a way to say "I want a proper Hive table" in 2.2, even if the API is a little confusing for the user. Too many people just use {{saveAsTable}} without really understanding what it means for Hive compatibility. It might even make more sense not to try to save a Hive-compatible table for other formats, although that might have backwards compatibility issues. > Spark SQL doesn't pick up column added in hive when table created with > saveAsTable > -- > > Key: SPARK-21841 > URL: https://issues.apache.org/jira/browse/SPARK-21841 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Thomas Graves > > If you create a table in Spark SQL but then modify the table in Hive to > add a column, Spark SQL doesn't pick up the new column. > Basic example: > {code} > t1 = spark.sql("select ip_address from mydb.test_table limit 1") > t1.show() > ++ > | ip_address| > ++ > |1.30.25.5| > ++ > t1.write.saveAsTable('mydb.t1') > In Hive: > alter table mydb.t1 add columns (bcookie string) > t1 = spark.table("mydb.t1") > t1.show() > ++ > | ip_address| > ++ > |1.30.25.5| > ++ > {code} > It looks like that's because Spark SQL is picking up the schema from > spark.sql.sources.schema.part.0 rather than from Hive. > Interestingly enough, it appears that if you create the table differently, like: > spark.sql("create table mydb.t1 select ip_address from mydb.test_table limit > 1") > Run your alter table on mydb.t1 > val t1 = spark.table("mydb.t1") > Then it works properly. > It looks like the difference is that when it doesn't work, > spark.sql.sources.provider=parquet is set. > It's doing this from createDataSourceTable where the provider is parquet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
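A hedged Scala sketch of the two write paths being contrasted here; the {{format("hive")}} call is assumed to be the "2.2 way to ask for a proper Hive table" alluded to above, and the table and column names are taken from the report.
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// As in the report above: a DataFrame backed by an existing table.
val t1 = spark.sql("select ip_address from mydb.test_table limit 1")

// Data-source path: the schema is kept in table properties
// (spark.sql.sources.schema.*), so a later Hive-side ALTER TABLE ... ADD COLUMNS
// is not visible to Spark.
t1.write.saveAsTable("mydb.t1")

// Hive-compatible path (assumed syntax): the schema lives in the Hive metastore,
// so columns added from Hive should show up in Spark as well.
t1.write.format("hive").saveAsTable("mydb.t1_hive")
{code}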
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147840#comment-16147840 ] Jacek Laskowski commented on SPARK-21728: - The idea behind the custom {{conf/log4j.properties}} is to disable all the logging and enable only {{org.apache.spark.sql.execution.streaming}} currently. {code} $ cat conf/log4j.properties # Set everything to be logged to the console log4j.rootCategory=OFF, console log4j.appender.console=org.apache.log4j.ConsoleAppender log4j.appender.console.target=System.err log4j.appender.console.layout=org.apache.log4j.PatternLayout log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n # Set the default spark-shell log level to WARN. When running the spark-shell, the # log level for this class is used to overwrite the root logger's log level, so that # the user can have different defaults for the shell and regular Spark apps. log4j.logger.org.apache.spark.repl.Main=WARN # Settings to quiet third party logs that are too verbose log4j.logger.org.spark_project.jetty=WARN log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO log4j.logger.org.apache.parquet=ERROR log4j.logger.parquet=ERROR # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR #log4j.logger.org.apache.spark=OFF log4j.logger.org.apache.spark.metrics.MetricsSystem=WARN # Structured Streaming log4j.logger.org.apache.spark.sql.execution.streaming.StreamExecution=DEBUG log4j.logger.org.apache.spark.sql.execution.streaming.ProgressReporter=INFO log4j.logger.org.apache.spark.sql.execution.streaming.RateStreamSource=DEBUG log4j.logger.org.apache.spark.sql.kafka010.KafkaSource=DEBUG log4j.logger.org.apache.spark.sql.kafka010.KafkaOffsetReader=DEBUG log4j.logger.org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec=INFO {code} > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. 
For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21865) simplify the distribution semantic of Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-21865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-21865: Summary: simplify the distribution semantic of Spark SQL (was: remove Partitioning.compatibleWith) > simplify the distribution semantic of Spark SQL > --- > > Key: SPARK-21865 > URL: https://issues.apache.org/jira/browse/SPARK-21865 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147655#comment-16147655 ] Marcelo Vanzin commented on SPARK-21728: Can you share your whole command line? I just added this to my {{conf/log4j.properties}}: {noformat} log4j.logger.org.apache.spark.deploy=DEBUG {noformat} And ran: {noformat} ./bin/spark-shell --master 'local-cluster[1,1,1024]' {noformat} And I see the log messages that I could not see without the custom setting. > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147643#comment-16147643 ] Jacek Laskowski commented on SPARK-21728: - Thanks [~vanzin] for the prompt response! I'm stuck with the change as {{conf/log4j.properties}} has no effect on logging and given the change touched it I think it's the root cause (I might be mistaken, but looking for help to find it). The following worked fine two days ago (not sure about yesterday's build). Is {{conf/log4j.properties}} still the file for logging? {code} # Structured Streaming log4j.logger.org.apache.spark.sql.execution.streaming.StreamExecution=DEBUG log4j.logger.org.apache.spark.sql.execution.streaming.ProgressReporter=INFO log4j.logger.org.apache.spark.sql.execution.streaming.RateStreamSource=DEBUG log4j.logger.org.apache.spark.sql.kafka010.KafkaSource=DEBUG log4j.logger.org.apache.spark.sql.kafka010.KafkaOffsetReader=DEBUG {code} > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147617#comment-16147617 ] Marcelo Vanzin commented on SPARK-21728: What is user-visible here (other than potentially some more log messages popping up)? Logging should work just like before as far as the user is concerned. Their customized logging configuration and everything should still work; if that doesn't work it's a bug (although I remember testing that). > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21055) Support grouping__id
[ https://issues.apache.org/jira/browse/SPARK-21055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147433#comment-16147433 ] Apache Spark commented on SPARK-21055: -- User 'cenyuhai' has created a pull request for this issue: https://github.com/apache/spark/pull/19087 > Support grouping__id > > > Key: SPARK-21055 > URL: https://issues.apache.org/jira/browse/SPARK-21055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: spark2.1.1 >Reporter: cen yuhai > > Currently, Spark doesn't support grouping__id; Spark provides another function, > grouping_id(), as a workaround. > If we have to use grouping_id(), many scripts need to change, and supporting > grouping__id is very easy, so why not? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
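A small sketch of the workaround mentioned above, using Spark's existing {{grouping_id()}} and aliasing it so downstream scripts that expect a grouping__id column keep working; the {{employees}} data is hypothetical.
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("grouping-id-demo").getOrCreate()
import spark.implicits._

// Hypothetical data standing in for a Hive table.
Seq(("eng", "f"), ("eng", "m"), ("hr", "f"))
  .toDF("dept", "gender")
  .createOrReplaceTempView("employees")

spark.sql("""
  SELECT dept, gender,
         grouping_id() AS grouping__id,   -- Spark's replacement for Hive's GROUPING__ID
         count(*) AS cnt
  FROM employees
  GROUP BY dept, gender GROUPING SETS ((dept, gender), (dept), ())
""").show()
{code}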
[jira] [Commented] (SPARK-21874) Support changing database when rename table.
[ https://issues.apache.org/jira/browse/SPARK-21874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147429#comment-16147429 ] Apache Spark commented on SPARK-21874: -- User 'jinxing64' has created a pull request for this issue: https://github.com/apache/spark/pull/19086 > Support changing database when rename table. > > > Key: SPARK-21874 > URL: https://issues.apache.org/jira/browse/SPARK-21874 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: jin xing > > Changing the database of a table by renaming it is widely used in `Hive`. We can try to add > this function. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
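An illustration of the Hive behaviour this issue wants to mirror: a rename whose target lives in a different database. At the time of this report Spark rejects such a statement, so the snippet below shows the desired end state rather than something current Spark accepts; the database and table names are placeholders.
{code}
// Desired behaviour (as in Hive): move `logs` from the staging database to prod
// by renaming it. Before this change, Spark's ALTER TABLE ... RENAME TO fails
// when the target database differs from the source database.
// Assumes an active SparkSession `spark` with Hive support.
spark.sql("ALTER TABLE staging.logs RENAME TO prod.logs")
{code}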
[jira] [Assigned] (SPARK-21874) Support changing database when rename table.
[ https://issues.apache.org/jira/browse/SPARK-21874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21874: Assignee: (was: Apache Spark) > Support changing database when rename table. > > > Key: SPARK-21874 > URL: https://issues.apache.org/jira/browse/SPARK-21874 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: jin xing > > Changing a table's database by renaming it is widely used in `Hive`. We can try to add > this function. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21874) Support changing database when rename table.
[ https://issues.apache.org/jira/browse/SPARK-21874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21874: Assignee: Apache Spark > Support changing database when rename table. > > > Key: SPARK-21874 > URL: https://issues.apache.org/jira/browse/SPARK-21874 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: jin xing >Assignee: Apache Spark > > Changing a table's database by renaming it is widely used in `Hive`. We can try to add > this function. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21453) Cached Kafka consumer may be closed too early
[ https://issues.apache.org/jira/browse/SPARK-21453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147419#comment-16147419 ] Pablo Panero commented on SPARK-21453: -- [~zsxwing] Failed again but not even there is conext around it. Will try running in local for long. Maybe Im missing something {code} 17/08/29 04:47:45 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. 17/08/29 04:47:45 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. 17/08/29 04:47:45 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. 17/08/29 10:41:13 WARN SslTransportLayer: Failed to send SSL Close message java.io.IOException: Broken pipe at sun.nio.ch.FileDispatcherImpl.write0(Native Method) at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) at sun.nio.ch.IOUtil.write(IOUtil.java:65) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) at org.apache.kafka.common.network.SslTransportLayer.flush(SslTransportLayer.java:195) at org.apache.kafka.common.network.SslTransportLayer.close(SslTransportLayer.java:163) at org.apache.kafka.common.utils.Utils.closeAll(Utils.java:731) at org.apache.kafka.common.network.KafkaChannel.close(KafkaChannel.java:54) at org.apache.kafka.common.network.Selector.doClose(Selector.java:540) at org.apache.kafka.common.network.Selector.close(Selector.java:531) at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:378) at org.apache.kafka.common.network.Selector.poll(Selector.java:303) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:349) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:226) at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1047) at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:995) at org.apache.spark.sql.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:298) at org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:206) at org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:117) at org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106) at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:85) at org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68) at org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106) at org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:157) at org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:148) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at 
org.apache.spark.sql.kafka010.KafkaWriteTask.execute(KafkaWriteTask.scala:47) at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$1.apply$mcV$sp(KafkaWriter.scala:91) at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$1.apply(KafkaWriter.scala:91) at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$1.apply(KafkaWriter.scala:91) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337) at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(KafkaWriter.scala:91) at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(KafkaWriter.scala:89) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) at org
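Since the log above is truncated, here is a minimal, hedged sketch of the kind of Kafka-to-Kafka Structured Streaming job whose read path (CachedKafkaConsumer) and write path (KafkaWriteTask) both appear in the stack trace; brokers, topics, and the checkpoint location are placeholders.
{code}
// Placeholder servers, topics and paths -- a minimal query of the shape above.
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9093")
  .option("subscribe", "input-topic")
  .load()

val query = input
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9093")
  .option("topic", "output-topic")
  .option("checkpointLocation", "/tmp/checkpoints/kafka-copy")
  .start()
{code}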
[jira] [Created] (SPARK-21874) Support changing database when rename table.
jin xing created SPARK-21874: Summary: Support changing database when rename table. Key: SPARK-21874 URL: https://issues.apache.org/jira/browse/SPARK-21874 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: jin xing Changing a table's database by renaming it is widely used in `Hive`. We can try to add this function. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18859) Catalyst codegen does not mark column as nullable when it should. Causes NPE
[ https://issues.apache.org/jira/browse/SPARK-18859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147393#comment-16147393 ] Jonas Fonseca commented on SPARK-18859: --- For postgresql using {{nullif}} can be used as a workaround: {noformat} ( select t.id, t.age, nullif(j.name, null) as name from masterdata.testtable t left join masterdata.jointable j on t.id = j.id ) as testtable; {noformat} Tested with org.postgresql:postgresql:42.0.0 > Catalyst codegen does not mark column as nullable when it should. Causes NPE > > > Key: SPARK-18859 > URL: https://issues.apache.org/jira/browse/SPARK-18859 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.0.2 >Reporter: Mykhailo Osypov >Priority: Critical > > When joining two tables via LEFT JOIN, columns in right table may be NULLs, > however catalyst codegen cannot recognize it. > Example: > {code:title=schema.sql|borderStyle=solid} > create table masterdata.testtable( > id int not null, > age int > ); > create table masterdata.jointable( > id int not null, > name text not null > ); > {code} > {code:title=query_to_select.sql|borderStyle=solid} > (select t.id, t.age, j.name from masterdata.testtable t left join > masterdata.jointable j on t.id = j.id) as testtable; > {code} > {code:title=master code|borderStyle=solid} > val df = sqlContext > .read > .format("jdbc") > .option("dbTable", "query to select") > > .load > //df generated schema > /* > root > |-- id: integer (nullable = false) > |-- age: integer (nullable = true) > |-- name: string (nullable = false) > */ > {code} > {code:title=Codegen|borderStyle=solid} > /* 038 */ scan_rowWriter.write(0, scan_value); > /* 039 */ > /* 040 */ if (scan_isNull1) { > /* 041 */ scan_rowWriter.setNullAt(1); > /* 042 */ } else { > /* 043 */ scan_rowWriter.write(1, scan_value1); > /* 044 */ } > /* 045 */ > /* 046 */ scan_rowWriter.write(2, scan_value2); > {code} > Since *j.name* is from right table of *left join* query, it may be null. 
> However generated schema doesn't think so (probably because it defined as > *name text not null*) > {code:title=StackTrace|borderStyle=solid} > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.debug.package$DebugExec$$anonfun$3$$anon$1.hasNext(package.scala:146) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1763) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
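A hedged sketch of how the {{nullif}} workaround above plugs into the JDBC reader: the whole subquery, including the {{nullif(...)}} rewrite, is passed as the dbtable option. The connection URL is a placeholder; driver and credential options are omitted.
{code}
val query =
  """(select t.id, t.age, nullif(j.name, null) as name
    |   from masterdata.testtable t
    |   left join masterdata.jointable j on t.id = j.id) as testtable""".stripMargin

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")  // placeholder URL
  .option("dbtable", query)
  .load()
{code}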
[jira] [Resolved] (SPARK-21469) Add doc and example for FeatureHasher
[ https://issues.apache.org/jira/browse/SPARK-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-21469. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19024 [https://github.com/apache/spark/pull/19024] > Add doc and example for FeatureHasher > - > > Key: SPARK-21469 > URL: https://issues.apache.org/jira/browse/SPARK-21469 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath > Fix For: 2.3.0 > > > Add examples and user guide section for {{FeatureHasher}} in SPARK-13969 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
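For reference, a short example of the usage the new guide section covers; the column names and rows follow the style of the ML feature examples and are otherwise made up.
{code}
import org.apache.spark.ml.feature.FeatureHasher

val dataset = spark.createDataFrame(Seq(
  (2.2, true, "1", "foo"),
  (3.3, false, "2", "bar")
)).toDF("real", "bool", "stringNum", "string")

val hasher = new FeatureHasher()
  .setInputCols("real", "bool", "stringNum", "string")
  .setOutputCol("features")

hasher.transform(dataset).show(false)
{code}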
[jira] [Assigned] (SPARK-21469) Add doc and example for FeatureHasher
[ https://issues.apache.org/jira/browse/SPARK-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-21469: -- Assignee: Bryan Cutler > Add doc and example for FeatureHasher > - > > Key: SPARK-21469 > URL: https://issues.apache.org/jira/browse/SPARK-21469 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Assignee: Bryan Cutler > Fix For: 2.3.0 > > > Add examples and user guide section for {{FeatureHasher}} in SPARK-13969 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18350) Support session local timezone
[ https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136567#comment-16136567 ] Vinayak edited comment on SPARK-18350 at 8/30/17 12:56 PM: --- [~ueshin] I have set the below value to set the timeZone to UTC. It is adding the current timeZone value even though it is in the UTC format. spark.conf.set("spark.sql.session.timeZone", "UTC") Find the attached csv data for reference. Expected : Time should remain same as the input since it's already in UTC format var df1 = spark.read.option("delimiter", ",").option("qualifier", "\"").option("inferSchema","true").option("header", "true").option("mode", "PERMISSIVE").option("timestampFormat","MM/dd/'T'HH:mm:ss.SSS").option("dateFormat", "MM/dd/'T'HH:mm:ss").csv("DateSpark.csv"); df1: org.apache.spark.sql.DataFrame = [Name: string, Age: int ... 5 more fields] scala> df1.show(false); ++---++---+---+--+---+ |Name|Age|Add |Date |SparkDate |SparkDate1 |SparkDate2 | ++---++---+---+--+---+ |abc |21 |bvxc|04/22/2017T03:30:02|2017-03-21 03:30:02|2017-03-21 09:00:02.02|2017-03-21 05:30:00| ++---++---+---+--+---+ was (Author: vinayaksgadag): [~ueshin] I have set the below value to set the timeZone to UTC. It is adding the current timeZone value even though it is in the UTC format. spark.conf.set("spark.sql.session.timeZone", "UTC") Find the attached csv data for reference. Expected : Time should remain same as the input since it's already in UTC format var df1 = spark.read.option("delimiter", ",").option("qualifier", "\"").option("inferSchema","true").option("header", "true").option("mode", "PERMISSIVE").option("timestampFormat","MM/dd/'T'HH:mm:ss.SSS").option("dateFormat", "MM/dd/'T'HH:mm:ss").csv("DateSpark.csv"); df1: org.apache.spark.sql.DataFrame = [Name: string, Age: int ... 5 more fields] scala> df1.show(false); -- Name Age Add Date SparkDate SparkDate1 SparkDate2 -- abc 21 bvxc 04/22/2017T03:30:02 2017-03-21 03:30:02 2017-03-21 09:00:02.02 2017-03-21 05:30:00 -- > Support session local timezone > -- > > Key: SPARK-18350 > URL: https://issues.apache.org/jira/browse/SPARK-18350 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Takuya Ueshin > Labels: releasenotes > Fix For: 2.2.0 > > Attachments: sample.csv > > > As of Spark 2.1, Spark SQL assumes the machine timezone for datetime > manipulation, which is bad if users are not in the same timezones as the > machines, or if different users have different timezones. > We should introduce a session local timezone setting that is used for > execution. > An explicit non-goal is locale handling. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
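For anyone trying to reproduce this, a small hedged illustration of what the session-local timezone setting is expected to do: the same instant renders differently depending on the setting, because timestamp-to-string conversion uses the session timezone rather than the JVM default. The CSV timestampFormat path in the comment above is a separate code path, which is what the reporter is questioning.
{code}
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("SELECT from_unixtime(0) AS ts").show(false)        // 1970-01-01 00:00:00

spark.conf.set("spark.sql.session.timeZone", "Asia/Kolkata")
spark.sql("SELECT from_unixtime(0) AS ts").show(false)        // 1970-01-01 05:30:00
{code}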
[jira] [Comment Edited] (SPARK-18350) Support session local timezone
[ https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136567#comment-16136567 ] Vinayak edited comment on SPARK-18350 at 8/30/17 12:55 PM: --- [~ueshin] I have set the below value to set the timeZone to UTC. It is adding the current timeZone value even though it is in the UTC format. spark.conf.set("spark.sql.session.timeZone", "UTC") Find the attached csv data for reference. Expected : Time should remain same as the input since it's already in UTC format var df1 = spark.read.option("delimiter", ",").option("qualifier", "\"").option("inferSchema","true").option("header", "true").option("mode", "PERMISSIVE").option("timestampFormat","MM/dd/'T'HH:mm:ss.SSS").option("dateFormat", "MM/dd/'T'HH:mm:ss").csv("DateSpark.csv"); df1: org.apache.spark.sql.DataFrame = [Name: string, Age: int ... 5 more fields] scala> df1.show(false); -- Name Age Add Date SparkDate SparkDate1 SparkDate2 -- abc 21 bvxc 04/22/2017T03:30:02 2017-03-21 03:30:02 2017-03-21 09:00:02.02 2017-03-21 05:30:00 -- was (Author: vinayaksgadag): [~ueshin] I have set the below value to set the timeZone to UTC. It is adding the current timeZone value even though it is in the UTC format. spark.conf.set("spark.sql.session.timeZone", "UTC") Find the attached csv data for reference. Expected : Time should remain same as the input since it's already in UTC format var df1 = spark.read.option("delimiter", ",").option("qualifier", "\"").option("inferSchema","true").option("header", "true").option("mode", "PERMISSIVE").option("timestampFormat","MM/dd/'T'HH:mm:ss.SSS").option("dateFormat", "MM/dd/'T'HH:mm:ss").csv("DateSpark.csv"); df1: org.apache.spark.sql.DataFrame = [Name: string, Age: int ... 5 more fields] scala> df1.show(false); -- Name Age Add Date SparkDate SparkDate1 SparkDate2 -- abc 21 bvxc 04/22/2017T03:30:02 2017-03-21 03:30:02 2017-03-21 09:00:02.02 2017-03-21 05:30:00 > Support session local timezone > -- > > Key: SPARK-18350 > URL: https://issues.apache.org/jira/browse/SPARK-18350 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Takuya Ueshin > Labels: releasenotes > Fix For: 2.2.0 > > Attachments: sample.csv > > > As of Spark 2.1, Spark SQL assumes the machine timezone for datetime > manipulation, which is bad if users are not in the same timezones as the > machines, or if different users have different timezones. > We should introduce a session local timezone setting that is used for > execution. > An explicit non-goal is locale handling. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21764) Tests failures on Windows: resources not being closed and incorrect paths
[ https://issues.apache.org/jira/browse/SPARK-21764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-21764: Assignee: Hyukjin Kwon > Tests failures on Windows: resources not being closed and incorrect paths > - > > Key: SPARK-21764 > URL: https://issues.apache.org/jira/browse/SPARK-21764 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.3.0 > > > This is actually a clone of https://issues.apache.org/jira/browse/SPARK-18922 > but decided to open another one here, targeting 2.3.0 as fixed version. > In short, there are many test failures on Windows, mainly due to resources > not being closed but attempted to be removed (this is failed on Windows) and > incorrect path inputs. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21764) Tests failures on Windows: resources not being closed and incorrect paths
[ https://issues.apache.org/jira/browse/SPARK-21764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21764. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18971 [https://github.com/apache/spark/pull/18971] > Tests failures on Windows: resources not being closed and incorrect paths > - > > Key: SPARK-21764 > URL: https://issues.apache.org/jira/browse/SPARK-21764 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > Fix For: 2.3.0 > > > This is actually a clone of https://issues.apache.org/jira/browse/SPARK-18922 > but decided to open another one here, targeting 2.3.0 as fixed version. > In short, there are many test failures on Windows, mainly due to resources > not being closed but attempted to be removed (this is failed on Windows) and > incorrect path inputs. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16643) When doing Shuffle, report "java.io.FileNotFoundException"
[ https://issues.apache.org/jira/browse/SPARK-16643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16643. --- Resolution: Won't Fix If this only affects 1.5/1.6, I think it'd be WontFix at this point. Update to Spark 2. > When doing Shuffle, report "java.io.FileNotFoundException" > -- > > Key: SPARK-16643 > URL: https://issues.apache.org/jira/browse/SPARK-16643 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2 > Environment: LSB Version: > :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch > Distributor ID: CentOS > Description: CentOS release 6.6 (Final) > Release: 6.6 > Codename: Final > java version "1.7.0_10" > Java(TM) SE Runtime Environment (build 1.7.0_10-b18) > Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode) >Reporter: Deng Changchun > > In our spark cluster of standalone mode, we execute some SQLs on SparkSQL, > such some aggregate sqls as "select count(rowKey) from HVRC_B_LOG where 1=1 > and RESULTTIME >= 146332800 and RESULTTIME <= 1463414399000" > at the begining all is good, however after about 15 days, when execute the > aggreate sqls, it will report error, the log looks like: > 【Notice: > it is very strange that it won't report error every time when executing > aggreate sql, let's say random, after executing some aggregate sqls, it will > log error by chance.】 > 2016-07-20 13:48:50,250 ERROR [Executor task launch worker-75] > executor.Executor: Managed memory leak detected; size = 8388608 bytes, TID = > 624 > 2016-07-20 13:48:50,250 ERROR [Executor task launch worker-75] > executor.Executor: Exception in task 0.3 in stage 580.0 (TID 624) > java.io.FileNotFoundException: > /tmp/spark-cb199fce-bb80-4e6f-853f-4d7984bf5f34/executor-fb7c2149-c6c4-4697-ba2f-3b53dcd7f34a/blockmgr-0a9003ad-23b3-4ff5-b76f-6fbc5d71e730/3e/temp_shuffle_ef68b340-85e4-483c-90e8-5e8c8d8ee4ee > (没有那个文件或目录) > at java.io.FileOutputStream.open(Native Method) > at java.io.FileOutputStream.(FileOutputStream.java:212) > at > org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:110) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16643) When doing Shuffle, report "java.io.FileNotFoundException"
[ https://issues.apache.org/jira/browse/SPARK-16643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147139#comment-16147139 ] Sajith Dimal commented on SPARK-16643: -- We observed this in spark version 1.6.2 as well, please find the bellow error log: TID: [-1] [] [2017-08-01 22:05:16,768] ERROR {org.apache.spark.executor.Executor} - Exception in task 0.0 in stage 6.0 (TID 6) {org.apache.spark.executor.Executor} java.io.FileNotFoundException: /tmp/blockmgr-d44d050f-8727-4f96-83f5-69e3281d7aa5/39/temp_shuffle_3145a66b-1823-4c82-a7ac-2ac55fd5726e (Stale file handle) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.(FileOutputStream.java:206) at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) > When doing Shuffle, report "java.io.FileNotFoundException" > -- > > Key: SPARK-16643 > URL: https://issues.apache.org/jira/browse/SPARK-16643 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2 > Environment: LSB Version: > :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch > Distributor ID: CentOS > Description: CentOS release 6.6 (Final) > Release: 6.6 > Codename: Final > java version "1.7.0_10" > Java(TM) SE Runtime Environment (build 1.7.0_10-b18) > Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode) >Reporter: Deng Changchun > > In our spark cluster of standalone mode, we execute some SQLs on SparkSQL, > such some aggregate sqls as "select count(rowKey) from HVRC_B_LOG where 1=1 > and RESULTTIME >= 146332800 and RESULTTIME <= 1463414399000" > at the begining all is good, however after about 15 days, when execute the > aggreate sqls, it will report error, the log looks like: > 【Notice: > it is very strange that it won't report error every time when executing > aggreate sql, let's say random, after executing some aggregate sqls, it will > log error by chance.】 > 2016-07-20 13:48:50,250 ERROR [Executor task launch worker-75] > executor.Executor: Managed memory leak detected; size = 8388608 bytes, TID = > 624 > 2016-07-20 13:48:50,250 ERROR [Executor task launch worker-75] > executor.Executor: Exception in task 0.3 in stage 580.0 (TID 624) > java.io.FileNotFoundException: > /tmp/spark-cb199fce-bb80-4e6f-853f-4d7984bf5f34/executor-fb7c2149-c6c4-4697-ba2f-3b53dcd7f34a/blockmgr-0a9003ad-23b3-4ff5-b76f-6fbc5d71e730/3e/temp_shuffle_ef68b340-85e4-483c-90e8-5e8c8d8ee4ee > (没有那个文件或目录) > at java.io.FileOutputStream.open(Native Method) > at java.io.FileOutputStream.(FileOutputStream.java:212) > at > org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:110) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73) > at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147086#comment-16147086 ] Jacek Laskowski commented on SPARK-21728: - Thanks [~sowen]. I'll label it as such when I know if it merits one (looks so, but waiting for a response from [~vanzin] or others who'd know). Sent out an email to the Spark user mailing list today. > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21806) BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading
[ https://issues.apache.org/jira/browse/SPARK-21806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21806. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19038 [https://github.com/apache/spark/pull/19038] > BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading > -- > > Key: SPARK-21806 > URL: https://issues.apache.org/jira/browse/SPARK-21806 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 >Reporter: Marc Kaminski >Assignee: Sean Owen >Priority: Minor > Labels: releasenotes > Fix For: 2.3.0 > > Attachments: PRROC_example.jpeg > > > I would like to reference to a [discussion in scikit-learn| > https://github.com/scikit-learn/scikit-learn/issues/4223], as this behavior > is probably based on the scikit implementation. > Summary: > Currently, the y-axis intercept of the precision recall curve is set to (0.0, > 1.0). This behavior is not ideal in certain edge cases (see example below) > and can also have an impact on cross validation, when optimization metric is > set to "areaUnderPR". > Please consider [blucena's > post|https://github.com/scikit-learn/scikit-learn/issues/4223#issuecomment-215273613] > for possible alternatives. > Edge case example: > Consider a bad classifier, that assigns a high probability to all samples. A > possible output might look like this: > ||Real label || Score || > |1.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 0.95 | > |0.0 | 0.95 | > |1.0 | 1.0 | > This results in the following pr points (first line set by default): > ||Threshold || Recall ||Precision || > |1.0 | 0.0 | 1.0 | > |0.95| 1.0 | 0.2 | > |0.0| 1.0 | 0,16 | > The auPRC would be around 0.6. Classifiers with a more differentiated > probability assignment will be falsely assumed to perform worse in regard to > this auPRC. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
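To make the edge case concrete, a hedged snippet that rebuilds the (score, label) pairs from the table in the description and inspects the PR curve; it assumes a SparkContext named {{sc}} (e.g. in spark-shell).
{code}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// (score, label) pairs taken from the description's table
val scoreAndLabels = sc.parallelize(Seq(
  (1.0, 1.0),
  (1.0, 0.0), (1.0, 0.0), (1.0, 0.0), (1.0, 0.0),
  (1.0, 0.0), (1.0, 0.0), (1.0, 0.0), (1.0, 0.0),
  (0.95, 0.0), (0.95, 0.0),
  (1.0, 1.0)))

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
metrics.pr().collect().foreach(println)  // before this fix, the curve starts at the synthetic (0.0, 1.0)
println(metrics.areaUnderPR())           // inflated by that first point, roughly 0.6 here
{code}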
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147037#comment-16147037 ] Sean Owen commented on SPARK-21728: --- You can label such issues with 'releasenotes'. I consider that for breaking changes or bug fixes that change behavior. I don't know if this is one of the few important items I'd highlight. But if someone's in doubt, tag it, and the release manager can judge. > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21806) BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading
[ https://issues.apache.org/jira/browse/SPARK-21806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-21806: - Assignee: Sean Owen Labels: releasenotes (was: ) > BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading > -- > > Key: SPARK-21806 > URL: https://issues.apache.org/jira/browse/SPARK-21806 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 >Reporter: Marc Kaminski >Assignee: Sean Owen >Priority: Minor > Labels: releasenotes > Attachments: PRROC_example.jpeg > > > I would like to reference to a [discussion in scikit-learn| > https://github.com/scikit-learn/scikit-learn/issues/4223], as this behavior > is probably based on the scikit implementation. > Summary: > Currently, the y-axis intercept of the precision recall curve is set to (0.0, > 1.0). This behavior is not ideal in certain edge cases (see example below) > and can also have an impact on cross validation, when optimization metric is > set to "areaUnderPR". > Please consider [blucena's > post|https://github.com/scikit-learn/scikit-learn/issues/4223#issuecomment-215273613] > for possible alternatives. > Edge case example: > Consider a bad classifier, that assigns a high probability to all samples. A > possible output might look like this: > ||Real label || Score || > |1.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 0.95 | > |0.0 | 0.95 | > |1.0 | 1.0 | > This results in the following pr points (first line set by default): > ||Threshold || Recall ||Precision || > |1.0 | 0.0 | 1.0 | > |0.95| 1.0 | 0.2 | > |0.0| 1.0 | 0,16 | > The auPRC would be around 0.6. Classifiers with a more differentiated > probability assignment will be falsely assumed to perform worse in regard to > this auPRC. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21829) Enable config to permanently blacklist a list of nodes
[ https://issues.apache.org/jira/browse/SPARK-21829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21829. --- Resolution: Won't Fix > Enable config to permanently blacklist a list of nodes > -- > > Key: SPARK-21829 > URL: https://issues.apache.org/jira/browse/SPARK-21829 > Project: Spark > Issue Type: New Feature > Components: Scheduler, Spark Core >Affects Versions: 2.1.1, 2.2.0 >Reporter: Luca Canali >Priority: Minor > > The idea for this proposal comes from a performance incident in a local > cluster where a job was found very slow because of a long tail of stragglers > due to 2 nodes in the cluster being slow to access a remote filesystem. > The issue was limited to the 2 machines and was related to external > configurations: the 2 machines that performed badly when accessing the remote > file system were behaving normally for other jobs in the cluster (a shared > YARN cluster). > With this new feature I propose to introduce a mechanism to allow users to > specify a list of nodes in the cluster where executors/tasks should not run > for a specific job. > The proposed implementation that I tested (see PR) uses the Spark blacklist > mechanism. With the parameter spark.blacklist.alwaysBlacklistedNodes, a list > of user-specified nodes is added to the blacklist at the start of the Spark > Context and it never expires. > I have tested this on a YARN cluster on a case taken from the original > production problem and I confirm a performance improvement of about 5x for > the specific test case I have. I imagine that there can be other cases where > Spark users may want to blacklist a set of nodes. This can be used for > troubleshooting, including cases where certain nodes/executors are slow for a > given workload and this is caused by external agents, so the anomaly is not > picked up by the cluster manager. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
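For clarity, a hedged sketch of how the proposed parameter was meant to be set. Note that spark.blacklist.alwaysBlacklistedNodes comes only from this ticket's PR and, given the Won't Fix resolution above, is not an actual Spark configuration key; the node names are placeholders. spark.blacklist.enabled is the existing blacklist switch.
{code}
// Illustration of the proposal only -- the second key was never merged into Spark.
val conf = new org.apache.spark.SparkConf()
  .set("spark.blacklist.enabled", "true")
  .set("spark.blacklist.alwaysBlacklistedNodes", "node07.example.com,node12.example.com")
{code}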
[jira] [Resolved] (SPARK-21873) CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka
[ https://issues.apache.org/jira/browse/SPARK-21873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21873. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19059 [https://github.com/apache/spark/pull/19059] > CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka > --- > > Key: SPARK-21873 > URL: https://issues.apache.org/jira/browse/SPARK-21873 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Yuval Itzchakov >Assignee: Yuval Itzchakov >Priority: Minor > Fix For: 2.3.0 > > Original Estimate: 0h > Remaining Estimate: 0h > > In Scala, using `return` inside a function causes a `NonLocalReturnControl` > exception to be thrown and caught in order to escape the current scope. > While profiling Structured Streaming in production, it clearly shows: > !https://user-images.githubusercontent.com/3448320/29743366-4149ef78-8a99-11e7-94d6-f0cbb691134a.png! > This happens during a 1 minute profiling session on a single executor. The > code is: > {code:java} > while (toFetchOffset != UNKNOWN_OFFSET) { > try { > return fetchData(toFetchOffset, untilOffset, pollTimeoutMs, > failOnDataLoss) > } catch { > case e: OffsetOutOfRangeException => > // When there is some error thrown, it's better to use a new > consumer to drop all cached > // states in the old consumer. We don't need to worry about the > performance because this > // is not a common path. > resetConsumer() > reportDataLoss(failOnDataLoss, s"Cannot fetch offset > $toFetchOffset", e) > toFetchOffset = getEarliestAvailableOffsetBetween(toFetchOffset, > untilOffset) > } > } > {code} > This happens because this method is converted to a function which is ran > inside: > {code:java} > private def runUninterruptiblyIfPossible[T](body: => T): T > {code} > We should avoid using `return` in general, and here specifically as it is a > hot path for applications using Kafka. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
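As an aside, a hedged sketch (not necessarily the merged patch) of how the loop in the description can avoid {{return}}, which is what produces the NonLocalReturnControl inside runUninterruptiblyIfPossible; the element type of the result is assumed.
{code}
import org.apache.kafka.clients.consumer.ConsumerRecord

// Element type assumed; fetchData, resetConsumer, reportDataLoss and
// getEarliestAvailableOffsetBetween are the helpers shown in the description.
var fetched: Option[ConsumerRecord[Array[Byte], Array[Byte]]] = None
while (fetched.isEmpty && toFetchOffset != UNKNOWN_OFFSET) {
  try {
    fetched = Some(fetchData(toFetchOffset, untilOffset, pollTimeoutMs, failOnDataLoss))
  } catch {
    case e: OffsetOutOfRangeException =>
      resetConsumer()
      reportDataLoss(failOnDataLoss, s"Cannot fetch offset $toFetchOffset", e)
      toFetchOffset = getEarliestAvailableOffsetBetween(toFetchOffset, untilOffset)
  }
}
// afterwards, handle the not-fetched case the same way the original method does after its loop
{code}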
[jira] [Assigned] (SPARK-21873) CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka
[ https://issues.apache.org/jira/browse/SPARK-21873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-21873: - Assignee: Yuval Itzchakov Target Version/s: (was: 2.2.1) > CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka > --- > > Key: SPARK-21873 > URL: https://issues.apache.org/jira/browse/SPARK-21873 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Yuval Itzchakov >Assignee: Yuval Itzchakov >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > In Scala, using `return` inside a function causes a `NonLocalReturnControl` > exception to be thrown and caught in order to escape the current scope. > While profiling Structured Streaming in production, it clearly shows: > !https://user-images.githubusercontent.com/3448320/29743366-4149ef78-8a99-11e7-94d6-f0cbb691134a.png! > This happens during a 1 minute profiling session on a single executor. The > code is: > {code:java} > while (toFetchOffset != UNKNOWN_OFFSET) { > try { > return fetchData(toFetchOffset, untilOffset, pollTimeoutMs, > failOnDataLoss) > } catch { > case e: OffsetOutOfRangeException => > // When there is some error thrown, it's better to use a new > consumer to drop all cached > // states in the old consumer. We don't need to worry about the > performance because this > // is not a common path. > resetConsumer() > reportDataLoss(failOnDataLoss, s"Cannot fetch offset > $toFetchOffset", e) > toFetchOffset = getEarliestAvailableOffsetBetween(toFetchOffset, > untilOffset) > } > } > {code} > This happens because this method is converted to a function which is ran > inside: > {code:java} > private def runUninterruptiblyIfPossible[T](body: => T): T > {code} > We should avoid using `return` in general, and here specifically as it is a > hot path for applications using Kafka. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21873) CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka
[ https://issues.apache.org/jira/browse/SPARK-21873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21873: -- Issue Type: Improvement (was: Bug) > CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka > --- > > Key: SPARK-21873 > URL: https://issues.apache.org/jira/browse/SPARK-21873 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Yuval Itzchakov >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > In Scala, using `return` inside a function causes a `NonLocalReturnControl` > exception to be thrown and caught in order to escape the current scope. > While profiling Structured Streaming in production, it clearly shows: > !https://user-images.githubusercontent.com/3448320/29743366-4149ef78-8a99-11e7-94d6-f0cbb691134a.png! > This happens during a 1 minute profiling session on a single executor. The > code is: > {code:java} > while (toFetchOffset != UNKNOWN_OFFSET) { > try { > return fetchData(toFetchOffset, untilOffset, pollTimeoutMs, > failOnDataLoss) > } catch { > case e: OffsetOutOfRangeException => > // When there is some error thrown, it's better to use a new > consumer to drop all cached > // states in the old consumer. We don't need to worry about the > performance because this > // is not a common path. > resetConsumer() > reportDataLoss(failOnDataLoss, s"Cannot fetch offset > $toFetchOffset", e) > toFetchOffset = getEarliestAvailableOffsetBetween(toFetchOffset, > untilOffset) > } > } > {code} > This happens because this method is converted to a function which is ran > inside: > {code:java} > private def runUninterruptiblyIfPossible[T](body: => T): T > {code} > We should avoid using `return` in general, and here specifically as it is a > hot path for applications using Kafka. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21534) PickleException when creating dataframe from python row with empty bytearray
[ https://issues.apache.org/jira/browse/SPARK-21534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21534: Assignee: (was: Apache Spark) > PickleException when creating dataframe from python row with empty bytearray > > > Key: SPARK-21534 > URL: https://issues.apache.org/jira/browse/SPARK-21534 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Maciej Bryński > > {code} > spark.createDataFrame(spark.sql("select unhex('') as xx").rdd.map(lambda x: > {"abc": x.xx})).show() > {code} > This code creates exception. It looks like corner-case. > {code} > net.razorvine.pickle.PickleException: invalid pickle data for bytearray; > expected 1 or 2 args, got 0 > at > net.razorvine.pickle.objects.ByteArrayConstructor.construct(ByteArrayConstructor.java:20) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:99) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:152) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:151) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at scala.Option.foreach(Option.scala:257) > at > 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) > at > org.apache.spark.sql.Dataset.org$apac
[jira] [Commented] (SPARK-21534) PickleException when creating dataframe from python row with empty bytearray
[ https://issues.apache.org/jira/browse/SPARK-21534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146925#comment-16146925 ] Apache Spark commented on SPARK-21534: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/19085 > PickleException when creating dataframe from python row with empty bytearray > > > Key: SPARK-21534 > URL: https://issues.apache.org/jira/browse/SPARK-21534 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Maciej Bryński > > {code} > spark.createDataFrame(spark.sql("select unhex('') as xx").rdd.map(lambda x: > {"abc": x.xx})).show() > {code} > This code creates exception. It looks like corner-case. > {code} > net.razorvine.pickle.PickleException: invalid pickle data for bytearray; > expected 1 or 2 args, got 0 > at > net.razorvine.pickle.objects.ByteArrayConstructor.construct(ByteArrayConstructor.java:20) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:99) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:152) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:151) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at > 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336) > at > org.apache.spark.
[jira] [Assigned] (SPARK-21534) PickleException when creating dataframe from python row with empty bytearray
[ https://issues.apache.org/jira/browse/SPARK-21534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21534: Assignee: Apache Spark > PickleException when creating dataframe from python row with empty bytearray > > > Key: SPARK-21534 > URL: https://issues.apache.org/jira/browse/SPARK-21534 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Maciej Bryński >Assignee: Apache Spark > > {code} > spark.createDataFrame(spark.sql("select unhex('') as xx").rdd.map(lambda x: > {"abc": x.xx})).show() > {code} > This code creates exception. It looks like corner-case. > {code} > net.razorvine.pickle.PickleException: invalid pickle data for bytearray; > expected 1 or 2 args, got 0 > at > net.razorvine.pickle.objects.ByteArrayConstructor.construct(ByteArrayConstructor.java:20) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:99) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:152) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:151) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at 
scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) > at > org.apache.s
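The stack trace above pinpoints the failure: Pyrolite's {{ByteArrayConstructor}} rejects the zero-argument form that an empty {{bytearray}} pickles to ("expected 1 or 2 args, got 0"). One possible direction, sketched below in Scala, is to register a constructor that tolerates zero arguments. The names introduced here ({{TolerantByteArrayConstructor}}, {{register}}) are hypothetical, and this is only a sketch of the idea, not necessarily the change that landed in Spark.

{code:scala}
import net.razorvine.pickle.Unpickler
import net.razorvine.pickle.objects.ByteArrayConstructor

// A pickle constructor that also accepts the zero-argument form produced by
// an empty bytearray, instead of throwing "expected 1 or 2 args, got 0".
class TolerantByteArrayConstructor extends ByteArrayConstructor {
  override def construct(args: Array[AnyRef]): AnyRef =
    if (args.isEmpty) Array.emptyByteArray else super.construct(args)
}

object TolerantByteArrayConstructor {
  // Hypothetical registration helper; the pickled module name differs between
  // Python 2 ("__builtin__") and Python 3 ("builtins"), so both are covered.
  def register(): Unit =
    Seq("__builtin__", "builtins").foreach { module =>
      Unpickler.registerConstructor(module, "bytearray", new TolerantByteArrayConstructor)
    }
}
{code}

Registering such a constructor once per JVM, before any unpickling happens, would let the corner case above round-trip as an empty byte array instead of failing the task.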
[jira] [Updated] (SPARK-21873) CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka
[ https://issues.apache.org/jira/browse/SPARK-21873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuval Itzchakov updated SPARK-21873: Description: In Scala, using `return` inside a function causes a `NonLocalReturnControl` exception to be thrown and caught in order to escape the current scope. While profiling Structured Streaming in production, it clearly shows: !https://user-images.githubusercontent.com/3448320/29743366-4149ef78-8a99-11e7-94d6-f0cbb691134a.png! This happens during a 1 minute profiling session on a single executor. The code is: {code:java} while (toFetchOffset != UNKNOWN_OFFSET) { try { return fetchData(toFetchOffset, untilOffset, pollTimeoutMs, failOnDataLoss) } catch { case e: OffsetOutOfRangeException => // When there is some error thrown, it's better to use a new consumer to drop all cached // states in the old consumer. We don't need to worry about the performance because this // is not a common path. resetConsumer() reportDataLoss(failOnDataLoss, s"Cannot fetch offset $toFetchOffset", e) toFetchOffset = getEarliestAvailableOffsetBetween(toFetchOffset, untilOffset) } } {code} This happens because this method is converted to a function which is run inside: {code:java} private def runUninterruptiblyIfPossible[T](body: => T): T {code} We should avoid using `return` in general, and here specifically as it is a hot path for applications using Kafka. was: In Scala, using `return` inside a function causes a `NonLocalReturnControl` exception to be thrown and caught in order to escape the current scope. While profiling Structured Streaming in production, it clearly shows: !https://user-images.githubusercontent.com/3448320/29743366-4149ef78-8a99-11e7-94d6-f0cbb691134a.png! This happens during a 1 minute profiling session on a single executor. The code is: {code:scala} while (toFetchOffset != UNKNOWN_OFFSET) { try { return fetchData(toFetchOffset, untilOffset, pollTimeoutMs, failOnDataLoss) } catch { case e: OffsetOutOfRangeException => // When there is some error thrown, it's better to use a new consumer to drop all cached // states in the old consumer. We don't need to worry about the performance because this // is not a common path. resetConsumer() reportDataLoss(failOnDataLoss, s"Cannot fetch offset $toFetchOffset", e) toFetchOffset = getEarliestAvailableOffsetBetween(toFetchOffset, untilOffset) } } {code} This happens because this method is converted to a function which is run inside: {code:scala} private def runUninterruptiblyIfPossible[T](body: => T): T {code} We should avoid using `return` in general, and here specifically as it is a hot path for applications using Kafka. > CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka > --- > > Key: SPARK-21873 > URL: https://issues.apache.org/jira/browse/SPARK-21873 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Yuval Itzchakov >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > In Scala, using `return` inside a function causes a `NonLocalReturnControl` > exception to be thrown and caught in order to escape the current scope. > While profiling Structured Streaming in production, it clearly shows: > !https://user-images.githubusercontent.com/3448320/29743366-4149ef78-8a99-11e7-94d6-f0cbb691134a.png! > This happens during a 1 minute profiling session on a single executor. 
The > code is: > {code:java} > while (toFetchOffset != UNKNOWN_OFFSET) { > try { > return fetchData(toFetchOffset, untilOffset, pollTimeoutMs, > failOnDataLoss) > } catch { > case e: OffsetOutOfRangeException => > // When there is some error thrown, it's better to use a new > consumer to drop all cached > // states in the old consumer. We don't need to worry about the > performance because this > // is not a common path. > resetConsumer() > reportDataLoss(failOnDataLoss, s"Cannot fetch offset > $toFetchOffset", e) > toFetchOffset = getEarliestAvailableOffsetBetween(toFetchOffset, > untilOffset) > } > } > {code} > This happens because this method is converted to a function which is run > inside: > {code:java} > private def runUninterruptiblyIfPossible[T](body: => T): T > {code} > We should avoid using `return` in general, and here specifically as it is a > hot path for applications using Kafka. -- This message was sent by Atlassian JIRA (v6.4.14#64029) -
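As a concrete illustration of the last point, the loop can be restructured so the fetched result is carried in a local variable instead of escaping through {{return}}; a by-name wrapper such as {{runUninterruptiblyIfPossible}} then never sees a {{NonLocalReturnControl}} on the successful path. The sketch below is self-contained, with stubbed-out stand-ins for the real {{CachedKafkaConsumer}} collaborators, so it shows the shape of such a change rather than actual Spark code.

{code:scala}
object ReturnFreeFetchSketch {
  // Hypothetical stand-ins for the real members; shapes are illustrative only.
  final val UNKNOWN_OFFSET: Long = -2L
  final case class Record(offset: Long, value: String)

  private def fetchData(offset: Long): Record = Record(offset, "payload")      // stub
  private def resetConsumer(): Unit = ()                                        // stub
  private def reportDataLoss(message: String): Unit = println(message)          // stub
  private def earliestAvailableOffset(offset: Long): Long = UNKNOWN_OFFSET      // stub

  // Same control flow as the loop above, but the result is held in a local
  // variable rather than returned non-locally, so no NonLocalReturnControl is
  // thrown when this body runs inside a by-name wrapper.
  def get(startOffset: Long): Option[Record] = {
    var toFetchOffset = startOffset
    var fetched: Option[Record] = None
    while (fetched.isEmpty && toFetchOffset != UNKNOWN_OFFSET) {
      try {
        fetched = Some(fetchData(toFetchOffset))
      } catch {
        case e: IllegalStateException => // stand-in for OffsetOutOfRangeException
          resetConsumer()
          reportDataLoss(s"Cannot fetch offset $toFetchOffset: ${e.getMessage}")
          toFetchOffset = earliestAvailableOffset(toFetchOffset)
      }
    }
    fetched
  }
}
{code}

The method's value is now an ordinary expression, which keeps the hot path free of exception-based control flow while leaving the retry-on-data-loss behaviour unchanged.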
[jira] [Assigned] (SPARK-21873) CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka
[ https://issues.apache.org/jira/browse/SPARK-21873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21873: Assignee: (was: Apache Spark) > CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka > --- > > Key: SPARK-21873 > URL: https://issues.apache.org/jira/browse/SPARK-21873 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Yuval Itzchakov >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > In Scala, using `return` inside a function causes a `NonLocalReturnControl` > exception to be thrown and caught in order to escape the current scope. > While profiling Structured Streaming in production, it clearly shows: > !https://user-images.githubusercontent.com/3448320/29743366-4149ef78-8a99-11e7-94d6-f0cbb691134a.png! > This happens during a 1 minute profiling session on a single executor. The > code is: > {code:scala} > while (toFetchOffset != UNKNOWN_OFFSET) { > try { > return fetchData(toFetchOffset, untilOffset, pollTimeoutMs, > failOnDataLoss) > } catch { > case e: OffsetOutOfRangeException => > // When there is some error thrown, it's better to use a new > consumer to drop all cached > // states in the old consumer. We don't need to worry about the > performance because this > // is not a common path. > resetConsumer() > reportDataLoss(failOnDataLoss, s"Cannot fetch offset > $toFetchOffset", e) > toFetchOffset = getEarliestAvailableOffsetBetween(toFetchOffset, > untilOffset) > } > } > {code} > This happens because this method is converted to a function which is ran > inside: > {code:scala} > private def runUninterruptiblyIfPossible[T](body: => T): T > {code} > We should avoid using `return` in general, and here specifically as it is a > hot path for applications using Kafka. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21873) CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka
[ https://issues.apache.org/jira/browse/SPARK-21873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21873: Assignee: Apache Spark > CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka > --- > > Key: SPARK-21873 > URL: https://issues.apache.org/jira/browse/SPARK-21873 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Yuval Itzchakov >Assignee: Apache Spark >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > In Scala, using `return` inside a function causes a `NonLocalReturnControl` > exception to be thrown and caught in order to escape the current scope. > While profiling Structured Streaming in production, it clearly shows: > !https://user-images.githubusercontent.com/3448320/29743366-4149ef78-8a99-11e7-94d6-f0cbb691134a.png! > This happens during a 1 minute profiling session on a single executor. The > code is: > {code:scala} > while (toFetchOffset != UNKNOWN_OFFSET) { > try { > return fetchData(toFetchOffset, untilOffset, pollTimeoutMs, > failOnDataLoss) > } catch { > case e: OffsetOutOfRangeException => > // When there is some error thrown, it's better to use a new > consumer to drop all cached > // states in the old consumer. We don't need to worry about the > performance because this > // is not a common path. > resetConsumer() > reportDataLoss(failOnDataLoss, s"Cannot fetch offset > $toFetchOffset", e) > toFetchOffset = getEarliestAvailableOffsetBetween(toFetchOffset, > untilOffset) > } > } > {code} > This happens because this method is converted to a function which is ran > inside: > {code:scala} > private def runUninterruptiblyIfPossible[T](body: => T): T > {code} > We should avoid using `return` in general, and here specifically as it is a > hot path for applications using Kafka. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21873) CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka
[ https://issues.apache.org/jira/browse/SPARK-21873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146901#comment-16146901 ] Apache Spark commented on SPARK-21873: -- User 'YuvalItzchakov' has created a pull request for this issue: https://github.com/apache/spark/pull/19059 > CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka > --- > > Key: SPARK-21873 > URL: https://issues.apache.org/jira/browse/SPARK-21873 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Yuval Itzchakov >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > In Scala, using `return` inside a function causes a `NonLocalReturnControl` > exception to be thrown and caught in order to escape the current scope. > While profiling Structured Streaming in production, it clearly shows: > !https://user-images.githubusercontent.com/3448320/29743366-4149ef78-8a99-11e7-94d6-f0cbb691134a.png! > This happens during a 1 minute profiling session on a single executor. The > code is: > {code:scala} > while (toFetchOffset != UNKNOWN_OFFSET) { > try { > return fetchData(toFetchOffset, untilOffset, pollTimeoutMs, > failOnDataLoss) > } catch { > case e: OffsetOutOfRangeException => > // When there is some error thrown, it's better to use a new > consumer to drop all cached > // states in the old consumer. We don't need to worry about the > performance because this > // is not a common path. > resetConsumer() > reportDataLoss(failOnDataLoss, s"Cannot fetch offset > $toFetchOffset", e) > toFetchOffset = getEarliestAvailableOffsetBetween(toFetchOffset, > untilOffset) > } > } > {code} > This happens because this method is converted to a function which is ran > inside: > {code:scala} > private def runUninterruptiblyIfPossible[T](body: => T): T > {code} > We should avoid using `return` in general, and here specifically as it is a > hot path for applications using Kafka. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21254) History UI: Taking over 1 minute for initial page display
[ https://issues.apache.org/jira/browse/SPARK-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21254: -- Fix Version/s: 2.2.1 > History UI: Taking over 1 minute for initial page display > - > > Key: SPARK-21254 > URL: https://issues.apache.org/jira/browse/SPARK-21254 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Dmitry Parfenchik >Assignee: Dmitry Parfenchik >Priority: Minor > Fix For: 2.2.1, 2.3.0 > > Attachments: screenshot-1.png > > > Currently, on the first page load (if there is no limit set) the whole job > execution history is loaded, since the beginning of time. With a large number of > rows returned (10k+), page load time grows dramatically, causing a 1min+ delay > in Chrome and freezing the process in Firefox, Safari and IE. > A simple inspection in Chrome shows that the network is not an issue here and > only adds a small latency (<1s), while most of the time is spent in the UI > processing the results, according to Chrome devtools: > !screenshot-1.png! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21873) CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka
Yuval Itzchakov created SPARK-21873: --- Summary: CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka Key: SPARK-21873 URL: https://issues.apache.org/jira/browse/SPARK-21873 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.2.0, 2.1.1, 2.1.0 Reporter: Yuval Itzchakov Priority: Minor In Scala, using `return` inside a function causes a `NonLocalReturnControl` exception to be thrown and caught in order to escape the current scope. While profiling Structured Streaming in production, it clearly shows: !https://user-images.githubusercontent.com/3448320/29743366-4149ef78-8a99-11e7-94d6-f0cbb691134a.png! This happens during a 1 minute profiling session on a single executor. The code is: {code:scala} while (toFetchOffset != UNKNOWN_OFFSET) { try { return fetchData(toFetchOffset, untilOffset, pollTimeoutMs, failOnDataLoss) } catch { case e: OffsetOutOfRangeException => // When there is some error thrown, it's better to use a new consumer to drop all cached // states in the old consumer. We don't need to worry about the performance because this // is not a common path. resetConsumer() reportDataLoss(failOnDataLoss, s"Cannot fetch offset $toFetchOffset", e) toFetchOffset = getEarliestAvailableOffsetBetween(toFetchOffset, untilOffset) } } {code} This happens because this method is converted to a function which is run inside: {code:scala} private def runUninterruptiblyIfPossible[T](body: => T): T {code} We should avoid using `return` in general, and here specifically as it is a hot path for applications using Kafka. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21872) Is job duration value of Spark Jobs page on Web UI correct?
[ https://issues.apache.org/jira/browse/SPARK-21872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21872. --- Resolution: Not A Problem (Questions belong on the mailing list to start.) Yes, because in general an app is frequently waiting on resources throughout its execution, for example waiting for executor slots to free up. This should be counted as part of its execution time. > Is job duration value of Spark Jobs page on Web UI correct? > > > Key: SPARK-21872 > URL: https://issues.apache.org/jira/browse/SPARK-21872 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: iamhumanbeing >Priority: Minor > > I have submitted 2 Spark jobs at the same time, but only one gets running; the > other is waiting for resources. However, the Web UI displays both jobs as > running, and the duration value of the job waiting for resources keeps > increasing. So Job 7 only ran for 14s, but its duration value is 29s. > Active Jobs (2) > Job Id ▾ Description Submitted > Duration Stages: Succeeded/Total Tasks (for all stages): Succeeded/Total > 7 (kill) count at :30 2017/08/30 11:33:46 7 s 0/1 > 0/100 > 6 (kill) count at :30 2017/08/30 11:33:46 8 s 0/2 > 15/127 (2 running) > after the jobs finished > 7 count at :30 2017/08/30 11:33:46 29 s 1/1 100/100 > 6 count at :30 2017/08/30 11:33:46 16 s 1/1 (1 skipped) > 27/27 (100 skipped) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
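The resolution comes down to how the Jobs page measures duration: it is clocked from submission, not from when the job first obtains executor slots, so two jobs submitted together both accrue duration while one of them waits. A minimal sketch of that accounting follows, using hypothetical field names rather than Spark's actual UI data model; on this reading, Job 7's 29 s is consistent with it waiting behind Job 6's 16 s run before doing its own ~14 s of work.

{code:scala}
// Hypothetical model, for illustration only (not Spark's actual UI classes).
object JobDurationSketch {
  final case class JobTimes(submissionTime: Long, completionTime: Option[Long])

  // The displayed duration is "completion (or now) minus submission", so time
  // spent waiting for resources is included in the number the UI shows.
  def durationMillis(job: JobTimes, nowMillis: Long = System.currentTimeMillis()): Long =
    job.completionTime.getOrElse(nowMillis) - job.submissionTime
}
{code}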
[jira] [Commented] (SPARK-21869) A cached Kafka producer should not be closed if any task is using it.
[ https://issues.apache.org/jira/browse/SPARK-21869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146889#comment-16146889 ] Prashant Sharma commented on SPARK-21869: - Yes, looking at it. > A cached Kafka producer should not be closed if any task is using it. > - > > Key: SPARK-21869 > URL: https://issues.apache.org/jira/browse/SPARK-21869 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu > > Right now a cached Kafka producer may be closed if a large task uses it for > more than 10 minutes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
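The property the ticket asks for, never closing a producer out from under a running task, is commonly obtained with reference counting: cache eviction only marks the entry, and the real close happens when the last user releases it. The sketch below is illustrative only, with hypothetical names, and is not the actual change to Spark's Kafka producer cache.

{code:scala}
import java.util.concurrent.atomic.{AtomicBoolean, AtomicInteger}

// A reference-counting wrapper: the resource is closed only once it has been
// evicted from the cache AND no task still holds it. The cache is expected to
// remove the entry before calling markEvicted(), so no new acquire() arrives
// for an evicted entry.
final class RefCounted[P](resource: P, doClose: P => Unit) {
  private val refs = new AtomicInteger(0)
  private val evicted = new AtomicBoolean(false)
  private val closed = new AtomicBoolean(false)

  def acquire(): P = { refs.incrementAndGet(); resource }

  def release(): Unit = { refs.decrementAndGet(); maybeClose() }

  def markEvicted(): Unit = { evicted.set(true); maybeClose() }

  // compareAndSet guards against the close running twice when a release and
  // an eviction race each other.
  private def maybeClose(): Unit =
    if (evicted.get() && refs.get() == 0 && closed.compareAndSet(false, true)) {
      doClose(resource)
    }
}
{code}

A task would call acquire() when it starts writing and release() in a finally block, while the cache's eviction policy calls markEvicted() instead of closing the producer directly.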
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146780#comment-16146780 ] Jacek Laskowski commented on SPARK-21728: - I think the change is user-visible and therefore deserves to be included in the release notes for 2.3 (I recall there is a component or label for marking changes like that in a special way). /cc [~sowen] [~hyukjin.kwon] > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicate code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
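For the second option described above (initialization done in a way that can be overridden later), one possible shape, shown here as a hedged sketch with hypothetical names rather than the change that actually landed, is to route early {{SparkSubmit}} output through a facade whose real logging configuration is applied lazily and can be replaced once the application (for example a shell) knows how it wants logging set up.

{code:scala}
// Hypothetical sketch only: a logging facade whose configuration step is
// supplied late and applied lazily, so early SparkSubmit code can log without
// locking in the wrong setup for special applications such as the shells.
object DeferredLogging {
  @volatile private var configurator: () => Unit = () => ()  // default: no-op
  @volatile private var applied = false

  // Installed by SparkSubmit (or the application) once it knows what will run;
  // a replacement configurator takes effect on the next log call.
  def setConfigurator(f: () => Unit): Unit = synchronized {
    configurator = f
    applied = false
  }

  def log(msg: String): Unit = {
    if (!applied) synchronized {
      if (!applied) {
        configurator()
        applied = true
      }
    }
    // Stand-in for a real logger call; println keeps the sketch dependency-free.
    println(msg)
  }
}
{code}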