[jira] [Assigned] (SPARK-21880) [spark UI]In the SQL table page, modify jobs trace information
[ https://issues.apache.org/jira/browse/SPARK-21880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21880: Assignee: Apache Spark > [spark UI]In the SQL table page, modify jobs trace information > -- > > Key: SPARK-21880 > URL: https://issues.apache.org/jira/browse/SPARK-21880 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.0 >Reporter: he.qiao >Assignee: Apache Spark >Priority: Minor > > I think it makes sense for "jobs" to change to "job id" in the SQL table > page. Because when job 5 fails, it's easy to misunderstand that five jobs > have failed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21880) [spark UI]In the SQL table page, modify jobs trace information
[ https://issues.apache.org/jira/browse/SPARK-21880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21880: Assignee: (was: Apache Spark) > [spark UI]In the SQL table page, modify jobs trace information > -- > > Key: SPARK-21880 > URL: https://issues.apache.org/jira/browse/SPARK-21880 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.0 >Reporter: he.qiao >Priority: Minor > > I think it makes sense for "jobs" to change to "job id" in the SQL table > page. Because when job 5 fails, it's easy to misunderstand that five jobs > have failed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21880) [spark UI]In the SQL table page, modify jobs trace information
[ https://issues.apache.org/jira/browse/SPARK-21880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148521#comment-16148521 ] Apache Spark commented on SPARK-21880: -- User 'Geek-He' has created a pull request for this issue: https://github.com/apache/spark/pull/19093 > [spark UI]In the SQL table page, modify jobs trace information > -- > > Key: SPARK-21880 > URL: https://issues.apache.org/jira/browse/SPARK-21880 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.0 >Reporter: he.qiao >Priority: Minor > > I think it makes sense for "jobs" to change to "job id" in the SQL table > page. Because when job 5 fails, it's easy to misunderstand that five jobs > have failed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21880) [spark UI]In the SQL table page, modify jobs trace information
[ https://issues.apache.org/jira/browse/SPARK-21880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] he.qiao updated SPARK-21880: Summary: [spark UI]In the SQL table page, modify jobs trace information (was: [spark UI]In the SQL table page, ) > [spark UI]In the SQL table page, modify jobs trace information > -- > > Key: SPARK-21880 > URL: https://issues.apache.org/jira/browse/SPARK-21880 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.0 >Reporter: he.qiao >Priority: Minor > > I think it makes sense for "jobs" to change to "job id" in the SQL table > page. Because when job 5 fails, it's easy to misunderstand that five jobs > have failed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21880) [spark UI]In the SQL table page,
he.qiao created SPARK-21880: --- Summary: [spark UI]In the SQL table page, Key: SPARK-21880 URL: https://issues.apache.org/jira/browse/SPARK-21880 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.2.0 Reporter: he.qiao Priority: Minor I think it makes sense for "jobs" to change to "job id" in the SQL table page. Because when job 5 fails, it's easy to misunderstand that five jobs have failed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21879) Should Scalers handle NaN values?
zhengruifeng created SPARK-21879: Summary: Should Scalers handle NaN values? Key: SPARK-21879 URL: https://issues.apache.org/jira/browse/SPARK-21879 Project: Spark Issue Type: Question Components: ML Affects Versions: 2.3.0 Reporter: zhengruifeng The way {{ML.Scalers}} handle {{NaN}} is somewhat unexpected. The current implementations of {{MinMaxScaler}}/{{MaxAbsScaler}}/{{StandardScaler}} all support {{fit}} and {{transform}} on a dataset containing {{NaN}}. Note that the values in the second column of the following dataframe are all {{NaN}}, and the coefficients of {{min/max}} in {{MinMaxScalerModel}} and {{maxAbs}} in {{MaxAbsScaler}} are wrong. {code} import org.apache.spark.ml.feature._ import org.apache.spark.ml.linalg.{Vector, Vectors} scala> val data = Array( | Vectors.dense(1, Double.NaN, Double.NaN, 2.0), | Vectors.dense(2, Double.NaN, 0.0, 3.0), | Vectors.dense(3, Double.NaN, 0.0, 1.0), | Vectors.dense(6, Double.NaN, 2.0, Double.NaN)).zipWithIndex data: Array[(org.apache.spark.ml.linalg.Vector, Int)] = Array(([1.0,NaN,NaN,2.0],0), ([2.0,NaN,0.0,3.0],1), ([3.0,NaN,0.0,1.0],2), ([6.0,NaN,2.0,NaN],3)) scala> val df = data.toSeq.toDF("features", "id") df: org.apache.spark.sql.DataFrame = [features: vector, id: int] scala> val scaler = new MinMaxScaler().setInputCol("features").setOutputCol("scaled") scaler: org.apache.spark.ml.feature.MinMaxScaler = minMaxScal_7634802f5c81 scala> val model = scaler.fit(df) model: org.apache.spark.ml.feature.MinMaxScalerModel = minMaxScal_7634802f5c81 scala> model.originalMax res1: org.apache.spark.ml.linalg.Vector = [6.0,-1.7976931348623157E308,2.0,3.0] scala> model.originalMin res2: org.apache.spark.ml.linalg.Vector = [1.0,1.7976931348623157E308,0.0,1.0] scala> model.transform(df).select("scaled").collect res3: Array[org.apache.spark.sql.Row] = Array([[0.0,NaN,NaN,0.5]], [[0.2,NaN,0.0,1.0]], [[0.4,NaN,0.0,0.0]], [[1.0,NaN,1.0,NaN]]) scala> val scaler2 = new MaxAbsScaler().setInputCol("features").setOutputCol("scaled") scaler2: org.apache.spark.ml.feature.MaxAbsScaler = maxAbsScal_5d34fa818229 scala> val model2 = scaler2.fit(df) model2: org.apache.spark.ml.feature.MaxAbsScalerModel = maxAbsScal_5d34fa818229 scala> model2.maxAbs res4: org.apache.spark.ml.linalg.Vector = [6.0,1.7976931348623157E308,2.0,3.0] scala> model2.transform(df).select("scaled").collect res5: Array[org.apache.spark.sql.Row] = Array([[0.1,NaN,NaN,0.]], [[0.,NaN,0.0,1.0]], [[0.5,NaN,0.0,0.]], [[1.0,NaN,1.0,NaN]]) scala> val scaler3 = new StandardScaler().setInputCol("features").setOutputCol("scaled") scaler3: org.apache.spark.ml.feature.StandardScaler = stdScal_d8509095e860 scala> val model3 = scaler3.fit(df) model3: org.apache.spark.ml.feature.StandardScalerModel = stdScal_d8509095e860 scala> model3.std res11: org.apache.spark.ml.linalg.Vector = [2.160246899469287,NaN,NaN,NaN] scala> model3.mean res12: org.apache.spark.ml.linalg.Vector = [3.0,NaN,NaN,NaN] scala> model3.transform(df).select("scaled").collect res14: Array[org.apache.spark.sql.Row] = Array([[0.4629100498862757,NaN,NaN,NaN]], [[0.9258200997725514,NaN,NaN,NaN]], [[1.3887301496588271,NaN,NaN,NaN]], [[2.7774602993176543,NaN,NaN,NaN]]) {code} I then tested the scalers in scikit-learn, and they all throw exceptions in both {{fit}} and {{transform}}.
{code} import numpy as np from sklearn.preprocessing import * data = np.array([[-1, 2], [-0.5, 6], [0, np.nan], [1, 1.8]]) data2 = np.array([[-1, 2], [-0.5, 6], [0, 2.0], [1, 1.8]]) for scaler in [StandardScaler(), MinMaxScaler(), MaxAbsScaler(), RobustScaler()]: try: scaler.fit(data) except: print('{0}.fit fails'.format(scaler)) model = scaler.fit(data2) try: model.transform(data) except: print('{0}.transform fails'.format(scaler)) StandardScaler(copy=True, with_mean=True, with_std=True).fit fails StandardScaler(copy=True, with_mean=True, with_std=True).transform fails MinMaxScaler(copy=True, feature_range=(0, 1)).fit fails MinMaxScaler(copy=True, feature_range=(0, 1)).transform fails MaxAbsScaler(copy=True).fit fails MaxAbsScaler(copy=True).transform fails RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True, with_scaling=True).fit fails RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True, with_scaling=True).transform fails {code} I think the behavior of handling {{NaN}} should be kept in line with the implementation in scikit-learn: the scaled data are likely to be fed into classification/regression/clustering or other algorithms, and it will be dangerous if users are unaware of the {{NaN}} values in the scaled data. There may be two choices if we decide to change the behavior: 1, add validation for input data and throw exception
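The first option above, validating the input and throwing, can already be approximated on the user side while the built-in behavior is being discussed. Below is a minimal Scala sketch that reuses the {{df}} from the example above; the helper name {{requireNoNaN}} is hypothetical and not a Spark API.
{code}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.udf

// Hypothetical user-side guard: fail fast if any feature vector contains NaN,
// so the fitted statistics (min/max/maxAbs/std) are never polluted.
def requireNoNaN(ds: Dataset[_], featuresCol: String): Unit = {
  val hasNaN = udf { v: Vector => v.toArray.exists(_.isNaN) }
  val bad = ds.filter(hasNaN(ds(featuresCol))).count()
  require(bad == 0, s"$bad row(s) contain NaN in column '$featuresCol'")
}

requireNoNaN(df, "features")   // throws here instead of silently fitting on NaN
{code}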
[jira] [Assigned] (SPARK-21878) Create SQLMetricsTestUtils
[ https://issues.apache.org/jira/browse/SPARK-21878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21878: Assignee: Xiao Li (was: Apache Spark) > Create SQLMetricsTestUtils > -- > > Key: SPARK-21878 > URL: https://issues.apache.org/jira/browse/SPARK-21878 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Creates `SQLMetricsTestUtils` for the utility functions of both Hive-specific > and the other SQLMetrics test cases. > Also, move two SQLMetrics test cases from sql/hive to sql/core. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21878) Create SQLMetricsTestUtils
[ https://issues.apache.org/jira/browse/SPARK-21878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21878: Assignee: Apache Spark (was: Xiao Li) > Create SQLMetricsTestUtils > -- > > Key: SPARK-21878 > URL: https://issues.apache.org/jira/browse/SPARK-21878 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark > > Creates `SQLMetricsTestUtils` for the utility functions of both Hive-specific > and the other SQLMetrics test cases. > Also, move two SQLMetrics test cases from sql/hive to sql/core. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21878) Create SQLMetricsTestUtils
[ https://issues.apache.org/jira/browse/SPARK-21878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148494#comment-16148494 ] Apache Spark commented on SPARK-21878: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/19092 > Create SQLMetricsTestUtils > -- > > Key: SPARK-21878 > URL: https://issues.apache.org/jira/browse/SPARK-21878 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Creates `SQLMetricsTestUtils` for the utility functions of both Hive-specific > and the other SQLMetrics test cases. > Also, move two SQLMetrics test cases from sql/hive to sql/core. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21878) Create SQLMetricsTestUtils
Xiao Li created SPARK-21878: --- Summary: Create SQLMetricsTestUtils Key: SPARK-21878 URL: https://issues.apache.org/jira/browse/SPARK-21878 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Xiao Li Assignee: Xiao Li Creates `SQLMetricsTestUtils` for the utility functions of both Hive-specific and the other SQLMetrics test cases. Also, move two SQLMetrics test cases from sql/hive to sql/core. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20313) Possible lack of join optimization when partitions are in the join condition
[ https://issues.apache.org/jira/browse/SPARK-20313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148463#comment-16148463 ] Tejas Patil commented on SPARK-20313: - I tried to replicate what you shared in the jira but dont see anything wrong with what planner is doing. Comparing both the approaches, `SortMergeJoin` is always being picked. The second approach does joins over individual partitions one by one and then unions the results. Depending on your data size + configs, it might be possible that for your case a hash based join was used which would explain why the later approach is faster. Approach #1 {noformat} val df1 = hc.sql("SELECT * FROM bucketed_partitioned_1").filter(functions.col("ds").between("1", "5")) val df2 = hc.sql("SELECT * FROM bucketed_partitioned_2").filter(functions.col("ds").between("1", "5")) val df3 = df1.join(df2, Seq("ds", "user_id")).explain(true) == Physical Plan == *Project [ds#38, user_id#36, name#37, name#45] +- *SortMergeJoin [ds#38, user_id#36], [ds#46, user_id#44], [ds#38, user_id#36], [ds#46, user_id#44], Inner :- *Sort [ds#38 ASC NULLS FIRST, user_id#36 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(ds#38, user_id#36, 200) : +- *Filter isnotnull(user_id#36) :+- HiveTableScan [user_id#36, name#37, ds#38], HiveTableRelation `default`.`bucketed_partitioned_1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [user_id#36, name#37], [ds#38], [isnotnull(ds#38), (ds#38 >= 1), (ds#38 <= 5)] +- *Sort [ds#46 ASC NULLS FIRST, user_id#44 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(ds#46, user_id#44, 200) +- *Filter isnotnull(user_id#44) +- HiveTableScan [user_id#44, name#45, ds#46], HiveTableRelation `default`.`bucketed_partitioned_2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [user_id#44, name#45], [ds#46], [isnotnull(ds#46), (ds#46 >= 1), (ds#46 <= 5)] {noformat} Approach #2 {noformat} val df1 = hc.sql("SELECT * FROM bucketed_partitioned_1") val df2 = hc.sql("SELECT * FROM bucketed_partitioned_2") val dsValues = Seq("-11-11", "-44-44") val df3 = dsValues.map(dsValue => { val df1filtered = df1.filter(functions.col("ds") === dsValue) val df2filtered = df2.filter(functions.col("ds") === dsValue) df1filtered.join(df2filtered, Seq("user_id")) // part1 removed from join }).reduce(_ union _) == Physical Plan == Union :- *Project [user_id#63, name#64, ds#65, name#71, ds#72] : +- *SortMergeJoin [user_id#63], [user_id#70], [user_id#63], [user_id#70], Inner : :- *Sort [user_id#63 ASC NULLS FIRST], false, 0 : : +- Exchange hashpartitioning(user_id#63, 200) : : +- *Filter isnotnull(user_id#63) : :+- HiveTableScan [user_id#63, name#64, ds#65], HiveTableRelation `default`.`bucketed_partitioned_1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [user_id#63, name#64], [ds#65], [isnotnull(ds#65), (ds#65 = -11-11)] : +- *Sort [user_id#70 ASC NULLS FIRST], false, 0 :+- Exchange hashpartitioning(user_id#70, 200) : +- *Filter isnotnull(user_id#70) : +- HiveTableScan [user_id#70, name#71, ds#72], HiveTableRelation `default`.`bucketed_partitioned_2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [user_id#70, name#71], [ds#72], [isnotnull(ds#72), (ds#72 = -11-11)] +- *Project [user_id#63, name#64, ds#65, name#71, ds#72] +- *SortMergeJoin [user_id#63], [user_id#70], [user_id#63], [user_id#70], Inner :- *Sort [user_id#63 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(user_id#63, 200) : +- *Filter isnotnull(user_id#63) :+- HiveTableScan [user_id#63, name#64, ds#65], HiveTableRelation 
`default`.`bucketed_partitioned_1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [user_id#63, name#64], [ds#65], [isnotnull(ds#65), (ds#65 = -44-44)] +- *Sort [user_id#70 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(user_id#70, 200) +- *Filter isnotnull(user_id#70) +- HiveTableScan [user_id#70, name#71, ds#72], HiveTableRelation `default`.`bucketed_partitioned_2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [user_id#70, name#71], [ds#72], [isnotnull(ds#72), (ds#72 = -44-44)] {noformat} > Possible lack of join optimization when partitions are in the join condition > > > Key: SPARK-20313 > URL: https://issues.apache.org/jira/browse/SPARK-20313 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Albert Meltzer > > Given two tables T1 and T2, partitioned on column part1, the following have > vastly different execution performance: > // initial, slow > {noformat} > val df1 = // load data from
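One quick way to probe the hash-join hypothesis above is to force a broadcast hash join on a single per-partition slice and compare its plan and runtime against the {{SortMergeJoin}} plans shown. This is a rough sketch only; it reuses {{df1}}/{{df2}} from approach #2 and assumes a {{SparkSession}} named {{spark}} (the original snippets use {{hc}}).
{code}
import org.apache.spark.sql.functions
import org.apache.spark.sql.functions.broadcast

// Force a BroadcastHashJoin on one slice and inspect the resulting plan.
val d1 = df1.filter(functions.col("ds") === "-11-11")
val d2 = df2.filter(functions.col("ds") === "-11-11")
d1.join(broadcast(d2), Seq("user_id")).explain(true)

// Or raise the planner's auto-broadcast threshold (bytes) so it can pick a
// broadcast hash join on its own when one side is small enough.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)
{code}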
[jira] [Resolved] (SPARK-21583) Create a ColumnarBatch with ArrowColumnVectors for row based iteration
[ https://issues.apache.org/jira/browse/SPARK-21583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-21583. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18787 [https://github.com/apache/spark/pull/18787] > Create a ColumnarBatch with ArrowColumnVectors for row based iteration > -- > > Key: SPARK-21583 > URL: https://issues.apache.org/jira/browse/SPARK-21583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler > Fix For: 2.3.0 > > > The existing {{ArrowColumnVector}} creates a read-only vector of Arrow data. > It would be useful to be able to create a {{ColumnarBatch}} to allow row > based iteration over multiple {{ArrowColumnVectors}}. This would avoid extra > copying to translate column elements into rows and be more efficient memory > usage while increasing performance. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21583) Create a ColumnarBatch with ArrowColumnVectors for row based iteration
[ https://issues.apache.org/jira/browse/SPARK-21583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin reassigned SPARK-21583: - Assignee: Bryan Cutler > Create a ColumnarBatch with ArrowColumnVectors for row based iteration > -- > > Key: SPARK-21583 > URL: https://issues.apache.org/jira/browse/SPARK-21583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler > > The existing {{ArrowColumnVector}} creates a read-only vector of Arrow data. > It would be useful to be able to create a {{ColumnarBatch}} to allow row > based iteration over multiple {{ArrowColumnVectors}}. This would avoid extra > copying to translate column elements into rows and be more efficient memory > usage while increasing performance. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21534) PickleException when creating dataframe from python row with empty bytearray
[ https://issues.apache.org/jira/browse/SPARK-21534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-21534: Assignee: Liang-Chi Hsieh > PickleException when creating dataframe from python row with empty bytearray > > > Key: SPARK-21534 > URL: https://issues.apache.org/jira/browse/SPARK-21534 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Maciej Bryński >Assignee: Liang-Chi Hsieh > Fix For: 2.3.0 > > > {code} > spark.createDataFrame(spark.sql("select unhex('') as xx").rdd.map(lambda x: > {"abc": x.xx})).show() > {code} > This code creates exception. It looks like corner-case. > {code} > net.razorvine.pickle.PickleException: invalid pickle data for bytearray; > expected 1 or 2 args, got 0 > at > net.razorvine.pickle.objects.ByteArrayConstructor.construct(ByteArrayConstructor.java:20) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:99) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:152) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:151) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at 
scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit
[jira] [Resolved] (SPARK-21534) PickleException when creating dataframe from python row with empty bytearray
[ https://issues.apache.org/jira/browse/SPARK-21534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21534. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19085 [https://github.com/apache/spark/pull/19085] > PickleException when creating dataframe from python row with empty bytearray > > > Key: SPARK-21534 > URL: https://issues.apache.org/jira/browse/SPARK-21534 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Maciej Bryński > Fix For: 2.3.0 > > > {code} > spark.createDataFrame(spark.sql("select unhex('') as xx").rdd.map(lambda x: > {"abc": x.xx})).show() > {code} > This code creates exception. It looks like corner-case. > {code} > net.razorvine.pickle.PickleException: invalid pickle data for bytearray; > expected 1 or 2 args, got 0 > at > net.razorvine.pickle.objects.ByteArrayConstructor.construct(ByteArrayConstructor.java:20) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:99) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:152) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:151) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at > 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336) > at > org.apach
[jira] [Commented] (SPARK-16854) mapWithState Support for Python
[ https://issues.apache.org/jira/browse/SPARK-16854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148343#comment-16148343 ] Radek Ostrowski commented on SPARK-16854: - +1 > mapWithState Support for Python > --- > > Key: SPARK-16854 > URL: https://issues.apache.org/jira/browse/SPARK-16854 > Project: Spark > Issue Type: Task > Components: PySpark >Affects Versions: 1.6.2, 2.0.0 >Reporter: Boaz > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11574) Spark should support StatsD sink out of box
[ https://issues.apache.org/jira/browse/SPARK-11574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148333#comment-16148333 ] Saisai Shao commented on SPARK-11574: - Maybe it is a permission issue on my side; I will ask the PMC to handle it. > Spark should support StatsD sink out of box > --- > > Key: SPARK-11574 > URL: https://issues.apache.org/jira/browse/SPARK-11574 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 1.6.1 >Reporter: Xiaofeng Lin > Fix For: 2.3.0 > > > In order to run spark in production, monitoring is essential. StatsD is such > a common metric reporting mechanism that it should be supported out of the > box. This will enable publishing metrics to monitoring services like > datadog, etc. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11574) Spark should support StatsD sink out of box
[ https://issues.apache.org/jira/browse/SPARK-11574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148328#comment-16148328 ] Xiaofeng Lin commented on SPARK-11574: -- [~jerryshao], yes my JIRA username is still active. It's "xflin". > Spark should support StatsD sink out of box > --- > > Key: SPARK-11574 > URL: https://issues.apache.org/jira/browse/SPARK-11574 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 1.6.1 >Reporter: Xiaofeng Lin > Fix For: 2.3.0 > > > In order to run spark in production, monitoring is essential. StatsD is such > a common metric reporting mechanism that it should be supported out of the > box. This will enable publishing metrics to monitoring services like > datadog, etc. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17321) YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
[ https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao reassigned SPARK-17321: --- Assignee: Saisai Shao > YARN shuffle service should use good disk from yarn.nodemanager.local-dirs > -- > > Key: SPARK-17321 > URL: https://issues.apache.org/jira/browse/SPARK-17321 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.2, 2.0.0, 2.1.1 >Reporter: yunjiong zhao >Assignee: Saisai Shao > Fix For: 2.3.0 > > > We run spark on yarn, after enabled spark dynamic allocation, we notice some > spark application failed randomly due to YarnShuffleService. > From log I found > {quote} > 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: > Error while initializing Netty pipeline > java.lang.NullPointerException > at > org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77) > at > org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159) > at > org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135) > at > org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123) > at > org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116) > at > io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378) > at > io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {quote} > Which caused by the first disk in yarn.nodemanager.local-dirs was broken. > If we enabled spark.yarn.shuffle.stopOnFailure(SPARK-16505) we might lost > hundred nodes which is unacceptable. > We have 12 disks in yarn.nodemanager.local-dirs, so why not use other good > disks if the first one is broken? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11574) Spark should support StatsD sink out of box
[ https://issues.apache.org/jira/browse/SPARK-11574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148319#comment-16148319 ] Hyukjin Kwon commented on SPARK-11574: -- I can't assign the ID either, and I met this issue too - https://issues.apache.org/jira/browse/SPARK-21658?focusedCommentId=16125906&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16125906 - but I just ended up leaving it. > Spark should support StatsD sink out of box > --- > > Key: SPARK-11574 > URL: https://issues.apache.org/jira/browse/SPARK-11574 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 1.6.1 >Reporter: Xiaofeng Lin > Fix For: 2.3.0 > > > In order to run spark in production, monitoring is essential. StatsD is such > a common metric reporting mechanism that it should be supported out of the > box. This will enable publishing metrics to monitoring services like > datadog, etc. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17321) YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
[ https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-17321. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19032 [https://github.com/apache/spark/pull/19032] > YARN shuffle service should use good disk from yarn.nodemanager.local-dirs > -- > > Key: SPARK-17321 > URL: https://issues.apache.org/jira/browse/SPARK-17321 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.2, 2.0.0, 2.1.1 >Reporter: yunjiong zhao > Fix For: 2.3.0 > > > We run spark on yarn, after enabled spark dynamic allocation, we notice some > spark application failed randomly due to YarnShuffleService. > From log I found > {quote} > 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: > Error while initializing Netty pipeline > java.lang.NullPointerException > at > org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77) > at > org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159) > at > org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135) > at > org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123) > at > org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116) > at > io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378) > at > io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {quote} > Which caused by the first disk in yarn.nodemanager.local-dirs was broken. > If we enabled spark.yarn.shuffle.stopOnFailure(SPARK-16505) we might lost > hundred nodes which is unacceptable. > We have 12 disks in yarn.nodemanager.local-dirs, so why not use other good > disks if the first one is broken? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11574) Spark should support StatsD sink out of box
[ https://issues.apache.org/jira/browse/SPARK-11574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148292#comment-16148292 ] Saisai Shao commented on SPARK-11574: - Hi Xiaofeng, is your JIRA username still available? I cannot assign the JIRA to you, since I cannot find your name. [~srowen] [~hyukjin.kwon] do you know how to handle this situation? > Spark should support StatsD sink out of box > --- > > Key: SPARK-11574 > URL: https://issues.apache.org/jira/browse/SPARK-11574 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 1.6.1 >Reporter: Xiaofeng Lin > Fix For: 2.3.0 > > > In order to run spark in production, monitoring is essential. StatsD is such > a common metric reporting mechanism that it should be supported out of the > box. This will enable publishing metrics to monitoring services like > datadog, etc. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11574) Spark should support StatsD sink out of box
[ https://issues.apache.org/jira/browse/SPARK-11574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-11574. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 9518 [https://github.com/apache/spark/pull/9518] > Spark should support StatsD sink out of box > --- > > Key: SPARK-11574 > URL: https://issues.apache.org/jira/browse/SPARK-11574 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 1.6.1 >Reporter: Xiaofeng Lin > Fix For: 2.3.0 > > > In order to run spark in production, monitoring is essential. StatsD is such > a common metric reporting mechanism that it should be supported out of the > box. This will enable publishing metrics to monitoring services like > datadog, etc. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
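For readers wondering what "out of the box" support looks like once this ships (Fix For: 2.3.0 above), the sink is enabled through Spark's standard metrics configuration file. The sketch below is hedged: the {{StatsdSink}} class name follows the merged patch, but the individual host/port/prefix keys are assumptions patterned on how the existing Graphite sink is configured, so check the 2.3.0 documentation for the exact names.
{code}
# metrics.properties sketch (key names other than the class are assumptions)
*.sink.statsd.class=org.apache.spark.metrics.sink.StatsdSink
*.sink.statsd.host=127.0.0.1
*.sink.statsd.port=8125
*.sink.statsd.prefix=myapp
{code}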
[jira] [Assigned] (SPARK-21877) Windows command script can not handle quotes in parameter
[ https://issues.apache.org/jira/browse/SPARK-21877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21877: Assignee: Apache Spark > Windows command script can not handle quotes in parameter > - > > Key: SPARK-21877 > URL: https://issues.apache.org/jira/browse/SPARK-21877 > Project: Spark > Issue Type: Bug > Components: Deploy, Windows >Affects Versions: 2.2.0 > Environment: Spark version: spark-2.2.0-bin-hadoop2.7 > Windows version: Windows 10 >Reporter: Xiaokai Zhao >Assignee: Apache Spark > > All the Windows command scripts cannot handle quotes in parameters. > Running a Windows command script with a parameter that contains quotes reproduces the > bug: > {quote} > C:\Users\meng\software\spark-2.2.0-bin-hadoop2.7> bin\spark-shell > --driver-java-options " -Dfile.encoding=utf-8 " > 'C:\Users\meng\software\spark-2.2.0-bin-hadoop2.7\bin\spark-shell2.cmd" > --driver-java-options "' is not recognized as an internal or external command, > operable program or batch file. > {quote} > Windows recognizes "--driver-java-options" as part of the command. > All the Windows command scripts that contain the following code have the bug. > {quote} > cmd /V /E /C "" %* > {quote} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21877) Windows command script can not handle quotes in parameter
[ https://issues.apache.org/jira/browse/SPARK-21877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21877: Assignee: (was: Apache Spark) > Windows command script can not handle quotes in parameter > - > > Key: SPARK-21877 > URL: https://issues.apache.org/jira/browse/SPARK-21877 > Project: Spark > Issue Type: Bug > Components: Deploy, Windows >Affects Versions: 2.2.0 > Environment: Spark version: spark-2.2.0-bin-hadoop2.7 > Windows version: Windows 10 >Reporter: Xiaokai Zhao > > All the Windows command scripts cannot handle quotes in parameters. > Running a Windows command script with a parameter that contains quotes reproduces the > bug: > {quote} > C:\Users\meng\software\spark-2.2.0-bin-hadoop2.7> bin\spark-shell > --driver-java-options " -Dfile.encoding=utf-8 " > 'C:\Users\meng\software\spark-2.2.0-bin-hadoop2.7\bin\spark-shell2.cmd" > --driver-java-options "' is not recognized as an internal or external command, > operable program or batch file. > {quote} > Windows recognizes "--driver-java-options" as part of the command. > All the Windows command scripts that contain the following code have the bug. > {quote} > cmd /V /E /C "" %* > {quote} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21877) Windows command script can not handle quotes in parameter
[ https://issues.apache.org/jira/browse/SPARK-21877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148278#comment-16148278 ] Apache Spark commented on SPARK-21877: -- User 'minixalpha' has created a pull request for this issue: https://github.com/apache/spark/pull/19090 > Windows command script can not handle quotes in parameter > - > > Key: SPARK-21877 > URL: https://issues.apache.org/jira/browse/SPARK-21877 > Project: Spark > Issue Type: Bug > Components: Deploy, Windows >Affects Versions: 2.2.0 > Environment: Spark version: spark-2.2.0-bin-hadoop2.7 > Windows version: Windows 10 >Reporter: Xiaokai Zhao > > All the Windows command scripts cannot handle quotes in parameters. > Running a Windows command script with a parameter that contains quotes reproduces the > bug: > {quote} > C:\Users\meng\software\spark-2.2.0-bin-hadoop2.7> bin\spark-shell > --driver-java-options " -Dfile.encoding=utf-8 " > 'C:\Users\meng\software\spark-2.2.0-bin-hadoop2.7\bin\spark-shell2.cmd" > --driver-java-options "' is not recognized as an internal or external command, > operable program or batch file. > {quote} > Windows recognizes "--driver-java-options" as part of the command. > All the Windows command scripts that contain the following code have the bug. > {quote} > cmd /V /E /C "" %* > {quote} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21877) Windows command script can not handle quotes in parameter
Xiaokai Zhao created SPARK-21877: Summary: Windows command script can not handle quotes in parameter Key: SPARK-21877 URL: https://issues.apache.org/jira/browse/SPARK-21877 Project: Spark Issue Type: Bug Components: Deploy, Windows Affects Versions: 2.2.0 Environment: Spark version: spark-2.2.0-bin-hadoop2.7 Windows version: Windows 10 Reporter: Xiaokai Zhao All the Windows command scripts cannot handle quotes in parameters. Running a Windows command script with a parameter that contains quotes reproduces the bug: {quote} C:\Users\meng\software\spark-2.2.0-bin-hadoop2.7> bin\spark-shell --driver-java-options " -Dfile.encoding=utf-8 " 'C:\Users\meng\software\spark-2.2.0-bin-hadoop2.7\bin\spark-shell2.cmd" --driver-java-options "' is not recognized as an internal or external command, operable program or batch file. {quote} Windows recognizes "--driver-java-options" as part of the command. All the Windows command scripts that contain the following code have the bug. {quote} cmd /V /E /C "" %* {quote} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148268#comment-16148268 ] Joseph K. Bradley commented on SPARK-21866: --- It's a valid question, but overall, I'd support this effort. My thoughts: Summary: Image processing use cases have become increasingly important, especially because of the rise of interest in deep learning. It's valuable to standardize around a common format, partly for users and partly for developers. Q: Are images a common data type? I.e., if we were talking about adding support for storing text in Spark DataFrames, there would be no question that Spark must be able to handle text since it is such a common data format. Are images common enough to merit inclusion in Spark? A: I'd argue yes, partly because of the rise in requests around it. But also, if it makes sense for a general purpose language like Java to contain image formats, then it likewise makes sense for a general purpose data processing library like Spark to contain image formats. This does not duplicate functionality from java.awt (or other libraries) since the key elements being added here are Spark-specific: a Spark DataFrame schema and a Spark Data Source. Q: Will leaving this functionality in a package, rather than putting it in Spark, be sufficient? A: I worry that this will limit adoption, as well as community oversight of such a core piece of functionality. Tooling built upon image formats, including image processing algorithms, could live outside of Spark, but basic image loading and saving should IMO live in Spark. Q: Will users really benefit? A: My main reason to support this is confusion I've heard about the right way to handle images in Spark. They are sometimes handled outside of Spark's data model (often giving up proper resilience guarantees), are handled by falling back to the RDD API, etc. I hope that standardization will simplify life for users (clarifying and standardizing APIs) and library developers (facilitating collaboration on image ETL). > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark.pdf > > > h2. Background and motivation > As Apache Spark is being used more and more in the industry, some new use > cases are emerging for different data formats beyond the traditional SQL > types or the numerical types (vectors and matrices). Deep Learning > applications commonly deal with image processing. A number of projects add > some Deep Learning capabilities to Spark (see list below), but they struggle > to communicate with each other or with MLlib pipelines because there is no > standard way to represent an image in Spark DataFrames. We propose to > federate efforts for representing images in Spark by defining a > representation that caters to the most common needs of users and library > developers. > This SPIP proposes a specification to represent images in Spark DataFrames > and Datasets (based on existing industrial standards), and an interface for > loading sources of images. It is not meant to be a full-fledged image > processing library, but rather the core description that other libraries and > users can rely on. 
Several packages already offer various processing > facilities for transforming images or doing more complex operations, and each > has various design tradeoffs that make them better as standalone solutions. > This project is a joint collaboration between Microsoft and Databricks, which > have been testing this design in two open source packages: MMLSpark and Deep > Learning Pipelines. > The proposed image format is an in-memory, decompressed representation that > targets low-level applications. It is significantly more liberal in memory > usage than compressed image representations such as JPEG, PNG, etc., but it > allows easy communication with popular image processing libraries and has no > decoding overhead. > h2. Targets users and personas: > Data scientists, data engineers, library developers. > The following libraries define primitives for loading and representing > images, and will gain from a common interchange format (in alphabetical > order): > * BigDL > * DeepLearning4J > * Deep Learning Pipelines > * MMLSpark > * TensorFlow (Spark connector) > * TensorFlowOnSpark > * TensorFrames > * Thunder > h2. Goals: > * Simple representation of images in Spark DataFrames, based on pre-existing > industrial standards (OpenCV) > * This format should eventually allow the development of high-performance > integration points with ima
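To make the proposal more concrete, the "Spark DataFrame schema" under discussion is essentially a struct column describing one decompressed image per row. The sketch below is illustrative only; the field names and types are assumptions loosely following OpenCV conventions, and the authoritative definition is the attached SPIP document rather than this snippet.
{code}
import org.apache.spark.sql.types._

// One possible shape for an image struct column (not the SPIP's final schema).
val imageStruct = StructType(Seq(
  StructField("origin", StringType, nullable = true),     // where the image was loaded from
  StructField("height", IntegerType, nullable = false),
  StructField("width", IntegerType, nullable = false),
  StructField("nChannels", IntegerType, nullable = false),
  StructField("mode", IntegerType, nullable = false),     // OpenCV-style pixel type code
  StructField("data", BinaryType, nullable = false)       // decompressed pixel bytes
))
{code}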
[jira] [Assigned] (SPARK-21875) Jenkins passes Java code that violates ./dev/lint-java
[ https://issues.apache.org/jira/browse/SPARK-21875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-21875: Assignee: Andrew Ash > Jenkins passes Java code that violates ./dev/lint-java > -- > > Key: SPARK-21875 > URL: https://issues.apache.org/jira/browse/SPARK-21875 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Assignee: Andrew Ash >Priority: Trivial > Fix For: 2.3.0 > > > Two recent PRs merged which caused lint-java errors: > {noformat} > > Running Java style checks > > Using `mvn` from path: /home/ubuntu/spark/build/apache-maven-3.5.0/bin/mvn > Checkstyle checks failed at following occurrences: > [ERROR] src/main/java/org/apache/spark/memory/TaskMemoryManager.java:[77] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[1340] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [error] running /home/ubuntu/spark/dev/lint-java ; received return code 1 > {noformat} > The first error is from https://github.com/apache/spark/pull/19025 and the > second is from https://github.com/apache/spark/pull/18488 > Should we be expecting Jenkins to enforce Java code style pre-commit? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21875) Jenkins passes Java code that violates ./dev/lint-java
[ https://issues.apache.org/jira/browse/SPARK-21875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21875. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19088 [https://github.com/apache/spark/pull/19088] > Jenkins passes Java code that violates ./dev/lint-java > -- > > Key: SPARK-21875 > URL: https://issues.apache.org/jira/browse/SPARK-21875 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Priority: Trivial > Fix For: 2.3.0 > > > Two recent PRs merged which caused lint-java errors: > {noformat} > > Running Java style checks > > Using `mvn` from path: /home/ubuntu/spark/build/apache-maven-3.5.0/bin/mvn > Checkstyle checks failed at following occurrences: > [ERROR] src/main/java/org/apache/spark/memory/TaskMemoryManager.java:[77] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[1340] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [error] running /home/ubuntu/spark/dev/lint-java ; received return code 1 > {noformat} > The first error is from https://github.com/apache/spark/pull/19025 and the > second is from https://github.com/apache/spark/pull/18488 > Should we be expecting Jenkins to enforce Java code style pre-commit? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20928) Continuous Processing Mode for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148227#comment-16148227 ] Michael Armbrust edited comment on SPARK-20928 at 8/30/17 11:52 PM: Hey everyone, thanks for your interest in this feature! I'm still targeting Spark 2.3, but unfortunately have been busy with other things since the summit. Will post more details on the design as soon as we have them! The Spark summit demo just showed a hacked-together prototype, but we need to do more to figure out how to best integrate it into Spark. was (Author: marmbrus): Hey everyone, thanks for your interest in this feature! I'm still targeting Spark 2.3, but unfortunately have been busy with other things since the summit. Will post more details on the design as soon as we have them! > Continuous Processing Mode for Structured Streaming > --- > > Key: SPARK-20928 > URL: https://issues.apache.org/jira/browse/SPARK-20928 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Michael Armbrust > > Given the current Source API, the minimum possible latency for any record is > bounded by the amount of time that it takes to launch a task. This > limitation is a result of the fact that {{getBatch}} requires us to know both > the starting and the ending offset, before any tasks are launched. In the > worst case, the end-to-end latency is actually closer to the average batch > time + task launching time. > For applications where latency is more important than exactly-once output > however, it would be useful if processing could happen continuously. This > would allow us to achieve fully pipelined reading and writing from sources > such as Kafka. This kind of architecture would make it possible to process > records with end-to-end latencies on the order of 1 ms, rather than the > 10-100ms that is possible today. > One possible architecture here would be to change the Source API to look like > the following rough sketch: > {code} > trait Epoch { > def data: DataFrame > /** The exclusive starting position for `data`. */ > def startOffset: Offset > /** The inclusive ending position for `data`. Incrementally updated > during processing, but not complete until execution of the query plan in > `data` is finished. */ > def endOffset: Offset > } > def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], > limits: Limits): Epoch > {code} > The above would allow us to build an alternative implementation of > {{StreamExecution}} that processes continuously with much lower latency and > only stops processing when needing to reconfigure the stream (either due to a > failure or a user requested change in parallelism. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20928) Continuous Processing Mode for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148227#comment-16148227 ] Michael Armbrust commented on SPARK-20928: -- Hey everyone, thanks for your interest in this feature! I'm still targeting Spark 2.3, but unfortunately have been busy with other things since the summit. Will post more details on the design as soon as we have them! > Continuous Processing Mode for Structured Streaming > --- > > Key: SPARK-20928 > URL: https://issues.apache.org/jira/browse/SPARK-20928 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Michael Armbrust > > Given the current Source API, the minimum possible latency for any record is > bounded by the amount of time that it takes to launch a task. This > limitation is a result of the fact that {{getBatch}} requires us to know both > the starting and the ending offset, before any tasks are launched. In the > worst case, the end-to-end latency is actually closer to the average batch > time + task launching time. > For applications where latency is more important than exactly-once output > however, it would be useful if processing could happen continuously. This > would allow us to achieve fully pipelined reading and writing from sources > such as Kafka. This kind of architecture would make it possible to process > records with end-to-end latencies on the order of 1 ms, rather than the > 10-100ms that is possible today. > One possible architecture here would be to change the Source API to look like > the following rough sketch: > {code} > trait Epoch { > def data: DataFrame > /** The exclusive starting position for `data`. */ > def startOffset: Offset > /** The inclusive ending position for `data`. Incrementally updated > during processing, but not complete until execution of the query plan in > `data` is finished. */ > def endOffset: Offset > } > def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], > limits: Limits): Epoch > {code} > The above would allow us to build an alternative implementation of > {{StreamExecution}} that processes continuously with much lower latency and > only stops processing when needing to reconfigure the stream (either due to a > failure or a user requested change in parallelism. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
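Since the Source API change above is only a rough sketch, the following Scala fragment is a purely illustrative sketch of how a continuous execution loop might drive it; {{Offset}}, {{Epoch}}, {{Limits}}, {{getBatch}} and the rest are placeholders built around the proposal text and do not correspond to any existing Spark API.
{code}
import org.apache.spark.sql.DataFrame

// Placeholder types mirroring the proposal above; none of these exist in Spark as written.
trait Offset
case class Limits(maxRecords: Long)   // the proposal leaves Limits unspecified

trait Epoch {
  def data: DataFrame
  def startOffset: Offset   // exclusive start, known up front
  def endOffset: Offset     // inclusive end, only final once the epoch finishes
}

trait ContinuousSource {
  // An open-ended read: endOffset = None would mean "keep reading until reconfigured".
  def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], limits: Limits): Epoch
}

// Minimal sketch of an alternative StreamExecution-style loop: each epoch is processed
// as it streams in, and the loop only stops when the stream must be reconfigured
// (a failure or a user-requested change in parallelism).
def runContinuously(
    source: ContinuousSource,
    process: DataFrame => Unit,
    needsReconfiguration: () => Boolean): Unit = {
  var lastCommitted: Option[Offset] = None
  while (!needsReconfiguration()) {
    val epoch = source.getBatch(lastCommitted, None, Limits(maxRecords = 1000000L))
    process(epoch.data)                    // no per-batch task launch on the hot path
    lastCommitted = Some(epoch.endOffset)  // endOffset is final once processing completes
  }
}
{code}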
[jira] [Resolved] (SPARK-21839) Support SQL config for ORC compression
[ https://issues.apache.org/jira/browse/SPARK-21839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21839. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19055 [https://github.com/apache/spark/pull/19055] > Support SQL config for ORC compression > --- > > Key: SPARK-21839 > URL: https://issues.apache.org/jira/browse/SPARK-21839 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun > Fix For: 2.3.0 > > > This issue aims to provide `spark.sql.orc.compression.codec` like > `spark.sql.parquet.compression.codec`. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
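A minimal usage sketch of the proposed option, assuming the final config key and codec names match the ones referenced in the issue ({{spark.sql.orc.compression.codec}}, mirroring {{spark.sql.parquet.compression.codec}}); the data and output paths are placeholders.
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-compression-demo").getOrCreate()

// Session-wide default codec for ORC output, analogous to the Parquet config.
spark.conf.set("spark.sql.orc.compression.codec", "zlib")

val df = spark.range(0, 1000).toDF("id")   // placeholder data
df.write.format("orc").save("/tmp/orc_zlib_demo")

// A per-write option, if given, would be expected to override the session default.
df.write.format("orc").option("compression", "snappy").save("/tmp/orc_snappy_demo")
{code}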
[jira] [Assigned] (SPARK-21839) Support SQL config for ORC compression
[ https://issues.apache.org/jira/browse/SPARK-21839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-21839: Assignee: Dongjoon Hyun > Support SQL config for ORC compression > --- > > Key: SPARK-21839 > URL: https://issues.apache.org/jira/browse/SPARK-21839 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun > Fix For: 2.3.0 > > > This issue aims to provide `spark.sql.orc.compression.codec` like > `spark.sql.parquet.compression.codec`. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148093#comment-16148093 ] Marcelo Vanzin commented on SPARK-18085: [~jincheng] do you have code that can reproduce that? The code in the exception hasn't really changed, so this is probably an artifact of how the new code is recording data from the application. For your description (when we clicking stages with no tasks successful) I haven't been able to reproduce this; a stage that has only failed tasks still renders fine with the code in my branch. > SPIP: Better History Server scalability for many / large applications > - > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Labels: SPIP > Attachments: spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21867) Support async spilling in UnsafeShuffleWriter
[ https://issues.apache.org/jira/browse/SPARK-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148090#comment-16148090 ] Reynold Xin commented on SPARK-21867: - This makes sense. The devil is in the details though (e.g. how complicated it is). Can you create a PR for your prototype code just to illustrate the implementation more? > Support async spilling in UnsafeShuffleWriter > - > > Key: SPARK-21867 > URL: https://issues.apache.org/jira/browse/SPARK-21867 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Priority: Minor > > Currently, Spark tasks are single-threaded. But we see that it could greatly > improve the performance of jobs if we can multi-thread some parts of them. > For example, profiling our map tasks, which read large amounts of data from > HDFS and spill to disk, we see that we are blocked on HDFS reads and spilling > the majority of the time. Since both of these operations are IO-intensive, the > average CPU consumption during the map phase is significantly low. In theory, > both HDFS read and spilling can be done in parallel if we had additional > memory to store data read from HDFS while we are spilling the last batch read. > Let's say we have 1G of shuffle memory available per task. Currently, in the case > of a map task, it reads from HDFS and the records are stored in the available > memory buffer. Once we hit the memory limit and there is no more space to > store the records, we sort and spill the content to disk. While we are > spilling to disk, since we do not have any available memory, we cannot read > from HDFS concurrently. > Here we propose supporting async spilling for UnsafeShuffleWriter, so that we > can keep reading from HDFS while the sort and spill happen > asynchronously. Let's say the total 1G of shuffle memory can be split into > two regions - an active region and a spilling region - each of size 500 MB. We > start with reading from HDFS and filling the active region. Once we hit the > limit of the active region, we issue an asynchronous spill, flipping the > active region and the spilling region. While the spill is happening > asynchronously, we still have 500 MB of memory available to read data > from HDFS. This way we can amortize the high disk/network IO cost during > spilling. > We made a prototype hack to implement this feature and we saw our map > tasks run as much as 40% faster. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
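The double-buffering idea above is only a proposal backed by a prototype hack, so here is a purely illustrative Scala sketch of it (two fixed-size regions that are flipped when the active one fills, with the spill running on a background thread); none of these names correspond to Spark's actual UnsafeShuffleWriter internals.
{code}
import java.util.concurrent.{Executors, Future}
import scala.collection.mutable.ArrayBuffer

// Toy stand-in for one of the two 500 MB regions described in the proposal.
final class Region(val capacityBytes: Long) {
  val records = new ArrayBuffer[Array[Byte]]()
  private var used = 0L
  def add(r: Array[Byte]): Unit = { records += r; used += r.length }
  def isFull: Boolean = used >= capacityBytes
  def clear(): Unit = { records.clear(); used = 0L }
}

// Illustrative double-buffered writer: ingest fills the active region while the
// previously filled region is sorted and spilled asynchronously.
final class DoubleBufferedWriter(regionBytes: Long) {
  private val spillPool = Executors.newSingleThreadExecutor()
  private var active = new Region(regionBytes)
  private var spilling = new Region(regionBytes)
  private var pendingSpill: Option[Future[_]] = None

  def insert(record: Array[Byte]): Unit = {
    if (active.isFull) {
      // Wait for the previous async spill (if any) so its region can be reused.
      pendingSpill.foreach(_.get())
      // Flip the regions: the full one becomes the spilling region.
      val tmp = spilling; spilling = active; active = tmp
      val toSpill = spilling
      pendingSpill = Some(spillPool.submit(new Runnable {
        override def run(): Unit = {
          // Placeholder for the expensive sort + disk write we want off the ingest path.
          val sorted = toSpill.records.sortBy(_.length)
          // ... write `sorted` to disk here ...
          toSpill.clear()
        }
      }))
    }
    active.add(record)   // ingest (e.g. the HDFS read) continues while the spill runs
  }

  def close(): Unit = { pendingSpill.foreach(_.get()); spillPool.shutdown() }
}
{code}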
[jira] [Updated] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-21866: -- Target Version/s: (was: 2.3.0) > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark.pdf > > > h2. Background and motivation > As Apache Spark is being used more and more in the industry, some new use > cases are emerging for different data formats beyond the traditional SQL > types or the numerical types (vectors and matrices). Deep Learning > applications commonly deal with image processing. A number of projects add > some Deep Learning capabilities to Spark (see list below), but they struggle > to communicate with each other or with MLlib pipelines because there is no > standard way to represent an image in Spark DataFrames. We propose to > federate efforts for representing images in Spark by defining a > representation that caters to the most common needs of users and library > developers. > This SPIP proposes a specification to represent images in Spark DataFrames > and Datasets (based on existing industrial standards), and an interface for > loading sources of images. It is not meant to be a full-fledged image > processing library, but rather the core description that other libraries and > users can rely on. Several packages already offer various processing > facilities for transforming images or doing more complex operations, and each > has various design tradeoffs that make them better as standalone solutions. > This project is a joint collaboration between Microsoft and Databricks, which > have been testing this design in two open source packages: MMLSpark and Deep > Learning Pipelines. > The proposed image format is an in-memory, decompressed representation that > targets low-level applications. It is significantly more liberal in memory > usage than compressed image representations such as JPEG, PNG, etc., but it > allows easy communication with popular image processing libraries and has no > decoding overhead. > h2. Targets users and personas: > Data scientists, data engineers, library developers. > The following libraries define primitives for loading and representing > images, and will gain from a common interchange format (in alphabetical > order): > * BigDL > * DeepLearning4J > * Deep Learning Pipelines > * MMLSpark > * TensorFlow (Spark connector) > * TensorFlowOnSpark > * TensorFrames > * Thunder > h2. Goals: > * Simple representation of images in Spark DataFrames, based on pre-existing > industrial standards (OpenCV) > * This format should eventually allow the development of high-performance > integration points with image processing libraries such as libOpenCV, Google > TensorFlow, CNTK, and other C libraries. > * The reader should be able to read popular formats of images from > distributed sources. > h2. Non-Goals: > Images are a versatile medium and encompass a very wide range of formats and > representations. 
This SPIP explicitly aims at the most common use case in the > industry currently: multi-channel matrices of binary, int32, int64, float or > double data that can fit comfortably in the heap of the JVM: > * the total size of an image should be restricted to less than 2GB (roughly) > * the meaning of color channels is application-specific and is not mandated > by the standard (in line with the OpenCV standard) > * specialized formats used in meteorology, the medical field, etc. are not > supported > * this format is specialized to images and does not attempt to solve the more > general problem of representing n-dimensional tensors in Spark > h2. Proposed API changes > We propose to add a new package in the package structure, under the MLlib > project: > {{org.apache.spark.image}} > h3. Data format > We propose to add the following structure: > imageSchema = StructType([ > * StructField("mode", StringType(), False), > ** The exact representation of the data. > ** The values are described in the following OpenCV convention. Basically, > the type has both "depth" and "number of channels" info: in particular, type > "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 > (value 32 in the table) with the channel order specified by convention. > ** The exact channel ordering and meaning of each channel is dictated by > convention. By default, the order is RGB (3 channels) and BGRA (4 channels). > If the image failed to load, the value is the empty string "". > * StructField("origin", StringType(), True), > ** Some information about the origin of the image. The
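The description above is cut off after the {{origin}} field, so the following Scala fragment is only a sketch of what such a schema could look like; only {{mode}} and {{origin}} are spelled out in the SPIP text here, and the remaining fields (dimensions, channel count, raw bytes) are assumptions added for illustration rather than the proposal's final layout.
{code}
import org.apache.spark.sql.types._

// Sketch of an OpenCV-style image schema. Only `mode` and `origin` come from the
// SPIP text above; the other fields are assumed for illustration.
val imageSchema: StructType = StructType(Seq(
  StructField("mode", StringType, nullable = false),       // e.g. "CV_8UC3" = 3-channel unsigned bytes
  StructField("origin", StringType, nullable = true),      // where the image was loaded from
  StructField("height", IntegerType, nullable = false),    // assumed
  StructField("width", IntegerType, nullable = false),     // assumed
  StructField("nChannels", IntegerType, nullable = false), // assumed
  StructField("data", BinaryType, nullable = false)        // assumed: decompressed pixel bytes
))
{code}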
[jira] [Resolved] (SPARK-21834) Incorrect executor request in case of dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-21834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-21834. Resolution: Fixed Assignee: Sital Kedia Fix Version/s: 2.3.0 2.2.1 2.1.2 > Incorrect executor request in case of dynamic allocation > > > Key: SPARK-21834 > URL: https://issues.apache.org/jira/browse/SPARK-21834 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Assignee: Sital Kedia > Fix For: 2.1.2, 2.2.1, 2.3.0 > > > The killExecutor API currently does not allow killing an executor without > updating the total number of executors needed. When dynamic allocation > is turned on and the allocator tries to kill an executor, the scheduler > reduces the total number of executors needed (see > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L635) > which is incorrect because the allocator already takes care of setting the > required number of executors itself. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
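To make the bookkeeping problem concrete, here is a toy Scala sketch; these are not Spark's scheduler classes, and the {{adjustTargetNumExecutors}} flag is a hypothetical illustration of the kind of separation described above. If the backend also decrements the target when the allocation manager has already lowered its own target, the cluster ends up requesting one executor too few.
{code}
// Toy model of the target-executor bookkeeping; none of these are real Spark classes.
class ToySchedulerBackend(var requestedTotal: Int) {
  def killExecutor(id: String, adjustTargetNumExecutors: Boolean): Unit = {
    if (adjustTargetNumExecutors) {
      requestedTotal -= 1   // shrink the cluster along with the kill
    }
    println(s"killed $id, backend now requests $requestedTotal executors")
  }
}

class ToyAllocationManager(backend: ToySchedulerBackend) {
  private var target = backend.requestedTotal

  def removeIdleExecutor(id: String): Unit = {
    target -= 1                       // the allocator already accounts for the removal...
    backend.requestedTotal = target   // ...and pushes its own target to the backend,
    // ...so the backend must not decrement the total a second time.
    backend.killExecutor(id, adjustTargetNumExecutors = false)
  }
}

object KillExecutorDemo extends App {
  val backend = new ToySchedulerBackend(requestedTotal = 10)
  new ToyAllocationManager(backend).removeIdleExecutor("exec-3")
  assert(backend.requestedTotal == 9)  // with the double decrement it would wrongly be 8
}
{code}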
[jira] [Created] (SPARK-21876) Idling Executors that never handled any tasks are not cleared from BlockManager after being removed
Julie Zhang created SPARK-21876: --- Summary: Idling Executors that never handled any tasks are not cleared from BlockManager after being removed Key: SPARK-21876 URL: https://issues.apache.org/jira/browse/SPARK-21876 Project: Spark Issue Type: Bug Components: Scheduler, Spark Core Affects Versions: 2.2.0, 1.6.3 Reporter: Julie Zhang This happens when 'spark.dynamicAllocation.enabled' is set to 'true'. We use Yarn as our resource manager. 1) Executor A is launched, but no task has been submitted to it; 2) After 'spark.dynamicAllocation.executorIdleTimeout' seconds, executor A will be removed. (ExecutorAllocationManager.scala schedule(): 294); 3) The scheduler gets notified that executor A has been lost; (in our case, YarnSchedulerBackend.scala: 209). In the TaskSchedulerImpl.scala method executorLost(executorId: String, reason: ExecutorLossReason), the assumption in the None case (TaskSchedulerImpl.scala: 548) that the executor has already been removed is not always valid. As a result, the DAGScheduler and BlockManagerMaster are never notified about the loss of executor A. When GC eventually happens, the ContextCleaner will try to clean up un-referenced objects. Because executor A was not removed from the blockManagerIdByExecutor map, BlockManagerMasterEndpoint will send out requests to clean the references to the non-existent executor, producing a lot of error messages like this in the driver log: ERROR [2017-08-08 00:00:23,596] org.apache.spark.network.client.TransportClient: Failed to send RPC xxx to xxx/xxx:x: java.nio.channels.ClosedChannelException -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
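A minimal Scala configuration sketch for the preconditions described in this report: dynamic allocation enabled with a short idle timeout, so that an executor that never receives a task is removed quickly. The specific values and app name are placeholders.
{code}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Dynamic allocation with a short idle timeout; on YARN this also requires the
// external shuffle service.
val conf = new SparkConf()
  .setAppName("idle-executor-repro")                          // placeholder
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")  // placeholder timeout
  .set("spark.shuffle.service.enabled", "true")

val spark = SparkSession.builder().config(conf).getOrCreate()
{code}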
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147987#comment-16147987 ] Apache Spark commented on SPARK-21728: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/19089 > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > Attachments: logging.patch, sparksubmit.patch > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21728: Assignee: Apache Spark (was: Marcelo Vanzin) > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > Fix For: 2.3.0 > > Attachments: logging.patch, sparksubmit.patch > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21728: Assignee: Marcelo Vanzin (was: Apache Spark) > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > Attachments: logging.patch, sparksubmit.patch > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21875) Jenkins passes Java code that violates ./dev/lint-java
[ https://issues.apache.org/jira/browse/SPARK-21875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21875: Assignee: (was: Apache Spark) > Jenkins passes Java code that violates ./dev/lint-java > -- > > Key: SPARK-21875 > URL: https://issues.apache.org/jira/browse/SPARK-21875 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Priority: Trivial > > Two recent PRs merged which caused lint-java errors: > {noformat} > > Running Java style checks > > Using `mvn` from path: /home/ubuntu/spark/build/apache-maven-3.5.0/bin/mvn > Checkstyle checks failed at following occurrences: > [ERROR] src/main/java/org/apache/spark/memory/TaskMemoryManager.java:[77] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[1340] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [error] running /home/ubuntu/spark/dev/lint-java ; received return code 1 > {noformat} > The first error is from https://github.com/apache/spark/pull/19025 and the > second is from https://github.com/apache/spark/pull/18488 > Should we be expecting Jenkins to enforce Java code style pre-commit? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21875) Jenkins passes Java code that violates ./dev/lint-java
[ https://issues.apache.org/jira/browse/SPARK-21875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147976#comment-16147976 ] Andrew Ash commented on SPARK-21875: I'd be interested in more details on why it can't be run in the PR builder -- I have the full `./dev/run-tests` running in CI and it catches things like this occasionally. PR at https://github.com/apache/spark/pull/19088 > Jenkins passes Java code that violates ./dev/lint-java > -- > > Key: SPARK-21875 > URL: https://issues.apache.org/jira/browse/SPARK-21875 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Priority: Trivial > > Two recent PRs merged which caused lint-java errors: > {noformat} > > Running Java style checks > > Using `mvn` from path: /home/ubuntu/spark/build/apache-maven-3.5.0/bin/mvn > Checkstyle checks failed at following occurrences: > [ERROR] src/main/java/org/apache/spark/memory/TaskMemoryManager.java:[77] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[1340] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [error] running /home/ubuntu/spark/dev/lint-java ; received return code 1 > {noformat} > The first error is from https://github.com/apache/spark/pull/19025 and the > second is from https://github.com/apache/spark/pull/18488 > Should we be expecting Jenkins to enforce Java code style pre-commit? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21875) Jenkins passes Java code that violates ./dev/lint-java
[ https://issues.apache.org/jira/browse/SPARK-21875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21875: Assignee: Apache Spark > Jenkins passes Java code that violates ./dev/lint-java > -- > > Key: SPARK-21875 > URL: https://issues.apache.org/jira/browse/SPARK-21875 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Assignee: Apache Spark >Priority: Trivial > > Two recent PRs merged which caused lint-java errors: > {noformat} > > Running Java style checks > > Using `mvn` from path: /home/ubuntu/spark/build/apache-maven-3.5.0/bin/mvn > Checkstyle checks failed at following occurrences: > [ERROR] src/main/java/org/apache/spark/memory/TaskMemoryManager.java:[77] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[1340] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [error] running /home/ubuntu/spark/dev/lint-java ; received return code 1 > {noformat} > The first error is from https://github.com/apache/spark/pull/19025 and the > second is from https://github.com/apache/spark/pull/18488 > Should we be expecting Jenkins to enforce Java code style pre-commit? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21875) Jenkins passes Java code that violates ./dev/lint-java
[ https://issues.apache.org/jira/browse/SPARK-21875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147975#comment-16147975 ] Apache Spark commented on SPARK-21875: -- User 'ash211' has created a pull request for this issue: https://github.com/apache/spark/pull/19088 > Jenkins passes Java code that violates ./dev/lint-java > -- > > Key: SPARK-21875 > URL: https://issues.apache.org/jira/browse/SPARK-21875 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Priority: Trivial > > Two recent PRs merged which caused lint-java errors: > {noformat} > > Running Java style checks > > Using `mvn` from path: /home/ubuntu/spark/build/apache-maven-3.5.0/bin/mvn > Checkstyle checks failed at following occurrences: > [ERROR] src/main/java/org/apache/spark/memory/TaskMemoryManager.java:[77] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[1340] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [error] running /home/ubuntu/spark/dev/lint-java ; received return code 1 > {noformat} > The first error is from https://github.com/apache/spark/pull/19025 and the > second is from https://github.com/apache/spark/pull/18488 > Should we be expecting Jenkins to enforce Java code style pre-commit? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-21728: Attachment: logging.patch sparksubmit.patch > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > Attachments: logging.patch, sparksubmit.patch > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147957#comment-16147957 ] Jacek Laskowski commented on SPARK-21728: - After I modified your change, I could see the logs again. No idea whether my changes make sense or not, but I see the logs and that counts :) I'm attaching my changes in case they help you somehow. > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reopened SPARK-21728: > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147944#comment-16147944 ] Marcelo Vanzin commented on SPARK-21728: Ok, with your file and the streaming example from Spark docs I see it's picking up the wrong log level. I also ran into a separate issue with my patch; let me re-open this and fix these issues. > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21714) SparkSubmit in Yarn Client mode downloads remote files and then reuploads them again
[ https://issues.apache.org/jira/browse/SPARK-21714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-21714: --- Fix Version/s: 2.2.1 > SparkSubmit in Yarn Client mode downloads remote files and then reuploads > them again > > > Key: SPARK-21714 > URL: https://issues.apache.org/jira/browse/SPARK-21714 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.2.0 >Reporter: Thomas Graves >Assignee: Saisai Shao >Priority: Critical > Fix For: 2.2.1, 2.3.0 > > > SPARK-10643 added the ability for spark-submit to download remote file in > client mode. > However in yarn mode this introduced a bug where it downloads them for the > client but then yarn client just reuploads them to HDFS and uses them again. > This should not happen when the remote file is HDFS. This is wasting > resources and its defeating the distributed cache because if the original > object was public it would have been shared by many users. By us downloading > and reuploading, it becomes private. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21875) Jenkins passes Java code that violates ./dev/lint-java
[ https://issues.apache.org/jira/browse/SPARK-21875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21875: -- Priority: Trivial (was: Major) Issue Type: Improvement (was: Bug) (Certainly not a Major Bug) Yes, for Historical Reasons I can't find a link to, we can't run it in the PR builder. We just fix these periodically. Go ahead. > Jenkins passes Java code that violates ./dev/lint-java > -- > > Key: SPARK-21875 > URL: https://issues.apache.org/jira/browse/SPARK-21875 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Priority: Trivial > > Two recent PRs merged which caused lint-java errors: > {noformat} > > Running Java style checks > > Using `mvn` from path: /home/ubuntu/spark/build/apache-maven-3.5.0/bin/mvn > Checkstyle checks failed at following occurrences: > [ERROR] src/main/java/org/apache/spark/memory/TaskMemoryManager.java:[77] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[1340] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [error] running /home/ubuntu/spark/dev/lint-java ; received return code 1 > {noformat} > The first error is from https://github.com/apache/spark/pull/19025 and the > second is from https://github.com/apache/spark/pull/18488 > Should we be expecting Jenkins to enforce Java code style pre-commit? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21875) Jenkins passes Java code that violates ./dev/lint-java
Andrew Ash created SPARK-21875: -- Summary: Jenkins passes Java code that violates ./dev/lint-java Key: SPARK-21875 URL: https://issues.apache.org/jira/browse/SPARK-21875 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.2.0 Reporter: Andrew Ash Two recent PRs merged which caused lint-java errors: {noformat} Running Java style checks Using `mvn` from path: /home/ubuntu/spark/build/apache-maven-3.5.0/bin/mvn Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/memory/TaskMemoryManager.java:[77] (sizes) LineLength: Line is longer than 100 characters (found 106). [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[1340] (sizes) LineLength: Line is longer than 100 characters (found 106). [error] running /home/ubuntu/spark/dev/lint-java ; received return code 1 {noformat} The first error is from https://github.com/apache/spark/pull/19025 and the second is from https://github.com/apache/spark/pull/18488 Should we be expecting Jenkins to enforce Java code style pre-commit? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21841) Spark SQL doesn't pick up column added in hive when table created with saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-21841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147873#comment-16147873 ] Marcelo Vanzin commented on SPARK-21841: Good to know there's a way to say "I want a proper Hive table" in 2.2, even if the API is a little confusing for the user. Too many people just use {{saveAsTable}} without really understanding what it means for Hive compatibility. It might even make more sense not to try to save a Hive-compatible table for other formats, although that might have backwards compatibility issues. > Spark SQL doesn't pick up column added in hive when table created with > saveAsTable > -- > > Key: SPARK-21841 > URL: https://issues.apache.org/jira/browse/SPARK-21841 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Thomas Graves > > If you create a table in Spark SQL but then modify the table in Hive to > add a column, Spark SQL doesn't pick up the new column. > Basic example: > {code} > t1 = spark.sql("select ip_address from mydb.test_table limit 1") > t1.show() > ++ > | ip_address| > ++ > |1.30.25.5| > ++ > t1.write.saveAsTable('mydb.t1') > In Hive: > alter table mydb.t1 add columns (bcookie string) > t1 = spark.table("mydb.t1") > t1.show() > ++ > | ip_address| > ++ > |1.30.25.5| > ++ > {code} > It looks like that's because Spark SQL is picking up the schema from > spark.sql.sources.schema.part.0 rather than from Hive. > Interestingly enough, it appears that if you create the table differently, like: > spark.sql("create table mydb.t1 select ip_address from mydb.test_table limit > 1") > Run your alter table on mydb.t1 > val t1 = spark.table("mydb.t1") > Then it works properly. > It looks like the difference is that when it doesn't work, > spark.sql.sources.provider=parquet is set. > It's doing this from createDataSourceTable where the provider is parquet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
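A hedged Scala sketch of the two write paths being contrasted here; the {{format("hive")}} call is assumed to be the "2.2 way to ask for a proper Hive table" alluded to above, and the table and column names are taken from the report.
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// As in the report above: a DataFrame backed by an existing table.
val t1 = spark.sql("select ip_address from mydb.test_table limit 1")

// Data-source path: the schema is kept in table properties
// (spark.sql.sources.schema.*), so a later Hive-side ALTER TABLE ... ADD COLUMNS
// is not visible to Spark.
t1.write.saveAsTable("mydb.t1")

// Hive-compatible path (assumed syntax): the schema lives in the Hive metastore,
// so columns added from Hive should show up in Spark as well.
t1.write.format("hive").saveAsTable("mydb.t1_hive")
{code}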
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147840#comment-16147840 ] Jacek Laskowski commented on SPARK-21728: - The idea behind the custom {{conf/log4j.properties}} is to disable all the logging and enable only {{org.apache.spark.sql.execution.streaming}} currently. {code} $ cat conf/log4j.properties # Set everything to be logged to the console log4j.rootCategory=OFF, console log4j.appender.console=org.apache.log4j.ConsoleAppender log4j.appender.console.target=System.err log4j.appender.console.layout=org.apache.log4j.PatternLayout log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n # Set the default spark-shell log level to WARN. When running the spark-shell, the # log level for this class is used to overwrite the root logger's log level, so that # the user can have different defaults for the shell and regular Spark apps. log4j.logger.org.apache.spark.repl.Main=WARN # Settings to quiet third party logs that are too verbose log4j.logger.org.spark_project.jetty=WARN log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO log4j.logger.org.apache.parquet=ERROR log4j.logger.parquet=ERROR # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR #log4j.logger.org.apache.spark=OFF log4j.logger.org.apache.spark.metrics.MetricsSystem=WARN # Structured Streaming log4j.logger.org.apache.spark.sql.execution.streaming.StreamExecution=DEBUG log4j.logger.org.apache.spark.sql.execution.streaming.ProgressReporter=INFO log4j.logger.org.apache.spark.sql.execution.streaming.RateStreamSource=DEBUG log4j.logger.org.apache.spark.sql.kafka010.KafkaSource=DEBUG log4j.logger.org.apache.spark.sql.kafka010.KafkaOffsetReader=DEBUG log4j.logger.org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec=INFO {code} > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. 
For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21865) simplify the distribution semantic of Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-21865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-21865: Summary: simplify the distribution semantic of Spark SQL (was: remove Partitioning.compatibleWith) > simplify the distribution semantic of Spark SQL > --- > > Key: SPARK-21865 > URL: https://issues.apache.org/jira/browse/SPARK-21865 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147655#comment-16147655 ] Marcelo Vanzin commented on SPARK-21728: Can you share your whole command line? I just added this to my {{conf/log4j.properties}}: {noformat} log4j.logger.org.apache.spark.deploy=DEBUG {noformat} And ran: {noformat} ./bin/spark-shell --master 'local-cluster[1,1,1024]' {noformat} And I see the log messages that I could not see without the custom setting. > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147643#comment-16147643 ] Jacek Laskowski commented on SPARK-21728: - Thanks [~vanzin] for the prompt response! I'm stuck with the change as {{conf/log4j.properties}} has no effect on logging and given the change touched it I think it's the root cause (I might be mistaken, but looking for help to find it). The following worked fine two days ago (not sure about yesterday's build). Is {{conf/log4j.properties}} still the file for logging? {code} # Structured Streaming log4j.logger.org.apache.spark.sql.execution.streaming.StreamExecution=DEBUG log4j.logger.org.apache.spark.sql.execution.streaming.ProgressReporter=INFO log4j.logger.org.apache.spark.sql.execution.streaming.RateStreamSource=DEBUG log4j.logger.org.apache.spark.sql.kafka010.KafkaSource=DEBUG log4j.logger.org.apache.spark.sql.kafka010.KafkaOffsetReader=DEBUG {code} > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147617#comment-16147617 ] Marcelo Vanzin commented on SPARK-21728: What is user-visible here (other than potentially some more log messages popping up)? Logging should work just like before as far as the user is concerned. Their customized logging configuration and everything should still work; if that doesn't work it's a bug (although I remember testing that). > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21055) Support grouping__id
[ https://issues.apache.org/jira/browse/SPARK-21055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147433#comment-16147433 ] Apache Spark commented on SPARK-21055: -- User 'cenyuhai' has created a pull request for this issue: https://github.com/apache/spark/pull/19087 > Support grouping__id > > > Key: SPARK-21055 > URL: https://issues.apache.org/jira/browse/SPARK-21055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: spark2.1.1 >Reporter: cen yuhai > > Currently, Spark doesn't support grouping__id; Spark provides another function, > grouping_id(), as a workaround. > If we have to use grouping_id(), many scripts need to change, and supporting > grouping__id is very easy, so why not? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
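A small sketch of the workaround mentioned above, using Spark's existing {{grouping_id()}} and aliasing it so downstream scripts that expect a grouping__id column keep working; the {{employees}} data is hypothetical.
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("grouping-id-demo").getOrCreate()
import spark.implicits._

// Hypothetical data standing in for a Hive table.
Seq(("eng", "f"), ("eng", "m"), ("hr", "f"))
  .toDF("dept", "gender")
  .createOrReplaceTempView("employees")

spark.sql("""
  SELECT dept, gender,
         grouping_id() AS grouping__id,   -- Spark's replacement for Hive's GROUPING__ID
         count(*) AS cnt
  FROM employees
  GROUP BY dept, gender GROUPING SETS ((dept, gender), (dept), ())
""").show()
{code}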
[jira] [Commented] (SPARK-21874) Support changing database when rename table.
[ https://issues.apache.org/jira/browse/SPARK-21874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147429#comment-16147429 ] Apache Spark commented on SPARK-21874: -- User 'jinxing64' has created a pull request for this issue: https://github.com/apache/spark/pull/19086 > Support changing database when rename table. > > > Key: SPARK-21874 > URL: https://issues.apache.org/jira/browse/SPARK-21874 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: jin xing > > Changing the database of a table by renaming it is widely used in `Hive`. We can try to add > this function. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
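An illustration of the Hive behaviour this issue wants to mirror: a rename whose target lives in a different database. At the time of this report Spark rejects such a statement, so the snippet below shows the desired end state rather than something current Spark accepts; the database and table names are placeholders.
{code}
// Desired behaviour (as in Hive): move `logs` from the staging database to prod
// by renaming it. Before this change, Spark's ALTER TABLE ... RENAME TO fails
// when the target database differs from the source database.
// Assumes an active SparkSession `spark` with Hive support.
spark.sql("ALTER TABLE staging.logs RENAME TO prod.logs")
{code}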
[jira] [Assigned] (SPARK-21874) Support changing database when rename table.
[ https://issues.apache.org/jira/browse/SPARK-21874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21874: Assignee: (was: Apache Spark) > Support changing database when rename table. > > > Key: SPARK-21874 > URL: https://issues.apache.org/jira/browse/SPARK-21874 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: jin xing > > Changing a table's database by renaming it is widely used in `Hive`. We can try to add > this function. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21874) Support changing database when rename table.
[ https://issues.apache.org/jira/browse/SPARK-21874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21874: Assignee: Apache Spark > Support changing database when rename table. > > > Key: SPARK-21874 > URL: https://issues.apache.org/jira/browse/SPARK-21874 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: jin xing >Assignee: Apache Spark > > Changing a table's database by renaming it is widely used in `Hive`. We can try to add > this function. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21453) Cached Kafka consumer may be closed too early
[ https://issues.apache.org/jira/browse/SPARK-21453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147419#comment-16147419 ] Pablo Panero commented on SPARK-21453: -- [~zsxwing] Failed again but not even there is conext around it. Will try running in local for long. Maybe Im missing something {code} 17/08/29 04:47:45 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. 17/08/29 04:47:45 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. 17/08/29 04:47:45 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. 17/08/29 10:41:13 WARN SslTransportLayer: Failed to send SSL Close message java.io.IOException: Broken pipe at sun.nio.ch.FileDispatcherImpl.write0(Native Method) at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) at sun.nio.ch.IOUtil.write(IOUtil.java:65) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) at org.apache.kafka.common.network.SslTransportLayer.flush(SslTransportLayer.java:195) at org.apache.kafka.common.network.SslTransportLayer.close(SslTransportLayer.java:163) at org.apache.kafka.common.utils.Utils.closeAll(Utils.java:731) at org.apache.kafka.common.network.KafkaChannel.close(KafkaChannel.java:54) at org.apache.kafka.common.network.Selector.doClose(Selector.java:540) at org.apache.kafka.common.network.Selector.close(Selector.java:531) at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:378) at org.apache.kafka.common.network.Selector.poll(Selector.java:303) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:349) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:226) at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1047) at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:995) at org.apache.spark.sql.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:298) at org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:206) at org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:117) at org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106) at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:85) at org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68) at org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106) at org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:157) at org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:148) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at 
org.apache.spark.sql.kafka010.KafkaWriteTask.execute(KafkaWriteTask.scala:47) at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$1.apply$mcV$sp(KafkaWriter.scala:91) at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$1.apply(KafkaWriter.scala:91) at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$1.apply(KafkaWriter.scala:91) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337) at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(KafkaWriter.scala:91) at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(KafkaWriter.scala:89) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) at org
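Since the log above is truncated, here is a minimal, hedged sketch of the kind of Kafka-to-Kafka Structured Streaming job whose read path (CachedKafkaConsumer) and write path (KafkaWriteTask) both appear in the stack trace; brokers, topics, and the checkpoint location are placeholders.
{code}
// Placeholder servers, topics and paths -- a minimal query of the shape above.
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9093")
  .option("subscribe", "input-topic")
  .load()

val query = input
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9093")
  .option("topic", "output-topic")
  .option("checkpointLocation", "/tmp/checkpoints/kafka-copy")
  .start()
{code}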
[jira] [Created] (SPARK-21874) Support changing database when rename table.
jin xing created SPARK-21874: Summary: Support changing database when rename table. Key: SPARK-21874 URL: https://issues.apache.org/jira/browse/SPARK-21874 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: jin xing Changing a table's database by renaming it is widely used in `Hive`. We can try to add this function. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18859) Catalyst codegen does not mark column as nullable when it should. Causes NPE
[ https://issues.apache.org/jira/browse/SPARK-18859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147393#comment-16147393 ] Jonas Fonseca commented on SPARK-18859: --- For postgresql using {{nullif}} can be used as a workaround: {noformat} ( select t.id, t.age, nullif(j.name, null) as name from masterdata.testtable t left join masterdata.jointable j on t.id = j.id ) as testtable; {noformat} Tested with org.postgresql:postgresql:42.0.0 > Catalyst codegen does not mark column as nullable when it should. Causes NPE > > > Key: SPARK-18859 > URL: https://issues.apache.org/jira/browse/SPARK-18859 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.0.2 >Reporter: Mykhailo Osypov >Priority: Critical > > When joining two tables via LEFT JOIN, columns in right table may be NULLs, > however catalyst codegen cannot recognize it. > Example: > {code:title=schema.sql|borderStyle=solid} > create table masterdata.testtable( > id int not null, > age int > ); > create table masterdata.jointable( > id int not null, > name text not null > ); > {code} > {code:title=query_to_select.sql|borderStyle=solid} > (select t.id, t.age, j.name from masterdata.testtable t left join > masterdata.jointable j on t.id = j.id) as testtable; > {code} > {code:title=master code|borderStyle=solid} > val df = sqlContext > .read > .format("jdbc") > .option("dbTable", "query to select") > > .load > //df generated schema > /* > root > |-- id: integer (nullable = false) > |-- age: integer (nullable = true) > |-- name: string (nullable = false) > */ > {code} > {code:title=Codegen|borderStyle=solid} > /* 038 */ scan_rowWriter.write(0, scan_value); > /* 039 */ > /* 040 */ if (scan_isNull1) { > /* 041 */ scan_rowWriter.setNullAt(1); > /* 042 */ } else { > /* 043 */ scan_rowWriter.write(1, scan_value1); > /* 044 */ } > /* 045 */ > /* 046 */ scan_rowWriter.write(2, scan_value2); > {code} > Since *j.name* is from right table of *left join* query, it may be null. 
> However generated schema doesn't think so (probably because it defined as > *name text not null*) > {code:title=StackTrace|borderStyle=solid} > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.debug.package$DebugExec$$anonfun$3$$anon$1.hasNext(package.scala:146) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1763) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
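A hedged sketch of how the {{nullif}} workaround above plugs into the JDBC reader: the whole subquery, including the {{nullif(...)}} rewrite, is passed as the dbtable option. The connection URL is a placeholder; driver and credential options are omitted.
{code}
val query =
  """(select t.id, t.age, nullif(j.name, null) as name
    |   from masterdata.testtable t
    |   left join masterdata.jointable j on t.id = j.id) as testtable""".stripMargin

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")  // placeholder URL
  .option("dbtable", query)
  .load()
{code}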
[jira] [Resolved] (SPARK-21469) Add doc and example for FeatureHasher
[ https://issues.apache.org/jira/browse/SPARK-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-21469. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19024 [https://github.com/apache/spark/pull/19024] > Add doc and example for FeatureHasher > - > > Key: SPARK-21469 > URL: https://issues.apache.org/jira/browse/SPARK-21469 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath > Fix For: 2.3.0 > > > Add examples and user guide section for {{FeatureHasher}} in SPARK-13969 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
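For reference, a short example of the usage the new guide section covers; the column names and rows follow the style of the ML feature examples and are otherwise made up.
{code}
import org.apache.spark.ml.feature.FeatureHasher

val dataset = spark.createDataFrame(Seq(
  (2.2, true, "1", "foo"),
  (3.3, false, "2", "bar")
)).toDF("real", "bool", "stringNum", "string")

val hasher = new FeatureHasher()
  .setInputCols("real", "bool", "stringNum", "string")
  .setOutputCol("features")

hasher.transform(dataset).show(false)
{code}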
[jira] [Assigned] (SPARK-21469) Add doc and example for FeatureHasher
[ https://issues.apache.org/jira/browse/SPARK-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-21469: -- Assignee: Bryan Cutler > Add doc and example for FeatureHasher > - > > Key: SPARK-21469 > URL: https://issues.apache.org/jira/browse/SPARK-21469 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Assignee: Bryan Cutler > Fix For: 2.3.0 > > > Add examples and user guide section for {{FeatureHasher}} in SPARK-13969 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18350) Support session local timezone
[ https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136567#comment-16136567 ] Vinayak edited comment on SPARK-18350 at 8/30/17 12:56 PM: --- [~ueshin] I have set the below value to set the timeZone to UTC. It is adding the current timeZone value even though it is in the UTC format. spark.conf.set("spark.sql.session.timeZone", "UTC") Find the attached csv data for reference. Expected : Time should remain same as the input since it's already in UTC format var df1 = spark.read.option("delimiter", ",").option("qualifier", "\"").option("inferSchema","true").option("header", "true").option("mode", "PERMISSIVE").option("timestampFormat","MM/dd/'T'HH:mm:ss.SSS").option("dateFormat", "MM/dd/'T'HH:mm:ss").csv("DateSpark.csv"); df1: org.apache.spark.sql.DataFrame = [Name: string, Age: int ... 5 more fields] scala> df1.show(false); ++---++---+---+--+---+ |Name|Age|Add |Date |SparkDate |SparkDate1 |SparkDate2 | ++---++---+---+--+---+ |abc |21 |bvxc|04/22/2017T03:30:02|2017-03-21 03:30:02|2017-03-21 09:00:02.02|2017-03-21 05:30:00| ++---++---+---+--+---+ was (Author: vinayaksgadag): [~ueshin] I have set the below value to set the timeZone to UTC. It is adding the current timeZone value even though it is in the UTC format. spark.conf.set("spark.sql.session.timeZone", "UTC") Find the attached csv data for reference. Expected : Time should remain same as the input since it's already in UTC format var df1 = spark.read.option("delimiter", ",").option("qualifier", "\"").option("inferSchema","true").option("header", "true").option("mode", "PERMISSIVE").option("timestampFormat","MM/dd/'T'HH:mm:ss.SSS").option("dateFormat", "MM/dd/'T'HH:mm:ss").csv("DateSpark.csv"); df1: org.apache.spark.sql.DataFrame = [Name: string, Age: int ... 5 more fields] scala> df1.show(false); -- Name Age Add Date SparkDate SparkDate1 SparkDate2 -- abc 21 bvxc 04/22/2017T03:30:02 2017-03-21 03:30:02 2017-03-21 09:00:02.02 2017-03-21 05:30:00 -- > Support session local timezone > -- > > Key: SPARK-18350 > URL: https://issues.apache.org/jira/browse/SPARK-18350 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Takuya Ueshin > Labels: releasenotes > Fix For: 2.2.0 > > Attachments: sample.csv > > > As of Spark 2.1, Spark SQL assumes the machine timezone for datetime > manipulation, which is bad if users are not in the same timezones as the > machines, or if different users have different timezones. > We should introduce a session local timezone setting that is used for > execution. > An explicit non-goal is locale handling. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
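For anyone trying to reproduce this, a small hedged illustration of what the session-local timezone setting is expected to do: the same instant renders differently depending on the setting, because timestamp-to-string conversion uses the session timezone rather than the JVM default. The CSV timestampFormat path in the comment above is a separate code path, which is what the reporter is questioning.
{code}
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("SELECT from_unixtime(0) AS ts").show(false)        // 1970-01-01 00:00:00

spark.conf.set("spark.sql.session.timeZone", "Asia/Kolkata")
spark.sql("SELECT from_unixtime(0) AS ts").show(false)        // 1970-01-01 05:30:00
{code}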
[jira] [Comment Edited] (SPARK-18350) Support session local timezone
[ https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136567#comment-16136567 ] Vinayak edited comment on SPARK-18350 at 8/30/17 12:55 PM: --- [~ueshin] I have set the below value to set the timeZone to UTC. It is adding the current timeZone value even though it is in the UTC format. spark.conf.set("spark.sql.session.timeZone", "UTC") Find the attached csv data for reference. Expected : Time should remain same as the input since it's already in UTC format var df1 = spark.read.option("delimiter", ",").option("qualifier", "\"").option("inferSchema","true").option("header", "true").option("mode", "PERMISSIVE").option("timestampFormat","MM/dd/'T'HH:mm:ss.SSS").option("dateFormat", "MM/dd/'T'HH:mm:ss").csv("DateSpark.csv"); df1: org.apache.spark.sql.DataFrame = [Name: string, Age: int ... 5 more fields] scala> df1.show(false); -- Name Age Add Date SparkDate SparkDate1 SparkDate2 -- abc 21 bvxc 04/22/2017T03:30:02 2017-03-21 03:30:02 2017-03-21 09:00:02.02 2017-03-21 05:30:00 -- was (Author: vinayaksgadag): [~ueshin] I have set the below value to set the timeZone to UTC. It is adding the current timeZone value even though it is in the UTC format. spark.conf.set("spark.sql.session.timeZone", "UTC") Find the attached csv data for reference. Expected : Time should remain same as the input since it's already in UTC format var df1 = spark.read.option("delimiter", ",").option("qualifier", "\"").option("inferSchema","true").option("header", "true").option("mode", "PERMISSIVE").option("timestampFormat","MM/dd/'T'HH:mm:ss.SSS").option("dateFormat", "MM/dd/'T'HH:mm:ss").csv("DateSpark.csv"); df1: org.apache.spark.sql.DataFrame = [Name: string, Age: int ... 5 more fields] scala> df1.show(false); -- Name Age Add Date SparkDate SparkDate1 SparkDate2 -- abc 21 bvxc 04/22/2017T03:30:02 2017-03-21 03:30:02 2017-03-21 09:00:02.02 2017-03-21 05:30:00 > Support session local timezone > -- > > Key: SPARK-18350 > URL: https://issues.apache.org/jira/browse/SPARK-18350 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Takuya Ueshin > Labels: releasenotes > Fix For: 2.2.0 > > Attachments: sample.csv > > > As of Spark 2.1, Spark SQL assumes the machine timezone for datetime > manipulation, which is bad if users are not in the same timezones as the > machines, or if different users have different timezones. > We should introduce a session local timezone setting that is used for > execution. > An explicit non-goal is locale handling. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21764) Tests failures on Windows: resources not being closed and incorrect paths
[ https://issues.apache.org/jira/browse/SPARK-21764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-21764: Assignee: Hyukjin Kwon > Tests failures on Windows: resources not being closed and incorrect paths > - > > Key: SPARK-21764 > URL: https://issues.apache.org/jira/browse/SPARK-21764 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.3.0 > > > This is actually a clone of https://issues.apache.org/jira/browse/SPARK-18922 > but decided to open another one here, targeting 2.3.0 as fixed version. > In short, there are many test failures on Windows, mainly due to resources > not being closed but attempted to be removed (this is failed on Windows) and > incorrect path inputs. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21764) Tests failures on Windows: resources not being closed and incorrect paths
[ https://issues.apache.org/jira/browse/SPARK-21764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21764. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18971 [https://github.com/apache/spark/pull/18971] > Tests failures on Windows: resources not being closed and incorrect paths > - > > Key: SPARK-21764 > URL: https://issues.apache.org/jira/browse/SPARK-21764 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > Fix For: 2.3.0 > > > This is actually a clone of https://issues.apache.org/jira/browse/SPARK-18922 > but decided to open another one here, targeting 2.3.0 as fixed version. > In short, there are many test failures on Windows, mainly due to resources > not being closed but attempted to be removed (this is failed on Windows) and > incorrect path inputs. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16643) When doing Shuffle, report "java.io.FileNotFoundException"
[ https://issues.apache.org/jira/browse/SPARK-16643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16643. --- Resolution: Won't Fix If this only affects 1.5/1.6, I think it'd be WontFix at this point. Update to Spark 2. > When doing Shuffle, report "java.io.FileNotFoundException" > -- > > Key: SPARK-16643 > URL: https://issues.apache.org/jira/browse/SPARK-16643 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2 > Environment: LSB Version: > :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch > Distributor ID: CentOS > Description: CentOS release 6.6 (Final) > Release: 6.6 > Codename: Final > java version "1.7.0_10" > Java(TM) SE Runtime Environment (build 1.7.0_10-b18) > Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode) >Reporter: Deng Changchun > > In our spark cluster of standalone mode, we execute some SQLs on SparkSQL, > such some aggregate sqls as "select count(rowKey) from HVRC_B_LOG where 1=1 > and RESULTTIME >= 146332800 and RESULTTIME <= 1463414399000" > at the begining all is good, however after about 15 days, when execute the > aggreate sqls, it will report error, the log looks like: > 【Notice: > it is very strange that it won't report error every time when executing > aggreate sql, let's say random, after executing some aggregate sqls, it will > log error by chance.】 > 2016-07-20 13:48:50,250 ERROR [Executor task launch worker-75] > executor.Executor: Managed memory leak detected; size = 8388608 bytes, TID = > 624 > 2016-07-20 13:48:50,250 ERROR [Executor task launch worker-75] > executor.Executor: Exception in task 0.3 in stage 580.0 (TID 624) > java.io.FileNotFoundException: > /tmp/spark-cb199fce-bb80-4e6f-853f-4d7984bf5f34/executor-fb7c2149-c6c4-4697-ba2f-3b53dcd7f34a/blockmgr-0a9003ad-23b3-4ff5-b76f-6fbc5d71e730/3e/temp_shuffle_ef68b340-85e4-483c-90e8-5e8c8d8ee4ee > (没有那个文件或目录) > at java.io.FileOutputStream.open(Native Method) > at java.io.FileOutputStream.(FileOutputStream.java:212) > at > org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:110) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16643) When doing Shuffle, report "java.io.FileNotFoundException"
[ https://issues.apache.org/jira/browse/SPARK-16643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147139#comment-16147139 ] Sajith Dimal commented on SPARK-16643: -- We observed this in spark version 1.6.2 as well, please find the bellow error log: TID: [-1] [] [2017-08-01 22:05:16,768] ERROR {org.apache.spark.executor.Executor} - Exception in task 0.0 in stage 6.0 (TID 6) {org.apache.spark.executor.Executor} java.io.FileNotFoundException: /tmp/blockmgr-d44d050f-8727-4f96-83f5-69e3281d7aa5/39/temp_shuffle_3145a66b-1823-4c82-a7ac-2ac55fd5726e (Stale file handle) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.(FileOutputStream.java:206) at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) > When doing Shuffle, report "java.io.FileNotFoundException" > -- > > Key: SPARK-16643 > URL: https://issues.apache.org/jira/browse/SPARK-16643 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2 > Environment: LSB Version: > :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch > Distributor ID: CentOS > Description: CentOS release 6.6 (Final) > Release: 6.6 > Codename: Final > java version "1.7.0_10" > Java(TM) SE Runtime Environment (build 1.7.0_10-b18) > Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode) >Reporter: Deng Changchun > > In our spark cluster of standalone mode, we execute some SQLs on SparkSQL, > such some aggregate sqls as "select count(rowKey) from HVRC_B_LOG where 1=1 > and RESULTTIME >= 146332800 and RESULTTIME <= 1463414399000" > at the begining all is good, however after about 15 days, when execute the > aggreate sqls, it will report error, the log looks like: > 【Notice: > it is very strange that it won't report error every time when executing > aggreate sql, let's say random, after executing some aggregate sqls, it will > log error by chance.】 > 2016-07-20 13:48:50,250 ERROR [Executor task launch worker-75] > executor.Executor: Managed memory leak detected; size = 8388608 bytes, TID = > 624 > 2016-07-20 13:48:50,250 ERROR [Executor task launch worker-75] > executor.Executor: Exception in task 0.3 in stage 580.0 (TID 624) > java.io.FileNotFoundException: > /tmp/spark-cb199fce-bb80-4e6f-853f-4d7984bf5f34/executor-fb7c2149-c6c4-4697-ba2f-3b53dcd7f34a/blockmgr-0a9003ad-23b3-4ff5-b76f-6fbc5d71e730/3e/temp_shuffle_ef68b340-85e4-483c-90e8-5e8c8d8ee4ee > (没有那个文件或目录) > at java.io.FileOutputStream.open(Native Method) > at java.io.FileOutputStream.(FileOutputStream.java:212) > at > org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:110) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73) > at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147086#comment-16147086 ] Jacek Laskowski commented on SPARK-21728: - Thanks [~sowen]. I'll label it as such when I know if it merits one (looks so, but waiting for a response from [~vanzin] or others who'd know). Sent out an email to the Spark user mailing list today. > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21806) BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading
[ https://issues.apache.org/jira/browse/SPARK-21806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21806. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19038 [https://github.com/apache/spark/pull/19038] > BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading > -- > > Key: SPARK-21806 > URL: https://issues.apache.org/jira/browse/SPARK-21806 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 >Reporter: Marc Kaminski >Assignee: Sean Owen >Priority: Minor > Labels: releasenotes > Fix For: 2.3.0 > > Attachments: PRROC_example.jpeg > > > I would like to reference to a [discussion in scikit-learn| > https://github.com/scikit-learn/scikit-learn/issues/4223], as this behavior > is probably based on the scikit implementation. > Summary: > Currently, the y-axis intercept of the precision recall curve is set to (0.0, > 1.0). This behavior is not ideal in certain edge cases (see example below) > and can also have an impact on cross validation, when optimization metric is > set to "areaUnderPR". > Please consider [blucena's > post|https://github.com/scikit-learn/scikit-learn/issues/4223#issuecomment-215273613] > for possible alternatives. > Edge case example: > Consider a bad classifier, that assigns a high probability to all samples. A > possible output might look like this: > ||Real label || Score || > |1.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 0.95 | > |0.0 | 0.95 | > |1.0 | 1.0 | > This results in the following pr points (first line set by default): > ||Threshold || Recall ||Precision || > |1.0 | 0.0 | 1.0 | > |0.95| 1.0 | 0.2 | > |0.0| 1.0 | 0,16 | > The auPRC would be around 0.6. Classifiers with a more differentiated > probability assignment will be falsely assumed to perform worse in regard to > this auPRC. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
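To make the edge case concrete, a hedged snippet that rebuilds the (score, label) pairs from the table in the description and inspects the PR curve; it assumes a SparkContext named {{sc}} (e.g. in spark-shell).
{code}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// (score, label) pairs taken from the description's table
val scoreAndLabels = sc.parallelize(Seq(
  (1.0, 1.0),
  (1.0, 0.0), (1.0, 0.0), (1.0, 0.0), (1.0, 0.0),
  (1.0, 0.0), (1.0, 0.0), (1.0, 0.0), (1.0, 0.0),
  (0.95, 0.0), (0.95, 0.0),
  (1.0, 1.0)))

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
metrics.pr().collect().foreach(println)  // before this fix, the curve starts at the synthetic (0.0, 1.0)
println(metrics.areaUnderPR())           // inflated by that first point, roughly 0.6 here
{code}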
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147037#comment-16147037 ] Sean Owen commented on SPARK-21728: --- You can label such issues with 'releasenotes'. I consider that for breaking changes or bug fixes that change behavior. I don't know if this is one of the few important items I'd highlight. But if someone's in doubt, tag it, and the release manager can judge. > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicates code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21806) BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading
[ https://issues.apache.org/jira/browse/SPARK-21806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-21806: - Assignee: Sean Owen Labels: releasenotes (was: ) > BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading > -- > > Key: SPARK-21806 > URL: https://issues.apache.org/jira/browse/SPARK-21806 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 >Reporter: Marc Kaminski >Assignee: Sean Owen >Priority: Minor > Labels: releasenotes > Attachments: PRROC_example.jpeg > > > I would like to reference to a [discussion in scikit-learn| > https://github.com/scikit-learn/scikit-learn/issues/4223], as this behavior > is probably based on the scikit implementation. > Summary: > Currently, the y-axis intercept of the precision recall curve is set to (0.0, > 1.0). This behavior is not ideal in certain edge cases (see example below) > and can also have an impact on cross validation, when optimization metric is > set to "areaUnderPR". > Please consider [blucena's > post|https://github.com/scikit-learn/scikit-learn/issues/4223#issuecomment-215273613] > for possible alternatives. > Edge case example: > Consider a bad classifier, that assigns a high probability to all samples. A > possible output might look like this: > ||Real label || Score || > |1.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 1.0 | > |0.0 | 0.95 | > |0.0 | 0.95 | > |1.0 | 1.0 | > This results in the following pr points (first line set by default): > ||Threshold || Recall ||Precision || > |1.0 | 0.0 | 1.0 | > |0.95| 1.0 | 0.2 | > |0.0| 1.0 | 0,16 | > The auPRC would be around 0.6. Classifiers with a more differentiated > probability assignment will be falsely assumed to perform worse in regard to > this auPRC. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21829) Enable config to permanently blacklist a list of nodes
[ https://issues.apache.org/jira/browse/SPARK-21829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21829. --- Resolution: Won't Fix > Enable config to permanently blacklist a list of nodes > -- > > Key: SPARK-21829 > URL: https://issues.apache.org/jira/browse/SPARK-21829 > Project: Spark > Issue Type: New Feature > Components: Scheduler, Spark Core >Affects Versions: 2.1.1, 2.2.0 >Reporter: Luca Canali >Priority: Minor > > The idea for this proposal comes from a performance incident in a local > cluster where a job was found very slow because of a long tail of stragglers > due to 2 nodes in the cluster being slow to access a remote filesystem. > The issue was limited to the 2 machines and was related to external > configurations: the 2 machines that performed badly when accessing the remote > file system were behaving normally for other jobs in the cluster (a shared > YARN cluster). > With this new feature I propose to introduce a mechanism to allow users to > specify a list of nodes in the cluster where executors/tasks should not run > for a specific job. > The proposed implementation that I tested (see PR) uses the Spark blacklist > mechanism. With the parameter spark.blacklist.alwaysBlacklistedNodes, a list > of user-specified nodes is added to the blacklist at the start of the Spark > Context and it never expires. > I have tested this on a YARN cluster on a case taken from the original > production problem and I confirm a performance improvement of about 5x for > the specific test case I have. I imagine that there can be other cases where > Spark users may want to blacklist a set of nodes. This can be used for > troubleshooting, including cases where certain nodes/executors are slow for a > given workload and this is caused by external agents, so the anomaly is not > picked up by the cluster manager. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
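For clarity, a hedged sketch of how the proposed parameter was meant to be set. Note that spark.blacklist.alwaysBlacklistedNodes comes only from this ticket's PR and, given the Won't Fix resolution above, is not an actual Spark configuration key; the node names are placeholders. spark.blacklist.enabled is the existing blacklist switch.
{code}
// Illustration of the proposal only -- the second key was never merged into Spark.
val conf = new org.apache.spark.SparkConf()
  .set("spark.blacklist.enabled", "true")
  .set("spark.blacklist.alwaysBlacklistedNodes", "node07.example.com,node12.example.com")
{code}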
[jira] [Resolved] (SPARK-21873) CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka
[ https://issues.apache.org/jira/browse/SPARK-21873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21873. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19059 [https://github.com/apache/spark/pull/19059] > CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka > --- > > Key: SPARK-21873 > URL: https://issues.apache.org/jira/browse/SPARK-21873 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Yuval Itzchakov >Assignee: Yuval Itzchakov >Priority: Minor > Fix For: 2.3.0 > > Original Estimate: 0h > Remaining Estimate: 0h > > In Scala, using `return` inside a function causes a `NonLocalReturnControl` > exception to be thrown and caught in order to escape the current scope. > While profiling Structured Streaming in production, it clearly shows: > !https://user-images.githubusercontent.com/3448320/29743366-4149ef78-8a99-11e7-94d6-f0cbb691134a.png! > This happens during a 1 minute profiling session on a single executor. The > code is: > {code:java} > while (toFetchOffset != UNKNOWN_OFFSET) { > try { > return fetchData(toFetchOffset, untilOffset, pollTimeoutMs, > failOnDataLoss) > } catch { > case e: OffsetOutOfRangeException => > // When there is some error thrown, it's better to use a new > consumer to drop all cached > // states in the old consumer. We don't need to worry about the > performance because this > // is not a common path. > resetConsumer() > reportDataLoss(failOnDataLoss, s"Cannot fetch offset > $toFetchOffset", e) > toFetchOffset = getEarliestAvailableOffsetBetween(toFetchOffset, > untilOffset) > } > } > {code} > This happens because this method is converted to a function which is ran > inside: > {code:java} > private def runUninterruptiblyIfPossible[T](body: => T): T > {code} > We should avoid using `return` in general, and here specifically as it is a > hot path for applications using Kafka. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
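As an aside, a hedged sketch (not necessarily the merged patch) of how the loop in the description can avoid {{return}}, which is what produces the NonLocalReturnControl inside runUninterruptiblyIfPossible; the element type of the result is assumed.
{code}
import org.apache.kafka.clients.consumer.ConsumerRecord

// Element type assumed; fetchData, resetConsumer, reportDataLoss and
// getEarliestAvailableOffsetBetween are the helpers shown in the description.
var fetched: Option[ConsumerRecord[Array[Byte], Array[Byte]]] = None
while (fetched.isEmpty && toFetchOffset != UNKNOWN_OFFSET) {
  try {
    fetched = Some(fetchData(toFetchOffset, untilOffset, pollTimeoutMs, failOnDataLoss))
  } catch {
    case e: OffsetOutOfRangeException =>
      resetConsumer()
      reportDataLoss(failOnDataLoss, s"Cannot fetch offset $toFetchOffset", e)
      toFetchOffset = getEarliestAvailableOffsetBetween(toFetchOffset, untilOffset)
  }
}
// afterwards, handle the not-fetched case the same way the original method does after its loop
{code}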
[jira] [Assigned] (SPARK-21873) CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka
[ https://issues.apache.org/jira/browse/SPARK-21873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-21873: - Assignee: Yuval Itzchakov Target Version/s: (was: 2.2.1) > CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka > --- > > Key: SPARK-21873 > URL: https://issues.apache.org/jira/browse/SPARK-21873 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Yuval Itzchakov >Assignee: Yuval Itzchakov >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > In Scala, using `return` inside a function causes a `NonLocalReturnControl` > exception to be thrown and caught in order to escape the current scope. > While profiling Structured Streaming in production, it clearly shows: > !https://user-images.githubusercontent.com/3448320/29743366-4149ef78-8a99-11e7-94d6-f0cbb691134a.png! > This happens during a 1 minute profiling session on a single executor. The > code is: > {code:java} > while (toFetchOffset != UNKNOWN_OFFSET) { > try { > return fetchData(toFetchOffset, untilOffset, pollTimeoutMs, > failOnDataLoss) > } catch { > case e: OffsetOutOfRangeException => > // When there is some error thrown, it's better to use a new > consumer to drop all cached > // states in the old consumer. We don't need to worry about the > performance because this > // is not a common path. > resetConsumer() > reportDataLoss(failOnDataLoss, s"Cannot fetch offset > $toFetchOffset", e) > toFetchOffset = getEarliestAvailableOffsetBetween(toFetchOffset, > untilOffset) > } > } > {code} > This happens because this method is converted to a function which is ran > inside: > {code:java} > private def runUninterruptiblyIfPossible[T](body: => T): T > {code} > We should avoid using `return` in general, and here specifically as it is a > hot path for applications using Kafka. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21873) CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka
[ https://issues.apache.org/jira/browse/SPARK-21873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21873: -- Issue Type: Improvement (was: Bug) > CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka > --- > > Key: SPARK-21873 > URL: https://issues.apache.org/jira/browse/SPARK-21873 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Yuval Itzchakov >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > In Scala, using `return` inside a function causes a `NonLocalReturnControl` > exception to be thrown and caught in order to escape the current scope. > While profiling Structured Streaming in production, it clearly shows: > !https://user-images.githubusercontent.com/3448320/29743366-4149ef78-8a99-11e7-94d6-f0cbb691134a.png! > This happens during a 1 minute profiling session on a single executor. The > code is: > {code:java} > while (toFetchOffset != UNKNOWN_OFFSET) { > try { > return fetchData(toFetchOffset, untilOffset, pollTimeoutMs, > failOnDataLoss) > } catch { > case e: OffsetOutOfRangeException => > // When there is some error thrown, it's better to use a new > consumer to drop all cached > // states in the old consumer. We don't need to worry about the > performance because this > // is not a common path. > resetConsumer() > reportDataLoss(failOnDataLoss, s"Cannot fetch offset > $toFetchOffset", e) > toFetchOffset = getEarliestAvailableOffsetBetween(toFetchOffset, > untilOffset) > } > } > {code} > This happens because this method is converted to a function which is ran > inside: > {code:java} > private def runUninterruptiblyIfPossible[T](body: => T): T > {code} > We should avoid using `return` in general, and here specifically as it is a > hot path for applications using Kafka. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21534) PickleException when creating dataframe from python row with empty bytearray
[ https://issues.apache.org/jira/browse/SPARK-21534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21534: Assignee: (was: Apache Spark) > PickleException when creating dataframe from python row with empty bytearray > > > Key: SPARK-21534 > URL: https://issues.apache.org/jira/browse/SPARK-21534 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Maciej Bryński > > {code} > spark.createDataFrame(spark.sql("select unhex('') as xx").rdd.map(lambda x: > {"abc": x.xx})).show() > {code} > This code creates exception. It looks like corner-case. > {code} > net.razorvine.pickle.PickleException: invalid pickle data for bytearray; > expected 1 or 2 args, got 0 > at > net.razorvine.pickle.objects.ByteArrayConstructor.construct(ByteArrayConstructor.java:20) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:99) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:152) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:151) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at scala.Option.foreach(Option.scala:257) > at > 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) > at > org.apache.spark.sql.Dataset.org$apac
[jira] [Commented] (SPARK-21534) PickleException when creating dataframe from python row with empty bytearray
[ https://issues.apache.org/jira/browse/SPARK-21534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146925#comment-16146925 ] Apache Spark commented on SPARK-21534: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/19085 > PickleException when creating dataframe from python row with empty bytearray > > > Key: SPARK-21534 > URL: https://issues.apache.org/jira/browse/SPARK-21534 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Maciej Bryński > > {code} > spark.createDataFrame(spark.sql("select unhex('') as xx").rdd.map(lambda x: > {"abc": x.xx})).show() > {code} > This code creates exception. It looks like corner-case. > {code} > net.razorvine.pickle.PickleException: invalid pickle data for bytearray; > expected 1 or 2 args, got 0 > at > net.razorvine.pickle.objects.ByteArrayConstructor.construct(ByteArrayConstructor.java:20) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:99) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:152) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:151) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at > 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336) > at > org.apache.spark.
[jira] [Assigned] (SPARK-21534) PickleException when creating dataframe from python row with empty bytearray
[ https://issues.apache.org/jira/browse/SPARK-21534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21534: Assignee: Apache Spark > PickleException when creating dataframe from python row with empty bytearray > > > Key: SPARK-21534 > URL: https://issues.apache.org/jira/browse/SPARK-21534 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Maciej Bryński >Assignee: Apache Spark > > {code} > spark.createDataFrame(spark.sql("select unhex('') as xx").rdd.map(lambda x: > {"abc": x.xx})).show() > {code} > This code creates exception. It looks like corner-case. > {code} > net.razorvine.pickle.PickleException: invalid pickle data for bytearray; > expected 1 or 2 args, got 0 > at > net.razorvine.pickle.objects.ByteArrayConstructor.construct(ByteArrayConstructor.java:20) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:99) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:152) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:151) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at 
scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) > at > org.apache.s
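The stack trace above pinpoints the failure: Pyrolite's {{ByteArrayConstructor}} rejects the zero-argument form that an empty {{bytearray}} pickles to ("expected 1 or 2 args, got 0"). One possible direction, sketched below in Scala, is to register a constructor that tolerates zero arguments. The names introduced here ({{TolerantByteArrayConstructor}}, {{register}}) are hypothetical, and this is only a sketch of the idea, not necessarily the change that landed in Spark.

{code:scala}
import net.razorvine.pickle.Unpickler
import net.razorvine.pickle.objects.ByteArrayConstructor

// A pickle constructor that also accepts the zero-argument form produced by
// an empty bytearray, instead of throwing "expected 1 or 2 args, got 0".
class TolerantByteArrayConstructor extends ByteArrayConstructor {
  override def construct(args: Array[AnyRef]): AnyRef =
    if (args.isEmpty) Array.emptyByteArray else super.construct(args)
}

object TolerantByteArrayConstructor {
  // Hypothetical registration helper; the pickled module name differs between
  // Python 2 ("__builtin__") and Python 3 ("builtins"), so both are covered.
  def register(): Unit =
    Seq("__builtin__", "builtins").foreach { module =>
      Unpickler.registerConstructor(module, "bytearray", new TolerantByteArrayConstructor)
    }
}
{code}

Registering such a constructor once per JVM, before any unpickling happens, would let the corner case above round-trip as an empty byte array instead of failing the task.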
[jira] [Updated] (SPARK-21873) CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka
[ https://issues.apache.org/jira/browse/SPARK-21873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuval Itzchakov updated SPARK-21873: Description: In Scala, using `return` inside a function causes a `NonLocalReturnControl` exception to be thrown and caught in order to escape the current scope. While profiling Structured Streaming in production, it clearly shows: !https://user-images.githubusercontent.com/3448320/29743366-4149ef78-8a99-11e7-94d6-f0cbb691134a.png! This happens during a 1 minute profiling session on a single executor. The code is: {code:java} while (toFetchOffset != UNKNOWN_OFFSET) { try { return fetchData(toFetchOffset, untilOffset, pollTimeoutMs, failOnDataLoss) } catch { case e: OffsetOutOfRangeException => // When there is some error thrown, it's better to use a new consumer to drop all cached // states in the old consumer. We don't need to worry about the performance because this // is not a common path. resetConsumer() reportDataLoss(failOnDataLoss, s"Cannot fetch offset $toFetchOffset", e) toFetchOffset = getEarliestAvailableOffsetBetween(toFetchOffset, untilOffset) } } {code} This happens because this method is converted to a function which is run inside: {code:java} private def runUninterruptiblyIfPossible[T](body: => T): T {code} We should avoid using `return` in general, and here specifically as it is a hot path for applications using Kafka. was: In Scala, using `return` inside a function causes a `NonLocalReturnControl` exception to be thrown and caught in order to escape the current scope. While profiling Structured Streaming in production, it clearly shows: !https://user-images.githubusercontent.com/3448320/29743366-4149ef78-8a99-11e7-94d6-f0cbb691134a.png! This happens during a 1 minute profiling session on a single executor. The code is: {code:scala} while (toFetchOffset != UNKNOWN_OFFSET) { try { return fetchData(toFetchOffset, untilOffset, pollTimeoutMs, failOnDataLoss) } catch { case e: OffsetOutOfRangeException => // When there is some error thrown, it's better to use a new consumer to drop all cached // states in the old consumer. We don't need to worry about the performance because this // is not a common path. resetConsumer() reportDataLoss(failOnDataLoss, s"Cannot fetch offset $toFetchOffset", e) toFetchOffset = getEarliestAvailableOffsetBetween(toFetchOffset, untilOffset) } } {code} This happens because this method is converted to a function which is run inside: {code:scala} private def runUninterruptiblyIfPossible[T](body: => T): T {code} We should avoid using `return` in general, and here specifically as it is a hot path for applications using Kafka. > CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka > --- > > Key: SPARK-21873 > URL: https://issues.apache.org/jira/browse/SPARK-21873 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Yuval Itzchakov >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > In Scala, using `return` inside a function causes a `NonLocalReturnControl` > exception to be thrown and caught in order to escape the current scope. > While profiling Structured Streaming in production, it clearly shows: > !https://user-images.githubusercontent.com/3448320/29743366-4149ef78-8a99-11e7-94d6-f0cbb691134a.png! > This happens during a 1 minute profiling session on a single executor. 
The > code is: > {code:java} > while (toFetchOffset != UNKNOWN_OFFSET) { > try { > return fetchData(toFetchOffset, untilOffset, pollTimeoutMs, > failOnDataLoss) > } catch { > case e: OffsetOutOfRangeException => > // When there is some error thrown, it's better to use a new > consumer to drop all cached > // states in the old consumer. We don't need to worry about the > performance because this > // is not a common path. > resetConsumer() > reportDataLoss(failOnDataLoss, s"Cannot fetch offset > $toFetchOffset", e) > toFetchOffset = getEarliestAvailableOffsetBetween(toFetchOffset, > untilOffset) > } > } > {code} > This happens because this method is converted to a function which is run > inside: > {code:java} > private def runUninterruptiblyIfPossible[T](body: => T): T > {code} > We should avoid using `return` in general, and here specifically as it is a > hot path for applications using Kafka. -- This message was sent by Atlassian JIRA (v6.4.14#64029) -
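As a concrete illustration of the last point, the loop can be restructured so the fetched result is carried in a local variable instead of escaping through {{return}}; a by-name wrapper such as {{runUninterruptiblyIfPossible}} then never sees a {{NonLocalReturnControl}} on the successful path. The sketch below is self-contained, with stubbed-out stand-ins for the real {{CachedKafkaConsumer}} collaborators, so it shows the shape of such a change rather than actual Spark code.

{code:scala}
object ReturnFreeFetchSketch {
  // Hypothetical stand-ins for the real members; shapes are illustrative only.
  final val UNKNOWN_OFFSET: Long = -2L
  final case class Record(offset: Long, value: String)

  private def fetchData(offset: Long): Record = Record(offset, "payload")      // stub
  private def resetConsumer(): Unit = ()                                        // stub
  private def reportDataLoss(message: String): Unit = println(message)          // stub
  private def earliestAvailableOffset(offset: Long): Long = UNKNOWN_OFFSET      // stub

  // Same control flow as the loop above, but the result is held in a local
  // variable rather than returned non-locally, so no NonLocalReturnControl is
  // thrown when this body runs inside a by-name wrapper.
  def get(startOffset: Long): Option[Record] = {
    var toFetchOffset = startOffset
    var fetched: Option[Record] = None
    while (fetched.isEmpty && toFetchOffset != UNKNOWN_OFFSET) {
      try {
        fetched = Some(fetchData(toFetchOffset))
      } catch {
        case e: IllegalStateException => // stand-in for OffsetOutOfRangeException
          resetConsumer()
          reportDataLoss(s"Cannot fetch offset $toFetchOffset: ${e.getMessage}")
          toFetchOffset = earliestAvailableOffset(toFetchOffset)
      }
    }
    fetched
  }
}
{code}

The method's value is now an ordinary expression, which keeps the hot path free of exception-based control flow while leaving the retry-on-data-loss behaviour unchanged.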
[jira] [Assigned] (SPARK-21873) CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka
[ https://issues.apache.org/jira/browse/SPARK-21873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21873: Assignee: (was: Apache Spark) > CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka > --- > > Key: SPARK-21873 > URL: https://issues.apache.org/jira/browse/SPARK-21873 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Yuval Itzchakov >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > In Scala, using `return` inside a function causes a `NonLocalReturnControl` > exception to be thrown and caught in order to escape the current scope. > While profiling Structured Streaming in production, it clearly shows: > !https://user-images.githubusercontent.com/3448320/29743366-4149ef78-8a99-11e7-94d6-f0cbb691134a.png! > This happens during a 1 minute profiling session on a single executor. The > code is: > {code:scala} > while (toFetchOffset != UNKNOWN_OFFSET) { > try { > return fetchData(toFetchOffset, untilOffset, pollTimeoutMs, > failOnDataLoss) > } catch { > case e: OffsetOutOfRangeException => > // When there is some error thrown, it's better to use a new > consumer to drop all cached > // states in the old consumer. We don't need to worry about the > performance because this > // is not a common path. > resetConsumer() > reportDataLoss(failOnDataLoss, s"Cannot fetch offset > $toFetchOffset", e) > toFetchOffset = getEarliestAvailableOffsetBetween(toFetchOffset, > untilOffset) > } > } > {code} > This happens because this method is converted to a function which is ran > inside: > {code:scala} > private def runUninterruptiblyIfPossible[T](body: => T): T > {code} > We should avoid using `return` in general, and here specifically as it is a > hot path for applications using Kafka. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21873) CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka
[ https://issues.apache.org/jira/browse/SPARK-21873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21873: Assignee: Apache Spark > CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka > --- > > Key: SPARK-21873 > URL: https://issues.apache.org/jira/browse/SPARK-21873 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Yuval Itzchakov >Assignee: Apache Spark >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > In Scala, using `return` inside a function causes a `NonLocalReturnControl` > exception to be thrown and caught in order to escape the current scope. > While profiling Structured Streaming in production, it clearly shows: > !https://user-images.githubusercontent.com/3448320/29743366-4149ef78-8a99-11e7-94d6-f0cbb691134a.png! > This happens during a 1 minute profiling session on a single executor. The > code is: > {code:scala} > while (toFetchOffset != UNKNOWN_OFFSET) { > try { > return fetchData(toFetchOffset, untilOffset, pollTimeoutMs, > failOnDataLoss) > } catch { > case e: OffsetOutOfRangeException => > // When there is some error thrown, it's better to use a new > consumer to drop all cached > // states in the old consumer. We don't need to worry about the > performance because this > // is not a common path. > resetConsumer() > reportDataLoss(failOnDataLoss, s"Cannot fetch offset > $toFetchOffset", e) > toFetchOffset = getEarliestAvailableOffsetBetween(toFetchOffset, > untilOffset) > } > } > {code} > This happens because this method is converted to a function which is ran > inside: > {code:scala} > private def runUninterruptiblyIfPossible[T](body: => T): T > {code} > We should avoid using `return` in general, and here specifically as it is a > hot path for applications using Kafka. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21873) CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka
[ https://issues.apache.org/jira/browse/SPARK-21873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146901#comment-16146901 ] Apache Spark commented on SPARK-21873: -- User 'YuvalItzchakov' has created a pull request for this issue: https://github.com/apache/spark/pull/19059 > CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka > --- > > Key: SPARK-21873 > URL: https://issues.apache.org/jira/browse/SPARK-21873 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Yuval Itzchakov >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > In Scala, using `return` inside a function causes a `NonLocalReturnControl` > exception to be thrown and caught in order to escape the current scope. > While profiling Structured Streaming in production, it clearly shows: > !https://user-images.githubusercontent.com/3448320/29743366-4149ef78-8a99-11e7-94d6-f0cbb691134a.png! > This happens during a 1 minute profiling session on a single executor. The > code is: > {code:scala} > while (toFetchOffset != UNKNOWN_OFFSET) { > try { > return fetchData(toFetchOffset, untilOffset, pollTimeoutMs, > failOnDataLoss) > } catch { > case e: OffsetOutOfRangeException => > // When there is some error thrown, it's better to use a new > consumer to drop all cached > // states in the old consumer. We don't need to worry about the > performance because this > // is not a common path. > resetConsumer() > reportDataLoss(failOnDataLoss, s"Cannot fetch offset > $toFetchOffset", e) > toFetchOffset = getEarliestAvailableOffsetBetween(toFetchOffset, > untilOffset) > } > } > {code} > This happens because this method is converted to a function which is ran > inside: > {code:scala} > private def runUninterruptiblyIfPossible[T](body: => T): T > {code} > We should avoid using `return` in general, and here specifically as it is a > hot path for applications using Kafka. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21254) History UI: Taking over 1 minute for initial page display
[ https://issues.apache.org/jira/browse/SPARK-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21254: -- Fix Version/s: 2.2.1 > History UI: Taking over 1 minute for initial page display > - > > Key: SPARK-21254 > URL: https://issues.apache.org/jira/browse/SPARK-21254 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Dmitry Parfenchik >Assignee: Dmitry Parfenchik >Priority: Minor > Fix For: 2.2.1, 2.3.0 > > Attachments: screenshot-1.png > > > Currently, on the first page load (if there is no limit set) the whole job > execution history is loaded, since the beginning of time. With a large number of > rows returned (10k+), page load time grows dramatically, causing a 1min+ delay > in Chrome and freezing the process in Firefox, Safari and IE. > A simple inspection in Chrome shows that the network is not an issue here and > only adds a small latency (<1s), while most of the time is spent in the UI > processing the results, according to Chrome devtools: > !screenshot-1.png! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21873) CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka
Yuval Itzchakov created SPARK-21873: --- Summary: CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka Key: SPARK-21873 URL: https://issues.apache.org/jira/browse/SPARK-21873 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.2.0, 2.1.1, 2.1.0 Reporter: Yuval Itzchakov Priority: Minor In Scala, using `return` inside a function causes a `NonLocalReturnControl` exception to be thrown and caught in order to escape the current scope. While profiling Structured Streaming in production, it clearly shows: !https://user-images.githubusercontent.com/3448320/29743366-4149ef78-8a99-11e7-94d6-f0cbb691134a.png! This happens during a 1 minute profiling session on a single executor. The code is: {code:scala} while (toFetchOffset != UNKNOWN_OFFSET) { try { return fetchData(toFetchOffset, untilOffset, pollTimeoutMs, failOnDataLoss) } catch { case e: OffsetOutOfRangeException => // When there is some error thrown, it's better to use a new consumer to drop all cached // states in the old consumer. We don't need to worry about the performance because this // is not a common path. resetConsumer() reportDataLoss(failOnDataLoss, s"Cannot fetch offset $toFetchOffset", e) toFetchOffset = getEarliestAvailableOffsetBetween(toFetchOffset, untilOffset) } } {code} This happens because this method is converted to a function which is run inside: {code:scala} private def runUninterruptiblyIfPossible[T](body: => T): T {code} We should avoid using `return` in general, and here specifically as it is a hot path for applications using Kafka. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21872) Is job duration value of Spark Jobs page on Web UI correct?
[ https://issues.apache.org/jira/browse/SPARK-21872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21872. --- Resolution: Not A Problem (Questions belong on the mailing list to start.) Yes, because in general an app is frequently waiting on resources throughout its execution, for example waiting for executor slots to free up. This should be counted as part of its execution time. > Is job duration value of Spark Jobs page on Web UI correct? > > > Key: SPARK-21872 > URL: https://issues.apache.org/jira/browse/SPARK-21872 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: iamhumanbeing >Priority: Minor > > I have submitted 2 Spark jobs at the same time, but only one gets running; the > other is waiting for resources. However, the Web UI displays both jobs as > running, and the duration value of the job waiting for resources keeps > increasing. So Job 7 only ran for 14s, but its duration value is 29s. > Active Jobs (2) > Job Id ▾ Description Submitted > Duration Stages: Succeeded/Total Tasks (for all stages): Succeeded/Total > 7 (kill) count at :30 2017/08/30 11:33:46 7 s 0/1 > 0/100 > 6 (kill) count at :30 2017/08/30 11:33:46 8 s 0/2 > 15/127 (2 running) > after the jobs finished > 7 count at :30 2017/08/30 11:33:46 29 s 1/1 100/100 > 6 count at :30 2017/08/30 11:33:46 16 s 1/1 (1 skipped) > 27/27 (100 skipped) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
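The resolution comes down to how the Jobs page measures duration: it is clocked from submission, not from when the job first obtains executor slots, so two jobs submitted together both accrue duration while one of them waits. A minimal sketch of that accounting follows, using hypothetical field names rather than Spark's actual UI data model; on this reading, Job 7's 29 s is consistent with it waiting behind Job 6's 16 s run before doing its own ~14 s of work.

{code:scala}
// Hypothetical model, for illustration only (not Spark's actual UI classes).
object JobDurationSketch {
  final case class JobTimes(submissionTime: Long, completionTime: Option[Long])

  // The displayed duration is "completion (or now) minus submission", so time
  // spent waiting for resources is included in the number the UI shows.
  def durationMillis(job: JobTimes, nowMillis: Long = System.currentTimeMillis()): Long =
    job.completionTime.getOrElse(nowMillis) - job.submissionTime
}
{code}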
[jira] [Commented] (SPARK-21869) A cached Kafka producer should not be closed if any task is using it.
[ https://issues.apache.org/jira/browse/SPARK-21869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146889#comment-16146889 ] Prashant Sharma commented on SPARK-21869: - Yes, looking at it. > A cached Kafka producer should not be closed if any task is using it. > - > > Key: SPARK-21869 > URL: https://issues.apache.org/jira/browse/SPARK-21869 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu > > Right now a cached Kafka producer may be closed if a large task uses it for > more than 10 minutes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
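The property the ticket asks for, never closing a producer out from under a running task, is commonly obtained with reference counting: cache eviction only marks the entry, and the real close happens when the last user releases it. The sketch below is illustrative only, with hypothetical names, and is not the actual change to Spark's Kafka producer cache.

{code:scala}
import java.util.concurrent.atomic.{AtomicBoolean, AtomicInteger}

// A reference-counting wrapper: the resource is closed only once it has been
// evicted from the cache AND no task still holds it. The cache is expected to
// remove the entry before calling markEvicted(), so no new acquire() arrives
// for an evicted entry.
final class RefCounted[P](resource: P, doClose: P => Unit) {
  private val refs = new AtomicInteger(0)
  private val evicted = new AtomicBoolean(false)
  private val closed = new AtomicBoolean(false)

  def acquire(): P = { refs.incrementAndGet(); resource }

  def release(): Unit = { refs.decrementAndGet(); maybeClose() }

  def markEvicted(): Unit = { evicted.set(true); maybeClose() }

  // compareAndSet guards against the close running twice when a release and
  // an eviction race each other.
  private def maybeClose(): Unit =
    if (evicted.get() && refs.get() == 0 && closed.compareAndSet(false, true)) {
      doClose(resource)
    }
}
{code}

A task would call acquire() when it starts writing and release() in a finally block, while the cache's eviction policy calls markEvicted() instead of closing the producer directly.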
[jira] [Commented] (SPARK-21728) Allow SparkSubmit to use logging
[ https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146780#comment-16146780 ] Jacek Laskowski commented on SPARK-21728: - I think the change is user-visible and therefore deserves to be included in the release notes for 2.3 (I recall there is a component or label for marking changes like that in a special way). /cc [~sowen] [~hyukjin.kwon] > Allow SparkSubmit to use logging > > > Key: SPARK-21728 > URL: https://issues.apache.org/jira/browse/SPARK-21728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Currently, code in {{SparkSubmit}} cannot call classes or methods that > initialize the Spark {{Logging}} framework. That is because at that time > {{SparkSubmit}} doesn't yet know which application will run, and logging is > initialized differently for certain special applications (notably, the > shells). > It would be better if either {{SparkSubmit}} did logging initialization > earlier based on the application to be run, or did it in a way that could be > overridden later when the app initializes. > Without this, there are currently a few parts of {{SparkSubmit}} that > duplicate code from other parts of Spark just to avoid logging. For example: > * > [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860] > replicates code from Utils.scala > * > [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54] > replicates code from Utils.scala and installs its own shutdown hook > * a few parts of the code could use {{SparkConf}} but can't right now because > of the logging issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
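For the second option described above (initialization done in a way that can be overridden later), one possible shape, shown here as a hedged sketch with hypothetical names rather than the change that actually landed, is to route early {{SparkSubmit}} output through a facade whose real logging configuration is applied lazily and can be replaced once the application (for example a shell) knows how it wants logging set up.

{code:scala}
// Hypothetical sketch only: a logging facade whose configuration step is
// supplied late and applied lazily, so early SparkSubmit code can log without
// locking in the wrong setup for special applications such as the shells.
object DeferredLogging {
  @volatile private var configurator: () => Unit = () => ()  // default: no-op
  @volatile private var applied = false

  // Installed by SparkSubmit (or the application) once it knows what will run;
  // a replacement configurator takes effect on the next log call.
  def setConfigurator(f: () => Unit): Unit = synchronized {
    configurator = f
    applied = false
  }

  def log(msg: String): Unit = {
    if (!applied) synchronized {
      if (!applied) {
        configurator()
        applied = true
      }
    }
    // Stand-in for a real logger call; println keeps the sketch dependency-free.
    println(msg)
  }
}
{code}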