[jira] [Created] (SPARK-22207) High memory usage when converting relational data to Hierarchical data

2017-10-04 Thread kanika dhuria (JIRA)
kanika dhuria created SPARK-22207:
-

 Summary: High memory usage when converting relational data to 
Hierarchical data
 Key: SPARK-22207
 URL: https://issues.apache.org/jira/browse/SPARK-22207
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: kanika dhuria


I have 4 tables:
lineitems ~1.4GB
orders ~330MB
customer ~47MB
nations ~2.2KB

These tables are related as follows:
There are multiple lineitems per order (pk, fk: orderkey)
There are multiple orders per customer (pk, fk: cust_key)
There are multiple customers per nation (pk, fk: nationkey)

Data is almost evenly distributed.

Building the hierarchy up to 3 levels (i.e. joining lineitems, orders, and customers) works 
fine with 4GB/2 cores of executor memory.
Adding nations requires 8GB/2 cores or 4GB/1 core.

==

{noformat}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, struct}

val sqlContext = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.retainGroupColumns", false)
  .config("spark.sql.crossJoin.enabled", true)
  .getOrCreate()

val orders   = sqlContext.sql("select * from orders")
val lineItem = sqlContext.sql("select * from lineitems")
val customer = sqlContext.sql("select * from customers")
val nation   = sqlContext.sql("select * from nations")

// Level 1: nest the line items of each order under that order
val lineitemOrders = lineItem
  .groupBy(col("l_orderkey"))
  .agg(col("l_orderkey"),
    collect_list(struct(col("l_partkey"), col("l_suppkey"), col("l_linenumber"),
      col("l_quantity"), col("l_extendedprice"), col("l_discount"), col("l_tax"),
      col("l_returnflag"), col("l_linestatus"), col("l_shipdate"), col("l_commitdate"),
      col("l_receiptdate"), col("l_shipinstruct"), col("l_shipmode"))).as("lineitem"))
  .join(orders, orders("O_ORDERKEY") === lineItem("l_orderkey"))
  .select(col("O_ORDERKEY"), col("O_CUSTKEY"), col("O_ORDERSTATUS"), col("O_TOTALPRICE"),
    col("O_ORDERDATE"), col("O_ORDERPRIORITY"), col("O_CLERK"), col("O_SHIPPRIORITY"),
    col("O_COMMENT"), col("lineitem"))

// Level 2: nest the orders (with their line items) of each customer under that customer
val customerList = lineitemOrders
  .groupBy(col("o_custkey"))
  .agg(col("o_custkey"),
    collect_list(struct(col("O_ORDERKEY"), col("O_CUSTKEY"), col("O_ORDERSTATUS"),
      col("O_TOTALPRICE"), col("O_ORDERDATE"), col("O_ORDERPRIORITY"), col("O_CLERK"),
      col("O_SHIPPRIORITY"), col("O_COMMENT"), col("lineitem"))).as("items"))
  .join(customer, customer("c_custkey") === lineitemOrders("o_custkey"))
  .select(col("c_custkey"), col("c_name"), col("c_nationkey"), col("items"))

// Level 3: nest the customers (with their orders) of each nation under that nation
val nationList = customerList
  .groupBy(col("c_nationkey"))
  .agg(col("c_nationkey"),
    collect_list(struct(col("c_custkey"), col("c_name"), col("c_nationkey"),
      col("items"))).as("custList"))
  .join(nation, nation("n_nationkey") === customerList("c_nationkey"))
  .select(col("n_nationkey"), col("n_name"), col("custList"))

nationList.write.mode("overwrite").json("filePath")
{noformat}


If customerList is saved to a file and the last agg/join is then run separately, it 
runs fine with 4GB/2 cores.
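
For reference, a minimal sketch of that two-step variant (the intermediate parquet path 
below is just a placeholder, not from the original report):

{noformat}
// Materialize the customer level first, then build the nation level from the saved data.
customerList.write.mode("overwrite").parquet("/tmp/customerList")

val savedCustomerList = sqlContext.read.parquet("/tmp/customerList")
val nationList = savedCustomerList
  .groupBy(col("c_nationkey"))
  .agg(col("c_nationkey"),
    collect_list(struct(col("c_custkey"), col("c_name"), col("c_nationkey"),
      col("items"))).as("custList"))
  .join(nation, nation("n_nationkey") === savedCustomerList("c_nationkey"))
  .select(col("n_nationkey"), col("n_name"), col("custList"))

nationList.write.mode("overwrite").json("filePath")
{noformat}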

I can provide the data if needed.







[jira] [Updated] (SPARK-22202) Release tgz content differences for python and R

2017-10-04 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22202:
-
Description: 
As a follow-up to SPARK-22167: we currently run different profiles/steps in 
make-release.sh for hadoop2.7 vs. hadoop2.6 (and others). We should consider whether 
these differences are significant and whether they should be addressed.

A couple of things:
- R.../doc directory is not in any release jar except hadoop 2.6
- python/dist, python.egg-info are not in any release jar except hadoop 2.7
- R DESCRIPTION has a few additions

I've checked to confirm these are the same in the 2.1.1 release, so this isn't a 
regression.

{code}
spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/doc:
sparkr-vignettes.Rmd
sparkr-vignettes.R
sparkr-vignettes.html
index.html

Only in spark-2.1.2-bin-hadoop2.7/python: dist
Only in spark-2.1.2-bin-hadoop2.7/python/pyspark: python
Only in spark-2.1.2-bin-hadoop2.7/python: pyspark.egg-info

diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/DESCRIPTION 
spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/DESCRIPTION
25a26,27
> NeedsCompilation: no
> Packaged: 2017-10-03 00:42:30 UTC; holden
31c33
< Built: R 3.4.1; ; 2017-10-02 23:18:21 UTC; unix
---
> Built: R 3.4.1; ; 2017-10-03 00:45:27 UTC; unix
Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR: doc
diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/html/00Index.html 
spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/html/00Index.html
16a17
> User guides, package vignettes and other 
> documentation.
Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/Meta: vignette.rds
{code}

  was:
As a follow-up to SPARK-22167: we currently run different profiles/steps in 
make-release.sh for hadoop2.7 vs. hadoop2.6 (and others). We should consider whether 
these differences are significant and whether they should be addressed.

[will add more info on this soon]


> Release tgz content differences for python and R
> 
>
> Key: SPARK-22202
> URL: https://issues.apache.org/jira/browse/SPARK-22202
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SparkR
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>
> As a follow-up to SPARK-22167: we currently run different profiles/steps in 
> make-release.sh for hadoop2.7 vs. hadoop2.6 (and others). We should consider whether 
> these differences are significant and whether they should be addressed.
> A couple of things:
> - R.../doc directory is not in any release jar except hadoop 2.6
> - python/dist, python.egg-info are not in any release jar except hadoop 2.7
> - R DESCRIPTION has a few additions
> I've checked to confirm these are the same in the 2.1.1 release, so this isn't a 
> regression.
> {code}
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/doc:
> sparkr-vignettes.Rmd
> sparkr-vignettes.R
> sparkr-vignettes.html
> index.html
> Only in spark-2.1.2-bin-hadoop2.7/python: dist
> Only in spark-2.1.2-bin-hadoop2.7/python/pyspark: python
> Only in spark-2.1.2-bin-hadoop2.7/python: pyspark.egg-info
> diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/DESCRIPTION 
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/DESCRIPTION
> 25a26,27
> > NeedsCompilation: no
> > Packaged: 2017-10-03 00:42:30 UTC; holden
> 31c33
> < Built: R 3.4.1; ; 2017-10-02 23:18:21 UTC; unix
> ---
> > Built: R 3.4.1; ; 2017-10-03 00:45:27 UTC; unix
> Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR: doc
> diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/html/00Index.html 
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/html/00Index.html
> 16a17
> > User guides, package vignettes and other 
> > documentation.
> Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/Meta: vignette.rds
> {code}






[jira] [Assigned] (SPARK-22206) gapply in R can't work on empty grouping columns

2017-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22206:


Assignee: (was: Apache Spark)

> gapply in R can't work on empty grouping columns
> 
>
> Key: SPARK-22206
> URL: https://issues.apache.org/jira/browse/SPARK-22206
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> {{gapply}} in R invokes {{FlatMapGroupsInRExec}} at runtime, but 
> {{FlatMapGroupsInRExec.requiredChildDistribution}} doesn't consider empty 
> grouping attributes. So {{gapply}} can't work on empty grouping columns.






[jira] [Assigned] (SPARK-22206) gapply in R can't work on empty grouping columns

2017-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22206:


Assignee: Apache Spark

> gapply in R can't work on empty grouping columns
> 
>
> Key: SPARK-22206
> URL: https://issues.apache.org/jira/browse/SPARK-22206
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> {{gapply}} in R invokes {{FlatMapGroupsInRExec}} at runtime, but 
> {{FlatMapGroupsInRExec.requiredChildDistribution}} doesn't consider empty 
> grouping attributes. So {{gapply}} can't work on empty grouping columns.






[jira] [Commented] (SPARK-22206) gapply in R can't work on empty grouping columns

2017-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192472#comment-16192472
 ] 

Apache Spark commented on SPARK-22206:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/19436

> gapply in R can't work on empty grouping columns
> 
>
> Key: SPARK-22206
> URL: https://issues.apache.org/jira/browse/SPARK-22206
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> {{gapply}} in R invokes {{FlatMapGroupsInRExec}} at runtime, but 
> {{FlatMapGroupsInRExec.requiredChildDistribution}} doesn't consider empty 
> grouping attributes. So {{gapply}} can't work on empty grouping columns.






[jira] [Created] (SPARK-22206) gapply in R can't work on empty grouping columns

2017-10-04 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-22206:
---

 Summary: gapply in R can't work on empty grouping columns
 Key: SPARK-22206
 URL: https://issues.apache.org/jira/browse/SPARK-22206
 Project: Spark
  Issue Type: Bug
  Components: SparkR, SQL
Affects Versions: 2.2.0
Reporter: Liang-Chi Hsieh


{{gapply}} in R invokes {{FlatMapGroupsInRExec}} at runtime, but 
{{FlatMapGroupsInRExec.requiredChildDistribution}} doesn't consider empty 
grouping attributes. So {{gapply}} can't work on empty grouping columns.






[jira] [Commented] (SPARK-22204) Explain output for SQL with commands shows no optimization

2017-10-04 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192454#comment-16192454
 ] 

Takeshi Yamamuro commented on SPARK-22204:
--

Good catch. The optimization is applied internally, though; still, I feel we'd better 
fix this so that users aren't misled.
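
A hedged sketch (Spark shell assumed; the a#... ids will differ) of one way to confirm 
that the embedded SELECT itself does get optimized, by inspecting its own query execution:

{code}
// Look at the optimized logical plan of the SELECT on its own, outside the command wrapper.
val q = spark.sql(
  "SELECT a FROM (SELECT a FROM (SELECT a FROM (SELECT 1 AS a) AS b) AS c) AS d")
println(q.queryExecution.optimizedPlan)
// Collapses to:
//   Project [1 AS a#...]
//   +- OneRowRelation
{code}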

> Explain output for SQL with commands shows no optimization
> --
>
> Key: SPARK-22204
> URL: https://issues.apache.org/jira/browse/SPARK-22204
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Andrew Ash
>
> When displaying the explain output for a basic SELECT query, the query plan 
> changes as expected from the analyzed -> optimized stages.  But when putting that 
> same query into a command, for example {{CREATE TABLE}}, it appears that the 
> optimization doesn't take place.
> In Spark shell:
> Explain output for a {{SELECT}} statement shows optimization:
> {noformat}
> scala> spark.sql("SELECT a FROM (SELECT a FROM (SELECT a FROM (SELECT 1 AS a) 
> AS b) AS c) AS d").explain(true)
> == Parsed Logical Plan ==
> 'Project ['a]
> +- 'SubqueryAlias d
>+- 'Project ['a]
>   +- 'SubqueryAlias c
>  +- 'Project ['a]
> +- SubqueryAlias b
>+- Project [1 AS a#29]
>   +- OneRowRelation
> == Analyzed Logical Plan ==
> a: int
> Project [a#29]
> +- SubqueryAlias d
>+- Project [a#29]
>   +- SubqueryAlias c
>  +- Project [a#29]
> +- SubqueryAlias b
>+- Project [1 AS a#29]
>   +- OneRowRelation
> == Optimized Logical Plan ==
> Project [1 AS a#29]
> +- OneRowRelation
> == Physical Plan ==
> *Project [1 AS a#29]
> +- Scan OneRowRelation[]
> scala> 
> {noformat}
> But the same command run inside {{CREATE TABLE}} does not:
> {noformat}
> scala> spark.sql("CREATE TABLE IF NOT EXISTS tmptable AS SELECT a FROM 
> (SELECT a FROM (SELECT a FROM (SELECT 1 AS a) AS b) AS c) AS d").explain(true)
> == Parsed Logical Plan ==
> 'CreateTable `tmptable`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> Ignore
> +- 'Project ['a]
>+- 'SubqueryAlias d
>   +- 'Project ['a]
>  +- 'SubqueryAlias c
> +- 'Project ['a]
>+- SubqueryAlias b
>   +- Project [1 AS a#33]
>  +- OneRowRelation
> == Analyzed Logical Plan ==
> CreateHiveTableAsSelectCommand [Database:default}, TableName: tmptable, 
> InsertIntoHiveTable]
>+- Project [a#33]
>   +- SubqueryAlias d
>  +- Project [a#33]
> +- SubqueryAlias c
>+- Project [a#33]
>   +- SubqueryAlias b
>  +- Project [1 AS a#33]
> +- OneRowRelation
> == Optimized Logical Plan ==
> CreateHiveTableAsSelectCommand [Database:default}, TableName: tmptable, 
> InsertIntoHiveTable]
>+- Project [a#33]
>   +- SubqueryAlias d
>  +- Project [a#33]
> +- SubqueryAlias c
>+- Project [a#33]
>   +- SubqueryAlias b
>  +- Project [1 AS a#33]
> +- OneRowRelation
> == Physical Plan ==
> CreateHiveTableAsSelectCommand CreateHiveTableAsSelectCommand 
> [Database:default}, TableName: tmptable, InsertIntoHiveTable]
>+- Project [a#33]
>   +- SubqueryAlias d
>  +- Project [a#33]
> +- SubqueryAlias c
>+- Project [a#33]
>   +- SubqueryAlias b
>  +- Project [1 AS a#33]
> +- OneRowRelation
> scala>
> {noformat}
> Note that there is no change between the analyzed and optimized plans when 
> run in a command.
> This is misleading my users, as they think that there is no optimization 
> happening in the query!






[jira] [Resolved] (SPARK-22203) Add job description for file listing Spark jobs

2017-10-04 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-22203.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

> Add job description for file listing Spark jobs
> ---
>
> Key: SPARK-22203
> URL: https://issues.apache.org/jira/browse/SPARK-22203
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.3.0
>
>
> The user may be confused by some 1-task jobs. We can add a job 
> description for these jobs so that the user can figure out what they are.
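
A hedged sketch of the general mechanism involved (SparkContext.setJobDescription), not 
the actual patch; the path below is a placeholder:

{code}
// Internal jobs can carry a description that shows up in the Spark UI.
spark.sparkContext.setJobDescription("Listing leaf files and directories")
val df = spark.read.parquet("/some/large/partitioned/path")  // triggers a file-listing job
spark.sparkContext.setJobDescription(null)                   // clear the label afterwards
{code}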






[jira] [Commented] (SPARK-21785) Support create table from a file schema

2017-10-04 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192433#comment-16192433
 ] 

Takeshi Yamamuro commented on SPARK-21785:
--

{code}
scala> sql("""CREATE TABLE customer USING parquet OPTIONS (path 
'/tmp/spark-tpcds/customer')""")
res0: org.apache.spark.sql.DataFrame = []

scala> spark.table("customer").printSchema
root
 |-- c_customer_sk: integer (nullable = true)
 |-- c_customer_id: string (nullable = true)
 |-- c_current_cdemo_sk: integer (nullable = true)
 |-- c_current_hdemo_sk: integer (nullable = true)
 |-- c_current_addr_sk: integer (nullable = true)
 |-- c_first_shipto_date_sk: integer (nullable = true)
 |-- c_first_sales_date_sk: integer (nullable = true)
 |-- c_salutation: string (nullable = true)
 |-- c_first_name: string (nullable = true)
 |-- c_last_name: string (nullable = true)
 |-- c_preferred_cust_flag: string (nullable = true)
 |-- c_birth_day: integer (nullable = true)
 |-- c_birth_month: integer (nullable = true)
 |-- c_birth_year: integer (nullable = true)
 |-- c_birth_country: string (nullable = true)
 |-- c_login: string (nullable = true)
 |-- c_email_address: string (nullable = true)
 |-- c_last_review_date: string (nullable = true)
{code}
Am I missing something?

> Support create table from a file schema
> ---
>
> Key: SPARK-21785
> URL: https://issues.apache.org/jira/browse/SPARK-21785
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: liupengcheng
>  Labels: sql
>
> Currently Spark does not support creating a table from a file's schema, 
> for example:
> {code:java}
> CREATE EXTERNAL TABLE IF NOT EXISTS test LIKE PARQUET 
> '/user/test/abc/a.snappy.parquet' STORED AS PARQUET LOCATION
> '/user/test/def/';
> {code}






[jira] [Commented] (SPARK-21785) Support create table from a file schema

2017-10-04 Thread Jacky Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192430#comment-16192430
 ] 

Jacky Shen commented on SPARK-21785:


From my understanding, "CREATE TABLE xxx USING parquet" won't get the schema from 
the file.

What [~liupengcheng] wants is to get the schema from a given parquet file.
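
A hedged workaround sketch using standard DataFrameReader calls (it approximates the 
requested DDL with a temp view rather than an external table; the paths reuse the 
example from the issue):

{code}
// Read the schema from one parquet data file, then expose a view over a different
// location using that schema.
val fileSchema = spark.read.parquet("/user/test/abc/a.snappy.parquet").schema
spark.read.schema(fileSchema)
  .parquet("/user/test/def/")
  .createOrReplaceTempView("test")
{code}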

> Support create table from a file schema
> ---
>
> Key: SPARK-21785
> URL: https://issues.apache.org/jira/browse/SPARK-21785
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: liupengcheng
>  Labels: sql
>
> Currently Spark does not support creating a table from a file's schema, 
> for example:
> {code:java}
> CREATE EXTERNAL TABLE IF NOT EXISTS test LIKE PARQUET 
> '/user/test/abc/a.snappy.parquet' STORED AS PARQUET LOCATION
> '/user/test/def/';
> {code}






[jira] [Commented] (SPARK-22137) Failed to insert VectorUDT to hive table with DataFrameWriter.insertInto(tableName: String)

2017-10-04 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192414#comment-16192414
 ] 

Liang-Chi Hsieh commented on SPARK-22137:
-

Ideally, I think a UDT should be able to be cast to/from its corresponding SQL type 
({{UserDefinedType.sqlType}}). Practically, I'm not sure if there is any reason 
we prevent such casting for now.

> Failed to insert VectorUDT to hive table with 
> DataFrameWriter.insertInto(tableName: String)
> ---
>
> Key: SPARK-22137
> URL: https://issues.apache.org/jira/browse/SPARK-22137
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: yzheng616
>
> Failed to insert VectorUDT to hive table with 
> DataFrameWriter.insertInto(tableName: String). The issue seems similar to 
> SPARK-17765, which was resolved in 2.1.0. 
> Error message: 
> {color:red}Exception in thread "main" org.apache.spark.sql.AnalysisException: 
> cannot resolve '`features`' due to data type mismatch: cannot cast 
> org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 to 
> StructType(StructField(type,ByteType,true), 
> StructField(size,IntegerType,true), 
> StructField(indices,ArrayType(IntegerType,true),true), 
> StructField(values,ArrayType(DoubleType,true),true));;
> 'InsertIntoTable Relation[id#21,features#22] parquet, 
> OverwriteOptions(false,Map()), false
> +- 'Project [cast(id#13L as int) AS id#27, cast(features#14 as 
> struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) AS 
> features#28]
>+- LogicalRDD [id#13L, features#14]{color}
> Reproduce code:
> {code:java}
> import scala.annotation.varargs
> import org.apache.spark.ml.linalg.SQLDataTypes
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.types.LongType
> import org.apache.spark.sql.types.StructField
> import org.apache.spark.sql.types.StructType
> case class UDT(`id`: Long, `features`: org.apache.spark.ml.linalg.Vector)
> object UDTTest {
>   def main(args: Array[String]): Unit = {
> val tb = "table_udt"
> val spark = 
> SparkSession.builder().master("local[4]").appName("UDTInsertInto").enableHiveSupport().getOrCreate()
> spark.sql("drop table if exists " + tb)
> 
> /*
>  * VectorUDT sql type definition:
>  * 
>  *   override def sqlType: StructType = {
>  *   StructType(Seq(
>  *StructField("type", ByteType, nullable = false),
>  *StructField("size", IntegerType, nullable = true),
>  *StructField("indices", ArrayType(IntegerType, containsNull = 
> false), nullable = true),
>  *StructField("values", ArrayType(DoubleType, containsNull = 
> false), nullable = true)))
>  *   }
> */
> 
> //Create Hive table base on VectorUDT sql type
> spark.sql("create table if not exists "+tb+"(id int, features 
> struct,values:array>)" +
>   " row format serde 
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'"+
>   " stored as"+
> " inputformat 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'"+
> " outputformat 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'")
> var seq = new scala.collection.mutable.ArrayBuffer[UDT]()
> for (x <- 1 to 2) {
>   seq += (new UDT(x, org.apache.spark.ml.linalg.Vectors.dense(0.2, 0.21, 
> 0.44)))
> }
> val rowRDD = (spark.sparkContext.makeRDD[UDT](seq)).map { x => 
> Row.fromSeq(Seq(x.id,x.features)) }
> val schema = StructType(Array(StructField("id", 
> LongType,false),StructField("features", SQLDataTypes.VectorType,false)))
> val df = spark.createDataFrame(rowRDD, schema)
>  
> //insert into hive table
> df.write.insertInto(tb)
>   }
> }
> {code}






[jira] [Reopened] (SPARK-21783) Turn on ORC filter push-down by default

2017-10-04 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-21783:
---

> Turn on ORC filter push-down by default
> ---
>
> Key: SPARK-21783
> URL: https://issues.apache.org/jira/browse/SPARK-21783
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Like Parquet (SPARK-9207), it would be great to turn on the ORC option, too.
> This option has been turned off by default from the beginning, SPARK-2883:
> - 
> https://github.com/apache/spark/commit/aa31e431fc09f0477f1c2351c6275769a31aca90#diff-41ef65b9ef5b518f77e2a03559893f4dR149
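
For context, a hedged sketch of enabling the push-down explicitly per session while the 
default is still off (spark.sql.orc.filterPushdown is the existing SQL conf; the table 
name below is a placeholder):

{code}
// Turn on ORC predicate push-down for this session only.
spark.conf.set("spark.sql.orc.filterPushdown", "true")
spark.sql("SELECT * FROM some_orc_table WHERE id > 100").show()
{code}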






[jira] [Commented] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-10-04 Thread Vadim Semenov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192407#comment-16192407
 ] 

Vadim Semenov commented on SPARK-21999:
---

Could you somehow reproduce the issue? If it's related to serialization, it 
should be easy to write an example that reproduces that. Or maybe post the code 
that hits the issue?

> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.streaming.dst

[jira] [Commented] (SPARK-22187) Update unsaferow format for saved state such that we can set timeouts when state is null

2017-10-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192406#comment-16192406
 ] 

Sean Owen commented on SPARK-22187:
---

[~tdas] fixed for 2.3.0 right, if it's in master?

> Update unsaferow format for saved state such that we can set timeouts when 
> state is null
> 
>
> Key: SPARK-22187
> URL: https://issues.apache.org/jira/browse/SPARK-22187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>  Labels: release-notes, releasenotes
> Fix For: 3.0.0
>
>
> Currently the group state of the user-defined type is encoded as top-level 
> columns in the unsafe rows stored in the state store. The timeout timestamp is 
> also saved (when needed) as the last top-level column. Since the 
> groupState is serialized to top-level columns, you cannot save "null" as a 
> value of the state (setting null in all the top-level columns is not equivalent). 
> So we don't let the user set the timeout without initializing the state for 
> a key. Based on user experience, this leads to confusion. 
> This JIRA is to change the row format such that the state is saved as nested 
> columns. This would allow the state to be set to null, and avoid these 
> confusing corner cases.
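
A minimal, hypothetical sketch of the user-facing pattern this refers to (the names 
Event, KeyState, and the rate source are placeholders, not from the JIRA): asking for 
a timeout for a key whose state has never been initialized.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(key: String, value: Long)
case class KeyState(count: Long)

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val events = spark.readStream.format("rate").load()
  .select($"value").as[Long]
  .map(v => Event((v % 10).toString, v))

val result = events
  .groupByKey(_.key)
  .flatMapGroupsWithState[KeyState, String](
      OutputMode.Append(), GroupStateTimeout.ProcessingTimeTimeout) {
    (key, values, state: GroupState[KeyState]) =>
      if (!state.exists) {
        // The corner case described above: setting a timeout without ever
        // calling state.update(...) for this key.
        state.setTimeoutDuration("10 minutes")
      }
      Iterator.empty
  }
{code}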






[jira] [Commented] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-04 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192395#comment-16192395
 ] 

Michael N commented on SPARK-22163:
---

Committers: The actual issues are with Sean Owen himself:

1. Sean Owen does not completely understand the issues in the tickets.
2. For the questions posted in the ticket to help with understanding the issue, 
he does not know the answers.
3. Instead of either finding the correct answers to the questions or asking people 
who know the correct answers, he blindly closes the tickets.

For instance, for the issue in ticket 
https://issues.apache.org/jira/browse/SPARK-22163, it occurs on both the slaves and 
the driver. Further, while Spark uses multiple threads to run the stream processing 
asynchronously, there is still synchronization from batch to batch. For instance, 
say the batch interval is 5 seconds but batch #3 takes 1 minute to complete; Spark 
does not start batch #4 until batch #3 is completely done. So in this respect, 
batch processing between each interval is synchronous. Yet Sean Owen assumes 
everything in Spark Streaming is asynchronous.

Another instance is that Sean Owen does not understand the difference between design 
flaws and coding bugs. Code may be perfect and match the design. However, if the 
design is flawed, then it is the design that needs to be changed. An example is in 
the earlier Spark API around 1.x: Spark's map interface passed in only one object at 
a time. That is a design flaw because it causes massive overhead for billions of 
objects. The newer Spark map interface passes in a list of objects via an Iterator.

Therefore, the broader issue is with Sean Owen acting more as a blocker and closing 
the tickets when he does not understand them. When such cases come up, he should 
either learn about the issues or ask someone else to do that.


> Design Issue of Spark Streaming that Causes Random Run-time Exception
> -
>
> Key: SPARK-22163
> URL: https://issues.apache.org/jira/browse/SPARK-22163
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Structured Streaming
>Affects Versions: 2.2.0
> Environment: Spark Streaming
> Kafka
> Linux
>Reporter: Michael N
>Priority: Critical
>
> The application's objects can contain Lists and can be modified dynamically as 
> well.   However, the Spark Streaming framework asynchronously serializes the 
> application's objects as the application runs.  Therefore, it causes a random 
> run-time exception on the List when the Spark Streaming framework happens to 
> serialize the application's objects while the application modifies a List in 
> its own object.  
> In fact, there are multiple bugs reported about
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject
> that are permutations of the same root cause. So the design issue of the Spark 
> Streaming framework is that it does this serialization asynchronously.  
> Instead, it should either
> 1. do this serialization synchronously. This is preferred to eliminate the 
> issue completely.  Or
> 2. allow it to be configured per application whether to do this serialization 
> synchronously or asynchronously, depending on the nature of each application.
> Also, the Spark documentation should describe the conditions that trigger Spark 
> to do this type of serialization asynchronously, so applications can work 
> around them until the fix is provided. 






[jira] [Comment Edited] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-10-04 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192394#comment-16192394
 ] 

Michael N edited comment on SPARK-21999 at 10/5/17 2:38 AM:


Committers: The actual issues are with Sean Owen himself:

1. Sean Owen does not completely understand the issues in the tickets.
2. For the questions posted in the ticket to help with understanding the issue, 
he does not know the answers.
3. Instead of either finding the correct answers to the questions or asking people 
who know the correct answers, he blindly closes the tickets.

For instance, for the issue in ticket 
https://issues.apache.org/jira/browse/SPARK-22163, it occurs on both the slaves and 
the driver. Further, while Spark uses multiple threads to run the stream processing 
asynchronously, there is still synchronization from batch to batch. For instance, 
say the batch interval is 5 seconds but batch #3 takes 1 minute to complete; Spark 
does not start batch #4 until batch #3 is completely done. So in this respect, 
batch processing between each interval is synchronous. Yet Sean Owen assumes 
everything in Spark Streaming is asynchronous.

Another instance is that Sean Owen does not understand the difference between design 
flaws and coding bugs. Code may be perfect and match the design. However, if the 
design is flawed, then it is the design that needs to be changed. An example is in 
the earlier Spark API around 1.x: Spark's map interface passed in only one object at 
a time. That is a design flaw because it causes massive overhead for billions of 
objects. The newer Spark map interface passes in a list of objects via an Iterator.

Therefore, the broader issue is with Sean Owen acting more as a blocker and closing 
the tickets when he does not understand them. When such cases come up, he should 
either learn about the issues or ask someone else to do that.


was (Author: michaeln_apache):
The actual issues are with Sean Owen himself:

1. Sean Owen does not completely understand the issues in the tickets.
2. For the questions posted in the ticket to help with understanding the issue, 
he does not know the answers.
3. Instead of either finding the correct answers to the questions or asking people 
who know the correct answers, he blindly closes the tickets.

For instance, for the issue in ticket 
https://issues.apache.org/jira/browse/SPARK-22163, it occurs on both the slaves and 
the driver. Further, while Spark uses multiple threads to run the stream processing 
asynchronously, there is still synchronization from batch to batch. For instance, 
say the batch interval is 5 seconds but batch #3 takes 1 minute to complete; Spark 
does not start batch #4 until batch #3 is completely done. So in this respect, 
batch processing between each interval is synchronous. Yet Sean Owen assumes 
everything in Spark Streaming is asynchronous.

Another instance is that Sean Owen does not understand the difference between design 
flaws and coding bugs. Code may be perfect and match the design. However, if the 
design is flawed, then it is the design that needs to be changed. An example is in 
the earlier Spark API around 1.x: Spark's map interface passed in only one object at 
a time. That is a design flaw because it causes massive overhead for billions of 
objects. The newer Spark map interface passes in a list of objects via an Iterator.

Therefore, the broader issue is with Sean Owen acting more as a blocker and closing 
the tickets when he does not understand them. When such cases come up, he should 
either learn about the issues or ask someone else to do that.

> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Bec

[jira] [Commented] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-10-04 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192394#comment-16192394
 ] 

Michael N commented on SPARK-21999:
---

The actual issues are with Sean Owen himself:

1. Sean Owen does not completely understand the issues in the tickets.
2. For the questions posted in the ticket to help with understanding the issue, 
he does not know the answers.
3. Instead of either finding the correct answers to the questions or asking people 
who know the correct answers, he blindly closes the tickets.

For instance, for the issue in ticket 
https://issues.apache.org/jira/browse/SPARK-22163, it occurs on both the slaves and 
the driver. Further, while Spark uses multiple threads to run the stream processing 
asynchronously, there is still synchronization from batch to batch. For instance, 
say the batch interval is 5 seconds but batch #3 takes 1 minute to complete; Spark 
does not start batch #4 until batch #3 is completely done. So in this respect, 
batch processing between each interval is synchronous. Yet Sean Owen assumes 
everything in Spark Streaming is asynchronous.

Another instance is that Sean Owen does not understand the difference between design 
flaws and coding bugs. Code may be perfect and match the design. However, if the 
design is flawed, then it is the design that needs to be changed. An example is in 
the earlier Spark API around 1.x: Spark's map interface passed in only one object at 
a time. That is a design flaw because it causes massive overhead for billions of 
objects. The newer Spark map interface passes in a list of objects via an Iterator.

Therefore, the broader issue is with Sean Owen acting more as a blocker and closing 
the tickets when he does not understand them. When such cases come up, he should 
either learn about the issues or ask someone else to do that.

> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$get

[jira] [Resolved] (SPARK-22187) Update unsaferow format for saved state such that we can set timeouts when state is null

2017-10-04 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-22187.
---
   Resolution: Fixed
Fix Version/s: (was: 2.2.0)
   3.0.0

Issue resolved by pull request 19416
[https://github.com/apache/spark/pull/19416]

> Update unsaferow format for saved state such that we can set timeouts when 
> state is null
> 
>
> Key: SPARK-22187
> URL: https://issues.apache.org/jira/browse/SPARK-22187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>  Labels: release-notes, releasenotes
> Fix For: 3.0.0
>
>
> Currently the group state of the user-defined type is encoded as top-level 
> columns in the unsafe rows stored in the state store. The timeout timestamp is 
> also saved (when needed) as the last top-level column. Since the 
> groupState is serialized to top-level columns, you cannot save "null" as a 
> value of the state (setting null in all the top-level columns is not equivalent). 
> So we don't let the user set the timeout without initializing the state for 
> a key. Based on user experience, this leads to confusion. 
> This JIRA is to change the row format such that the state is saved as nested 
> columns. This would allow the state to be set to null, and avoid these 
> confusing corner cases.






[jira] [Comment Edited] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-04 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192352#comment-16192352
 ] 

Michael N edited comment on SPARK-22163 at 10/5/17 2:11 AM:


It is obvious that you don't understand the difference between design flaws and 
coding bugs; in particular, you have not been able to provide the answers to these 
questions:

1. In the first place, why does Spark serialize the application objects 
*asynchronously* while the streaming application is running continuously from 
batch to batch?

2. If Spark needs to do this type of serialization at all, why does it not do it 
at the end of the batch, *synchronously*?

Instead of blindly closing tickets, you need to either find their answers and post 
them here or let someone else who is capable address them. 

Btw, your response to the ticket 
https://issues.apache.org/jira/browse/SPARK-21999 where you said

  "Your app is modifying a collection asynchronously w.r.t. Spark. Right"

confirmed that you do not understand the issue.  *This issue occurs on both 
the slave nodes and the driver*.  My app is *not* modifying a collection 
asynchronously w.r.t. Spark.  So you kept making the same invalid claim and 
kept closing a ticket that you do not understand.  My Spark Streaming 
application is run synchronously by the Spark Streaming framework from batch to 
batch, and it modifies the data synchronously as part of the batch processing. 
However, the Spark framework has another thread that *asynchronously* serializes 
the application objects.


was (Author: michaeln_apache):
It is obvious that you don't understand the difference between design flaws and 
coding bugs; in particular, you have not been able to provide the answers to these 
questions:

1. In the first place, why does Spark serialize the application objects 
*asynchronously* while the streaming application is running continuously from 
batch to batch?

2. If Spark needs to do this type of serialization at all, why does it not do it 
at the end of the batch, *synchronously*?

Instead of blindly closing tickets, you need to either find their answers and post 
them here or let someone else who is capable address them. 

Btw, your response to the ticket 
https://issues.apache.org/jira/browse/SPARK-21999 where you said

  "Your app is modifying a collection asynchronously w.r.t. Spark. Right"

confirmed that you do not understand the issue.  My app is *not* modifying a 
collection asynchronously w.r.t. Spark.  So you kept making the same invalid 
claim and kept closing a ticket that you do not understand.  My Spark Streaming 
application is run synchronously by the Spark Streaming framework from batch to 
batch, and it modifies the data synchronously as part of the batch processing. 
However, the Spark framework has another thread that *asynchronously* serializes 
the application objects.

> Design Issue of Spark Streaming that Causes Random Run-time Exception
> -
>
> Key: SPARK-22163
> URL: https://issues.apache.org/jira/browse/SPARK-22163
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Structured Streaming
>Affects Versions: 2.2.0
> Environment: Spark Streaming
> Kafka
> Linux
>Reporter: Michael N
>Priority: Critical
>
> The application's objects can contain Lists and can be modified dynamically as 
> well.   However, the Spark Streaming framework asynchronously serializes the 
> application's objects as the application runs.  Therefore, it causes a random 
> run-time exception on the List when the Spark Streaming framework happens to 
> serialize the application's objects while the application modifies a List in 
> its own object.  
> In fact, there are multiple bugs reported about
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject
> that are permutations of the same root cause. So the design issue of the Spark 
> Streaming framework is that it does this serialization asynchronously.  
> Instead, it should either
> 1. do this serialization synchronously. This is preferred to eliminate the 
> issue completely.  Or
> 2. allow it to be configured per application whether to do this serialization 
> synchronously or asynchronously, depending on the nature of each application.
> Also, the Spark documentation should describe the conditions that trigger Spark 
> to do this type of serialization asynchronously, so applications can work 
> around them until the fix is provided. 






[jira] [Commented] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-10-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192374#comment-16192374
 ] 

Sean Owen commented on SPARK-21999:
---

Committers: rather than continue the close battle with this person, I'm asking 
INFRA to disable this.
https://issues.apache.org/jira/browse/INFRA-15221

> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream.compute(Tr

[jira] [Commented] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192373#comment-16192373
 ] 

Sean Owen commented on SPARK-22163:
---

Committers: rather than continue the close battle with this person, I'm asking 
INFRA to disable this.
https://issues.apache.org/jira/browse/INFRA-15221

> Design Issue of Spark Streaming that Causes Random Run-time Exception
> -
>
> Key: SPARK-22163
> URL: https://issues.apache.org/jira/browse/SPARK-22163
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Structured Streaming
>Affects Versions: 2.2.0
> Environment: Spark Streaming
> Kafka
> Linux
>Reporter: Michael N
>Priority: Critical
>
> The application's objects can contain Lists and can be modified dynamically as 
> well.   However, the Spark Streaming framework asynchronously serializes the 
> application's objects as the application runs.  Therefore, it causes a random 
> run-time exception on the List when the Spark Streaming framework happens to 
> serialize the application's objects while the application modifies a List in 
> its own object.  
> In fact, there are multiple bugs reported about
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject
> that are permutations of the same root cause. So the design issue of the Spark 
> Streaming framework is that it does this serialization asynchronously.  
> Instead, it should either
> 1. do this serialization synchronously. This is preferred to eliminate the 
> issue completely.  Or
> 2. allow it to be configured per application whether to do this serialization 
> synchronously or asynchronously, depending on the nature of each application.
> Also, the Spark documentation should describe the conditions that trigger Spark 
> to do this type of serialization asynchronously, so applications can work 
> around them until the fix is provided. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-10-04 Thread Michael N (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael N reopened SPARK-21999:
---

Sean, this issue occurs on both the slave nodes and the driver. No, my app is 
not modifying a collection asynchronously w.r.t. Spark. You misunderstood the 
issue completely and kept making the same invalid claim. My Spark Streaming 
application is run synchronously by the Spark Streaming framework from batch to 
batch, and it modifies the data synchronously as part of the batch processing. 
However, the Spark framework has another thread that asynchronously serializes 
the application objects.

Re-read the original issue and provide answers to the following questions 
instead of blindly closing issues:

1. In the first place, why does Spark serialize the application objects 
*asynchronously* while the streaming application is running continuously from 
batch to batch?

2. If Spark needs to do this type of serialization at all, why does it not do 
it at the end of the batch *synchronously*?

Do not blindly close tickets until you fully understand them and can answer 
the questions above.
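
Editor's note: a minimal standalone sketch, not taken from either party's code, 
of how java.util.ArrayList.writeObject ends up throwing 
ConcurrentModificationException when one thread serializes a list while another 
thread mutates it. All names here are illustrative.

{code:scala}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.util.{ArrayList => JArrayList}

object CmeSketch {
  def main(args: Array[String]): Unit = {
    val list = new JArrayList[Int]()
    (1 to 10000).foreach(list.add(_))

    // One thread keeps mutating the list, the way an application batch might.
    val mutator = new Thread(new Runnable {
      def run(): Unit = while (true) { list.add(42); list.remove(0) }
    })
    mutator.setDaemon(true)
    mutator.start()

    // Another thread serializes it, the way a background framework thread might.
    // This typically fails with java.util.ConcurrentModificationException
    // raised from ArrayList.writeObject.
    new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(list)
  }
}
{code}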


> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
> 

[jira] [Closed] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-10-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-21999.
-
Resolution: Invalid

Discussion may continue here, but the issue should not be reopened except by 
committers, when there is a concrete behavior change or PR to discuss.

> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:42)
>   at 
> org.apach

[jira] [Commented] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-10-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192364#comment-16192364
 ] 

Sean Owen commented on SPARK-21999:
---

[~michaeln_apache] I'll have to see if we can block you from accessing the 
Spark JIRA if you keep reopening the issue. Refer to your comment at 
https://issues.apache.org/jira/browse/SPARK-21999?focusedCommentId=16185130&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16185130

Spark Streaming is inherently asynchronous with respect to the driver process. 
Batch processing is also inherently asynchronous in that many things happen at 
once. You are writing as if this is unexpected, like your comment about map vs. 
flatMap, which suggests there may be some fundamental misunderstandings about 
Spark.

There is no reproducible issue here that demonstrates behavior that doesn't 
make sense or is inconsistent with an API, which is a key component of a bug 
report.

Here's a more productive way forward: propose a concrete change that you think 
resolves the problem.
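
Editor's note: one concrete shape such a proposal could take on the application 
side, sketched here under the assumption that the contested state is a mutable 
java.util.List captured by a DStream closure (class and method names are the 
editor's, not Spark API): mutate under a lock and hand Spark only immutable 
snapshots, so background serialization never observes a list mid-mutation.

{code:scala}
import java.util.{ArrayList => JArrayList, Collections, List => JList}

// Hypothetical application-side holder: batch code mutates only `working`;
// anything captured by a Spark closure is an immutable snapshot copy.
class SnapshottedState[T] extends Serializable {
  private val working = new JArrayList[T]()

  def mutate(f: JList[T] => Unit): Unit = synchronized { f(working) }

  // The copy is taken under the same lock, so serializing the snapshot can
  // never race with a mutation of `working`.
  def snapshot(): JList[T] = synchronized {
    Collections.unmodifiableList(new JArrayList[T](working))
  }
}
{code}

A DStream transformation would then capture `state.snapshot()` rather than the 
mutable list itself.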

> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstre

[jira] [Commented] (SPARK-21785) Support create table from a file schema

2017-10-04 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192362#comment-16192362
 ] 

Takeshi Yamamuro commented on SPARK-21785:
--

Is `CREATE TABLE xxx USING parquet OPTIONS (path 'xxx')` not enough for your 
case?
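
(Editor's note) For reference, a sketch of what that suggestion looks like next 
to a schema-reuse variant closer to the wish in this ticket; paths and table 
names are placeholders, and the second form uses the Catalog API available in 
Spark 2.x.

{code:scala}
// Option 1: point a table directly at the existing Parquet data.
spark.sql("CREATE TABLE test USING parquet OPTIONS (path '/user/test/abc')")

// Option 2: reuse only the schema of an existing file for a table stored elsewhere.
val schema = spark.read.parquet("/user/test/abc/a.snappy.parquet").schema
spark.catalog.createExternalTable("test2", "parquet", schema,
  Map("path" -> "/user/test/def"))
{code}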

> Support create table from a file schema
> ---
>
> Key: SPARK-21785
> URL: https://issues.apache.org/jira/browse/SPARK-21785
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: liupengcheng
>  Labels: sql
>
> Current Spark does not support creating a table from a file schema
> for example:
> {code:java}
> CREATE EXTERNAL TABLE IF NOT EXISTS test LIKE PARQUET 
> '/user/test/abc/a.snappy.parquet' STORED AS PARQUET LOCATION
> '/user/test/def/';
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-10-04 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192348#comment-16192348
 ] 

Michael N edited comment on SPARK-21999 at 10/5/17 1:46 AM:


No, my app is *not* modifying a collection asynchronously w.r.t. Spark.  You  
misunderstood the issue completely and kept making the same invalid claim.   My 
Streaming Spark application  is run synchronously by Spark Streaming framework 
from batch to batch. And it modifies the data synchronously as part of the 
batch processing. However, Spark framework has another thread that 
*asynchronously* serializes the application objects.

Re-read the original issue and provide the answers to the following questions 
instead of blindly closing issues:

1. In the first place, why does Spark serialize the application objects 
**asynchronously** while the streaming application is running continuously from 
batch to batch ?

2. If Spark needs to do this type of serialization at all, why does it not do 
it at the end of the batch **synchronously**?

Do not blindly close tickets until you fully understand them and provide the 
answers to the guided questions for you to understand the issues. 


was (Author: michaeln_apache):
No, my app is *not* modifying a collection asynchronously w.r.t. Spark.  You  
misunderstood the issue completely and kept making the same invalid claim.   My 
Streaming Spark application  is run synchronously by Spark Streaming framework 
from batch to batch. And it modifies the data synchronously as part of the 
batch processing. However, Spark framework has another thread that 
*asynchronously* serializes the application objects.

Re-read the original issue and provide the answers to the following questions 
instead of blindly closing issues:

1. In the first place, why does Spark serialize the application objects 
**asynchronously** while the streaming application is running continuously from 
batch to batch ?

2. If Spark needs to do this type of serialization at all, why does it not do 
at the end of the batch **synchronously** ?

Do not blindly close tickets until you fully understand them and provide the 
guided questions for you to understand the issues. 

> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.sca

[jira] [Comment Edited] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-10-04 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192348#comment-16192348
 ] 

Michael N edited comment on SPARK-21999 at 10/5/17 1:43 AM:


No, my app is *not* modifying a collection asynchronously w.r.t. Spark.  You  
misunderstood the issue completely and kept making the same invalid claim.   My 
Streaming Spark application  is run synchronously by Spark Streaming framework 
from batch to batch. And it modifies the data synchronously as part of the 
batch processing. However, Spark framework has another thread that 
*asynchronously* serializes the application objects.

Re-read the original issue and provide the answers to the following questions 
instead of blindly closing issues:

1. In the first place, why does Spark serialize the application objects 
**asynchronously** while the streaming application is running continuously from 
batch to batch ?

2. If Spark needs to do this type of serialization at all, why does it not do 
it at the end of the batch **synchronously**?

Do not blindly close tickets until you fully understand them and provide the 
guided questions for you to understand the issues. 


was (Author: michaeln_apache):
No, my app is *not* modifying a collection asynchronously w.r.t. Spark.  You  
misunderstood the issue completely and kept making the same invalid claim.   My 
Streaming Spark application  is run synchronously by Spark Streaming framework. 
However, Spark framework has another thread that *asynchronously* serializes 
the application objects.

Re-read the original issue and provide the answers to the following questions 
instead of blindly closing issues:

1. In the first place, why does Spark serialize the application objects 
**asynchronously** while the streaming application is running continuously from 
batch to batch ?

2. If Spark needs to do this type of serialization at all, why does it not do 
at the end of the batch **synchronously** ?

> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrComp

[jira] [Comment Edited] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-04 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192352#comment-16192352
 ] 

Michael N edited comment on SPARK-22163 at 10/5/17 1:42 AM:


It is obvious that you don't understand the difference between design flaws and 
coding bugs; in particular, you have not been able to provide answers to the 
following questions:

1. In the first place, why does Spark serialize the application objects 
*asynchronously* while the streaming application is running continuously from 
batch to batch ?

2. If Spark needs to do this type of serialization at all, why does it not do 
it at the end of the batch *synchronously*?

Instead of blindly closing tickets, you need to either find the answers and 
post them here, or let someone else who is capable of addressing them do so. 

Btw, your response to the ticket 
https://issues.apache.org/jira/browse/SPARK-21999 where you said

  "Your app is modifying a collection asynchronously w.r.t. Spark. Right"

confirmed that you do not understand the issue.  My app is *not* modifying a 
collection asynchronously w.r.t. Spark.   So you kept making the same invalid 
claim and kept closing the ticket that you do not understand.   My Streaming 
Spark application  is run synchronously by Spark Streaming framework from batch 
to batch. And it modifies the data synchronously as part of the batch 
processing. However, Spark framework has another thread that *asynchronously* 
serializes the application objects.
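
Editor's note: a sketch, with illustrative names only, of an application-side 
state shape that is immune to this race no matter when the framework serializes 
it, assuming a single writer thread per batch: mutate by atomically replacing an 
immutable list instead of editing a mutable one in place.

{code:scala}
// Each "modification" builds a new immutable List and publishes it atomically.
// Whatever instant another thread serializes this object, it sees a fully
// constructed list that is never mutated afterwards, so the
// ArrayList.writeObject-style ConcurrentModificationException cannot occur.
class BatchState[T] extends Serializable {
  @volatile private var current: List[T] = Nil

  def add(x: T): Unit = { current = x :: current }   // single writer assumed

  def view: List[T] = current
}
{code}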


was (Author: michaeln_apache):
It is obvious that you don't understand the differences between design flaws vs 
coding bugs, particularly you have not been able to provide the answers the the 
questions of

1. In the first place, why does Spark serialize the application objects 
*asynchronously* while the streaming application is running continuously from 
batch to batch ?

2. If Spark needs to do this type of serialization at all, why does it not do 
at the end of the batch *synchronously* ?

Instead of blindly closing tickets, you need to either either find their 
answers and post them here or let someone else who is capable to address them. 


> Design Issue of Spark Streaming that Causes Random Run-time Exception
> -
>
> Key: SPARK-22163
> URL: https://issues.apache.org/jira/browse/SPARK-22163
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Structured Streaming
>Affects Versions: 2.2.0
> Environment: Spark Streaming
> Kafka
> Linux
>Reporter: Michael N
>Priority: Critical
>
> The application objects can contain List and can be modified dynamically as 
> well.   However, Spark Streaming framework asynchronously serializes the 
> application's objects as the application runs.  Therefore, it causes random 
> run-time exception on the List when the Spark Streaming framework happens to 
> serialize the application's objects while the application modifies a List in 
> its own object.  
> In fact, there are multiple bugs reported about
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject
> that are permutations of the same root cause. So the design issue of the Spark 
> streaming framework is that it does this serialization asynchronously.  
> Instead, it should either
> 1. do this serialization synchronously. This is preferred to eliminate the 
> issue completely.  Or
> 2. Allow it to be configured per application whether to do this serialization 
> synchronously or asynchronously, depending on the nature of each application.
> Also, Spark documentation should describe the conditions that trigger Spark 
> to do this type of serialization asynchronously, so the applications can work 
> around them until the fix is provided. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-10-04 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192348#comment-16192348
 ] 

Michael N edited comment on SPARK-21999 at 10/5/17 1:39 AM:


No, my app is *not* modifying a collection asynchronously w.r.t. Spark.  You  
misunderstood the issue completely and kept making the same invalid claim.   My 
Streaming Spark application  is run synchronously by Spark Streaming framework. 
However, Spark framework has another thread that *asynchronously* serializes 
the application objects.

Re-read the original issue and provide the answers to the following questions 
instead of blindly closing issues:

1. In the first place, why does Spark serialize the application objects 
**asynchronously** while the streaming application is running continuously from 
batch to batch ?

2. If Spark needs to do this type of serialization at all, why does it not do 
at the end of the batch **synchronously** ?


was (Author: michaeln_apache):
No, my app is *not* modifying a collection asynchronously w.r.t. Spark.  You  
misunderstood the issue completely and kept making the same invalid claim.   My 
Streaming Spark application runs synchronously. It is Spark Streaming framework 
that *asynchronously* serializes the application objects.

Re-read the original issue and provide the answers to the following questions 
instead of blindly closing issues:

1. In the first place, why does Spark serialize the application objects 
**asynchronously** while the streaming application is running continuously from 
batch to batch ?

2. If Spark needs to do this type of serialization at all, why does it not do 
at the end of the batch **synchronously** ?

> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> 

[jira] [Created] (SPARK-22205) Incorrect result with user defined agg function followed by a non deterministic function

2017-10-04 Thread kanika dhuria (JIRA)
kanika dhuria created SPARK-22205:
-

 Summary: Incorrect result with  user defined agg function followed 
by a non deterministic function 
 Key: SPARK-22205
 URL: https://issues.apache.org/jira/browse/SPARK-22205
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: kanika dhuria


Repro:
Create a user-defined aggregate function like

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class AnyUdaf(dtype: DataType) extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("v", dtype) :: Nil)

  def bufferSchema: StructType = StructType(StructField("v", dtype) :: Nil)

  def dataType: DataType = dtype

  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = { buffer(0) = null }

  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (buffer(0) == null) buffer(0) = input(0)
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    if (buffer1(0) == null) buffer1(0) = buffer2(0)
  }

  def evaluate(buffer: Row): Any = { buffer(0) }
}

Use this in an agg and follow it with a non-deterministic function like
monotonically_increasing_id:

// dtype argument added so the repro compiles; c2 holds integers
Seq(0, 1).toDF("c1").select(col("c1"), lit(10)).toDF("c1", "c2")
  .select(col("c1"), col("c2")).toDF("c1", "c2")
  .groupBy(col("c1")).agg(new AnyUdaf(IntegerType)(col("c2"))).toDF("c1", "c2")
  .select(lit(5), col("c2"), monotonically_increasing_id()).show

+---+---+-----------------------------+
|  5| c2|monotonically_increasing_id()|
+---+---+-----------------------------+
|  5| 10|                            0|
|  5| 10|                            0|
+---+---+-----------------------------+
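
Editor's note: for contrast, a minimal sketch, not part of the original report, 
of the pipeline without the aggregate in front; monotonically_increasing_id is 
documented to yield a distinct id per row (the exact values depend on partition 
layout).

{code:scala}
import org.apache.spark.sql.functions._
import spark.implicits._   // assumes a spark-shell style `spark` session

Seq(0, 1).toDF("c1")
  .select(col("c1"), monotonically_increasing_id().as("id"))
  .show()   // two rows, two distinct ids
{code}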






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-04 Thread Michael N (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael N reopened SPARK-22163:
---

It is obvious that you don't understand the difference between design flaws and 
coding bugs; in particular, you have not been able to provide answers to the 
following questions:

1. In the first place, why does Spark serialize the application objects 
*asynchronously* while the streaming application is running continuously from 
batch to batch?

2. If Spark needs to do this type of serialization at all, why does it not do 
it at the end of the batch *synchronously*?

Instead of blindly closing tickets, you need to either find the answers and 
post them here, or let someone else who is capable of addressing them do so. 


> Design Issue of Spark Streaming that Causes Random Run-time Exception
> -
>
> Key: SPARK-22163
> URL: https://issues.apache.org/jira/browse/SPARK-22163
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Structured Streaming
>Affects Versions: 2.2.0
> Environment: Spark Streaming
> Kafka
> Linux
>Reporter: Michael N
>Priority: Critical
>
> The application objects can contain List and can be modified dynamically as 
> well.   However, Spark Streaming framework asynchronously serializes the 
> application's objects as the application runs.  Therefore, it causes random 
> run-time exception on the List when the Spark Streaming framework happens to 
> serialize the application's objects while the application modifies a List in 
> its own object.  
> In fact, there are multiple bugs reported about
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject
> that are permutations of the same root cause. So the design issue of the Spark 
> streaming framework is that it does this serialization asynchronously.  
> Instead, it should either
> 1. do this serialization synchronously. This is preferred to eliminate the 
> issue completely.  Or
> 2. Allow it to be configured per application whether to do this serialization 
> synchronously or asynchronously, depending on the nature of each application.
> Also, Spark documentation should describe the conditions that trigger Spark 
> to do this type of serialization asynchronously, so the applications can work 
> around them until the fix is provided. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-10-04 Thread Michael N (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael N reopened SPARK-21999:
---

No, my app is *not* modifying a collection asynchronously w.r.t. Spark. You 
misunderstood the issue completely and kept making the same invalid claim. My 
Spark Streaming application runs synchronously. It is the Spark Streaming 
framework that *asynchronously* serializes the application objects.

Re-read the original issue and provide answers to the following questions 
instead of blindly closing issues:

1. In the first place, why does Spark serialize the application objects 
**asynchronously** while the streaming application is running continuously from 
batch to batch?

2. If Spark needs to do this type of serialization at all, why does it not do 
it at the end of the batch **synchronously**?

> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(Transf

[jira] [Created] (SPARK-22204) Explain output for SQL with commands shows no optimization

2017-10-04 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-22204:
--

 Summary: Explain output for SQL with commands shows no optimization
 Key: SPARK-22204
 URL: https://issues.apache.org/jira/browse/SPARK-22204
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Andrew Ash


When displaying the explain output for a basic SELECT query, the query plan 
changes as expected from analyzed -> optimized stages.  But when putting that 
same query into a command, for example {{CREATE TABLE}}, it appears that the 
optimization doesn't take place.

In Spark shell:

Explain output for a {{SELECT}} statement shows optimization:
{noformat}
scala> spark.sql("SELECT a FROM (SELECT a FROM (SELECT a FROM (SELECT 1 AS a) 
AS b) AS c) AS d").explain(true)
== Parsed Logical Plan ==
'Project ['a]
+- 'SubqueryAlias d
   +- 'Project ['a]
  +- 'SubqueryAlias c
 +- 'Project ['a]
+- SubqueryAlias b
   +- Project [1 AS a#29]
  +- OneRowRelation

== Analyzed Logical Plan ==
a: int
Project [a#29]
+- SubqueryAlias d
   +- Project [a#29]
  +- SubqueryAlias c
 +- Project [a#29]
+- SubqueryAlias b
   +- Project [1 AS a#29]
  +- OneRowRelation

== Optimized Logical Plan ==
Project [1 AS a#29]
+- OneRowRelation

== Physical Plan ==
*Project [1 AS a#29]
+- Scan OneRowRelation[]

scala> 
{noformat}

But the same command run inside {{CREATE TABLE}} does not:

{noformat}
scala> spark.sql("CREATE TABLE IF NOT EXISTS tmptable AS SELECT a FROM (SELECT 
a FROM (SELECT a FROM (SELECT 1 AS a) AS b) AS c) AS d").explain(true)
== Parsed Logical Plan ==
'CreateTable `tmptable`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
Ignore
+- 'Project ['a]
   +- 'SubqueryAlias d
  +- 'Project ['a]
 +- 'SubqueryAlias c
+- 'Project ['a]
   +- SubqueryAlias b
  +- Project [1 AS a#33]
 +- OneRowRelation

== Analyzed Logical Plan ==
CreateHiveTableAsSelectCommand [Database:default}, TableName: tmptable, 
InsertIntoHiveTable]
   +- Project [a#33]
  +- SubqueryAlias d
 +- Project [a#33]
+- SubqueryAlias c
   +- Project [a#33]
  +- SubqueryAlias b
 +- Project [1 AS a#33]
+- OneRowRelation

== Optimized Logical Plan ==
CreateHiveTableAsSelectCommand [Database:default}, TableName: tmptable, 
InsertIntoHiveTable]
   +- Project [a#33]
  +- SubqueryAlias d
 +- Project [a#33]
+- SubqueryAlias c
   +- Project [a#33]
  +- SubqueryAlias b
 +- Project [1 AS a#33]
+- OneRowRelation

== Physical Plan ==
CreateHiveTableAsSelectCommand CreateHiveTableAsSelectCommand 
[Database:default}, TableName: tmptable, InsertIntoHiveTable]
   +- Project [a#33]
  +- SubqueryAlias d
 +- Project [a#33]
+- SubqueryAlias c
   +- Project [a#33]
  +- SubqueryAlias b
 +- Project [1 AS a#33]
+- OneRowRelation

scala>
{noformat}

Note that there is no change between the analyzed and optimized plans when run 
in a command.


This is misleading my users, as they think that there is no optimization 
happening in the query!
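
Editor's note: a workaround sketch, not a fix for the explain output itself, 
that makes the optimization visible to such users: explain the SELECT on its 
own, where the collapse to a single Project over OneRowRelation shows up, and 
issue the CTAS separately. Names follow the repro above.

{code:scala}
val query =
  "SELECT a FROM (SELECT a FROM (SELECT a FROM (SELECT 1 AS a) AS b) AS c) AS d"

// Optimized plan collapses to: Project [1 AS a] +- OneRowRelation
spark.sql(query).explain(true)

// Run the command itself separately; its explain output is the one discussed above.
spark.sql(s"CREATE TABLE IF NOT EXISTS tmptable AS $query")
{code}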



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21785) Support create table from a file schema

2017-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21785:


Assignee: (was: Apache Spark)

> Support create table from a file schema
> ---
>
> Key: SPARK-21785
> URL: https://issues.apache.org/jira/browse/SPARK-21785
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: liupengcheng
>  Labels: sql
>
> Current Spark does not support creating a table from a file schema
> for example:
> {code:java}
> CREATE EXTERNAL TABLE IF NOT EXISTS test LIKE PARQUET 
> '/user/test/abc/a.snappy.parquet' STORED AS PARQUET LOCATION
> '/user/test/def/';
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21785) Support create table from a file schema

2017-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192244#comment-16192244
 ] 

Apache Spark commented on SPARK-21785:
--

User 'CrazyJacky' has created a pull request for this issue:
https://github.com/apache/spark/pull/19434

> Support create table from a file schema
> ---
>
> Key: SPARK-21785
> URL: https://issues.apache.org/jira/browse/SPARK-21785
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: liupengcheng
>  Labels: sql
>
> Current Spark does not support creating a table from a file schema
> for example:
> {code:java}
> CREATE EXTERNAL TABLE IF NOT EXISTS test LIKE PARQUET 
> '/user/test/abc/a.snappy.parquet' STORED AS PARQUET LOCATION
> '/user/test/def/';
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21785) Support create table from a file schema

2017-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21785:


Assignee: Apache Spark

> Support create table from a file schema
> ---
>
> Key: SPARK-21785
> URL: https://issues.apache.org/jira/browse/SPARK-21785
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: liupengcheng
>Assignee: Apache Spark
>  Labels: sql
>
> Current Spark does not support creating a table from a file schema
> for example:
> {code:java}
> CREATE EXTERNAL TABLE IF NOT EXISTS test LIKE PARQUET 
> '/user/test/abc/a.snappy.parquet' STORED AS PARQUET LOCATION
> '/user/test/def/';
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-10-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21999.
---
Resolution: Not A Problem

I am not sure what you mean by the bold-faced "asynchronously". Your app is 
modifying a collection asynchronously w.r.t. Spark, right? You said as much.
This is why I'm asking you to back up and present a specific example on the 
mailing list.

The map function is not a flaw, in Spark or Scala or any other functional 
language. That is not the reason flatMap exists. If that's the same 
misunderstanding going on here, then, no, this isn't a problem.
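
Editor's note: a minimal spark-shell illustration, the editor's own, of the 
map vs. flatMap distinction referenced above (assumes the shell's `sc`):

{code:scala}
val rdd = sc.parallelize(Seq("a b", "c"))

rdd.map(_.split(" ")).collect()      // Array(Array(a, b), Array(c)) - one output element per input
rdd.flatMap(_.split(" ")).collect()  // Array(a, b, c) - each produced collection is flattened
{code}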


> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.

[jira] [Closed] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-10-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-21999.
-

> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:42)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DSt

[jira] [Resolved] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22163.
---
Resolution: Duplicate

If one accepts your premise, then the bug is simply a manifestation of the 
'design flaw'. These aren't separate. Please don't keep opening this.

> Design Issue of Spark Streaming that Causes Random Run-time Exception
> -
>
> Key: SPARK-22163
> URL: https://issues.apache.org/jira/browse/SPARK-22163
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Structured Streaming
>Affects Versions: 2.2.0
> Environment: Spark Streaming
> Kafka
> Linux
>Reporter: Michael N
>Priority: Critical
>
> The application objects can contain Lists and can be modified dynamically as 
> well. However, the Spark Streaming framework asynchronously serializes the 
> application's objects as the application runs. Therefore, it causes random 
> run-time exceptions on a List when the Spark Streaming framework happens to 
> serialize the application's objects while the application is modifying a List in 
> its own object.  
> In fact, there are multiple bugs reported about
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject
> that are permutations of the same root cause. The design issue is that the Spark 
> Streaming framework does this serialization asynchronously. Instead, it should 
> either
> 1. do this serialization synchronously. This is preferred because it eliminates 
> the issue completely.  Or
> 2. allow it to be configured per application whether to do this serialization 
> synchronously or asynchronously, depending on the nature of each application.
> Also, the Spark documentation should describe the conditions that trigger Spark 
> to do this type of serialization asynchronously, so that applications can work 
> around them until the fix is provided. 
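
One possible application-side workaround, sketched under the assumption that the 
application itself owns the mutable collection (the class and field names below 
are illustrative, not Spark APIs): keep the state as an immutable list behind a 
volatile reference and swap it atomically, so a serialization pass running on 
another thread never observes a collection mid-mutation.

{noformat}
// Illustrative only: replace the snapshot atomically instead of mutating a
// java.util.ArrayList in place. java.util.concurrent.CopyOnWriteArrayList is an
// alternative when a java.util collection is required.
class StreamState extends Serializable {
  @volatile private var items: List[String] = Nil   // immutable snapshot

  def add(item: String): Unit = synchronized {
    items = item :: items                           // build a new list; never mutate in place
  }

  def snapshot: List[String] = items                // safe to serialize at any time
}
{noformat}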



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-22163.
-

> Design Issue of Spark Streaming that Causes Random Run-time Exception
> -
>
> Key: SPARK-22163
> URL: https://issues.apache.org/jira/browse/SPARK-22163
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Structured Streaming
>Affects Versions: 2.2.0
> Environment: Spark Streaming
> Kafka
> Linux
>Reporter: Michael N
>Priority: Critical
>
> The application objects can contain Lists and can be modified dynamically as 
> well. However, the Spark Streaming framework asynchronously serializes the 
> application's objects as the application runs. Therefore, it causes random 
> run-time exceptions on a List when the Spark Streaming framework happens to 
> serialize the application's objects while the application is modifying a List in 
> its own object.  
> In fact, there are multiple bugs reported about
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject
> that are permutations of the same root cause. The design issue is that the Spark 
> Streaming framework does this serialization asynchronously. Instead, it should 
> either
> 1. do this serialization synchronously. This is preferred because it eliminates 
> the issue completely.  Or
> 2. allow it to be configured per application whether to do this serialization 
> synchronously or asynchronously, depending on the nature of each application.
> Also, the Spark documentation should describe the conditions that trigger Spark 
> to do this type of serialization asynchronously, so that applications can work 
> around them until the fix is provided. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22201) Dataframe describe includes string columns

2017-10-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192231#comment-16192231
 ] 

Sean Owen commented on SPARK-22201:
---

If you're saying that string-valued columns should never appear in the output: 
I'm not sure I agree there's a problem. pandas will also output stats for such 
columns, for example, and simply leave mean/stdev blank or NaN. The behavior 
exists now, so there would have to be a good reason to disallow it entirely. 

If you're just saying it shouldn't be there by default, again I'm not sure that's 
the least surprising choice. The API lets you choose a subset of columns, and the 
natural default with no args is to show all columns.

pandas seems to have an argument that controls whether to show string columns or 
not. That's possible here too, but so is simply selecting the columns of 
interest.
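
A minimal sketch of both options, using only APIs that already exist in 2.x (the 
column names in the one-liner are hypothetical):

{noformat}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.NumericType

// Option A: pre-filter to the numeric columns, then describe those.
def describeNumeric(df: DataFrame): DataFrame = {
  val numericCols = df.schema.fields
    .filter(_.dataType.isInstanceOf[NumericType])
    .map(_.name)
  df.describe(numericCols: _*)
}

// Option B: describe(cols: String*) already accepts an explicit subset, e.g.
// df.describe("price", "quantity")
{noformat}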

> Dataframe describe includes string columns
> --
>
> Key: SPARK-22201
> URL: https://issues.apache.org/jira/browse/SPARK-22201
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: cold gin
>
> As per the api documentation, the default no-arg Dataframe describe() 
> function should only include numerical column types, but it is including 
> String types as well. This creates unusable statistical results (for example, 
> max returns "V8903" for one of the string columns in my dataset), and this 
> also leads to stacktraces when you run show() on the resulting dataframe 
> returned from describe().
> There also appears to be several related issues to this:
> https://issues.apache.org/jira/browse/SPARK-16468
> https://issues.apache.org/jira/browse/SPARK-16429
> But SPARK-16429 does not match what the default API documentation says: only 
> Int, Double, etc. (numeric) columns should be included when generating the 
> statistics. 
> Perhaps this reveals the need for a new function to produce stats that make 
> sense only for string columns, or else an additional parameter to describe() 
> to filter in/out certain column types? 
> In summary, the *default* describe api behavior (no arg behavior) should not 
> include string columns. Note that boolean columns are correctly excluded by 
> describe()



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator

2017-10-04 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192217#comment-16192217
 ] 

yuhao yang commented on SPARK-3181:
---

Regarding whether to separate Huber loss into an independent Estimator, I 
don't see a direct conflict.

IMO, LinearRegression should act as an all-in-one Estimator that allows the user 
to combine whichever loss function, optimizer and regularization they want. It 
should target flexibility and also provide some fundamental infrastructure for 
regression algorithms.

In the meantime, we may also support HuberRegression, RidgeRegression and 
others as independent Estimators, which are more convenient but less flexible 
(and can expose algorithm-specific parameters). As mentioned by Seth, this would 
require better code abstraction and a plugin interface. Besides 
loss/prediction/optimizer, we also need to provide infrastructure for model 
summary and serialization. This should only happen after we can compose an 
Estimator like HuberRegression without noticeable code duplication. 
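
As a usage-level sketch of that all-in-one shape: the loss setter below is the 
parameter proposed in this ticket and is not in the 2.2.0 API, while the other 
setters already exist on LinearRegression.

{noformat}
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setLoss("huber")          // proposed here; not available in 2.2.0
  .setRegParam(0.1)          // existing: regularization strength
  .setElasticNetParam(0.0)   // existing: L2/L1 mix (L2 only for huber)
  .setSolver("l-bfgs")       // existing: optimizer choice
{noformat}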


> Add Robust Regression Algorithm with Huber Estimator
> 
>
> Key: SPARK-3181
> URL: https://issues.apache.org/jira/browse/SPARK-3181
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Fan Jiang
>Assignee: Yanbo Liang
>  Labels: features
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Linear least squares estimates assume the errors are normally distributed and 
> can behave badly when the errors are heavy-tailed. In practice we get 
> various types of data. We need to include Robust Regression to employ a 
> fitting criterion that is not as vulnerable as least squares.
> In 1973, Huber introduced M-estimation for regression which stands for 
> "maximum likelihood type". The method is resistant to outliers in the 
> response variable and has been widely used.
> The new feature for MLlib will contain 3 new files
> /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala
> /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala
> /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala
> and one new class HuberRobustGradient in 
> /main/scala/org/apache/spark/mllib/optimization/Gradient.scala
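
For reference, the Huber loss being discussed, written out in plain Scala 
(independent of any Spark class; delta is the robustness threshold):

{noformat}
// Quadratic near zero, linear in the tails: this is what makes the estimate
// resistant to heavy-tailed errors and outliers in the response.
def huberLoss(residual: Double, delta: Double): Double =
  if (math.abs(residual) <= delta) 0.5 * residual * residual
  else delta * (math.abs(residual) - 0.5 * delta)

// Gradient with respect to the residual, as a gradient-based optimizer needs it.
def huberGradient(residual: Double, delta: Double): Double =
  if (math.abs(residual) <= delta) residual
  else delta * math.signum(residual)
{noformat}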



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3162) Train DecisionTree locally when possible

2017-10-04 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192201#comment-16192201
 ] 

Siddharth Murching edited comment on SPARK-3162 at 10/4/17 11:35 PM:
-

Commenting here to note that I'd like to resume work on this issue; I've made a 
new PR^


was (Author: siddharth murching):
Commenting here to note that I'm resuming work on this issue; I've made a new 
PR^

> Train DecisionTree locally when possible
> 
>
> Key: SPARK-3162
> URL: https://issues.apache.org/jira/browse/SPARK-3162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Improvement: communication
> Currently, every level of a DecisionTree is trained in a distributed manner.  
> However, at deeper levels in the tree, it is possible that a small set of 
> training data will be matched with any given node.  If the node’s training 
> data can fit on one machine’s memory, it may be more efficient to shuffle the 
> data and do local training for the rest of the subtree rooted at that node.
> Note: It is possible that local training would become possible at different 
> levels in different branches of the tree.  There are multiple options for 
> handling this case:
> (1) Train in a distributed fashion until all remaining nodes can be trained 
> locally.  This would entail training multiple levels at once (locally).
> (2) Train branches locally when possible, and interleave this with 
> distributed training of the other branches.
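
A rough sketch of the decision described above; everything here is illustrative 
(LabeledRow, trainLocally and the byte estimate are stand-ins, not the actual 
implementation in the PR):

{noformat}
import org.apache.spark.rdd.RDD

case class LabeledRow(label: Double, features: Array[Double])

// Stand-in for whatever single-machine learner the local path would use.
def trainLocally(rows: Array[LabeledRow]): Unit = ()

def trainSubtree(nodeData: RDD[LabeledRow], maxLocalBytes: Long): Unit = {
  // Rough size estimate: 8 bytes per double, one label plus the feature vector.
  val bytesPerRow = 8L * (1 + nodeData.first().features.length)
  if (bytesPerRow * nodeData.count() <= maxLocalBytes) {
    trainLocally(nodeData.collect())   // node's data fits on one machine
  } else {
    // otherwise keep training this level in a distributed fashion (option 1 above)
  }
}
{noformat}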



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3162) Train DecisionTree locally when possible

2017-10-04 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192201#comment-16192201
 ] 

Siddharth Murching commented on SPARK-3162:
---

Commenting here to note that I'm resuming work on this issue; I've made a new 
PR^

> Train DecisionTree locally when possible
> 
>
> Key: SPARK-3162
> URL: https://issues.apache.org/jira/browse/SPARK-3162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Improvement: communication
> Currently, every level of a DecisionTree is trained in a distributed manner.  
> However, at deeper levels in the tree, it is possible that a small set of 
> training data will be matched with any given node.  If the node’s training 
> data can fit on one machine’s memory, it may be more efficient to shuffle the 
> data and do local training for the rest of the subtree rooted at that node.
> Note: It is possible that local training would become possible at different 
> levels in different branches of the tree.  There are multiple options for 
> handling this case:
> (1) Train in a distributed fashion until all remaining nodes can be trained 
> locally.  This would entail training multiple levels at once (locally).
> (2) Train branches locally when possible, and interleave this with 
> distributed training of the other branches.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-04 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192110#comment-16192110
 ] 

Michael N edited comment on SPARK-22163 at 10/4/17 10:33 PM:
-

Please distinguish between code bugs and design flaws.  That is why this ticket 
is separate from the other ticket.

Here is an analogy to give more guidance as to why this is a design flaw. The 
older Spark map framework has a major design flaw, where it makes a function 
call for every single object. Its code implementation matched its design, 
but its design flaw is that it has massive overhead when there are millions or 
billions of objects, because it needs to make the same function call millions or 
billions of times, once per object. 

The questions posted previously and re-posted below are intended to provide 
insight into why this issue is a design flaw in Spark's framework serializing 
the application objects of a streaming application that runs continuously.  
Please make sure you understand the difference between code bugs and design 
flaws first, and provide the answers to the questions below and resolve them 
before responding further, instead of arbitrarily closing this ticket.

1. In the first place, why does Spark serialize the application objects 
***asynchronously*** while the streaming application is running continuously 
from batch to batch?

2. If Spark needs to do this type of serialization at all, why does it not do it 
at the end of the batch ***synchronously***?



was (Author: michaeln_apache):
Please distinguish between code bug vs design flaws.  That is why this ticket 
is separate from the other ticket.

The analogy is the design flaw with the older Spark's map framework where it 
makes a function call for every single object. its code implementation is ok, 
but its design flaw is that it has massive overhead when there are millions and 
billions of objects.  On the other hand, the newer flatMap framework make one 
function call for a list of objects via the Iterator. 

Here are the questions to provide the insights as to why this issue is a design 
flaw of Spark's framework trying to serialize application objects of a 
Streaming application that runs continuously.  Please make sure you understand 
the differences between code bugs vs design flaws first, and provide the 
answers to the questions below and resolve them, before respond further, 
instead of arbitrarily closing this ticket.

1. In the first place, why does Spark serialize the application objects 
***asynchronously*** while the streaming application is running continuously 
from batch to batch ?

2. If Spark needs to do this type of serialization at all, why does it not do 
at the end of the batch ***synchronously*** ?


> Design Issue of Spark Streaming that Causes Random Run-time Exception
> -
>
> Key: SPARK-22163
> URL: https://issues.apache.org/jira/browse/SPARK-22163
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Structured Streaming
>Affects Versions: 2.2.0
> Environment: Spark Streaming
> Kafka
> Linux
>Reporter: Michael N
>Priority: Critical
>
> The application objects can contain Lists and can be modified dynamically as 
> well. However, the Spark Streaming framework asynchronously serializes the 
> application's objects as the application runs. Therefore, it causes random 
> run-time exceptions on a List when the Spark Streaming framework happens to 
> serialize the application's objects while the application is modifying a List in 
> its own object.  
> In fact, there are multiple bugs reported about
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject
> that are permutations of the same root cause. The design issue is that the Spark 
> Streaming framework does this serialization asynchronously. Instead, it should 
> either
> 1. do this serialization synchronously. This is preferred because it eliminates 
> the issue completely.  Or
> 2. allow it to be configured per application whether to do this serialization 
> synchronously or asynchronously, depending on the nature of each application.
> Also, the Spark documentation should describe the conditions that trigger Spark 
> to do this type of serialization asynchronously, so that applications can work 
> around them until the fix is provided. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-10-04 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192135#comment-16192135
 ] 

Michael N edited comment on SPARK-21999 at 10/4/17 10:32 PM:
-

The posted questions were intended to show why this is a design issue. You have 
not been able to provide the answers. So before you respond further and claim 
otherwise, please re-read the posted questions and provide the answers to those 
questions first.

1. In the first place, why does Spark serialize the application objects 
***asynchronously*** while the streaming application is running continuously 
from batch to batch?

2. If Spark needs to do this type of serialization at all, why does it not do it 
at the end of the batch ***synchronously***?

Here is an analogy to give more guidance as to why this is a design flaw. The 
older Spark map framework has a major design flaw, where it makes a function 
call for every single object. Its code implementation matched its design, 
but its design flaw is that it has massive overhead when there are millions or 
billions of objects, because it needs to make the same function call millions or 
billions of times, once per object. 

On the other hand, the newer flatMap framework makes one function call for a 
list of objects via the Iterator.



was (Author: michaeln_apache):
The posted questions were intended to show why this is a design issue. You have 
not been able to provide the answers. So before you respond further and claim 
otherwise, please re-read the posted questions and provide the answers to those 
questions first.

1. In the first place, why does Spark serialize the application objects 
***asynchronously*** while the streaming application is running continuously 
from batch to batch ?

2. If Spark needs to do this type of serialization at all, why does it not do 
at the end of the batch ***synchronously*** ?

Here is the analogy to give more guidance as to why this is a design flaw. The 
older Spark's map framework has a major design flaw, where it makes a function 
call for every single object. its code implementation matched with its design, 
but its design flaw is that it has massive overhead when there are millions and 
billions of objects. On the other hand, the newer flatMap framework make one 
function call for a list of objects via the Iterator.


> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(M

[jira] [Commented] (SPARK-3162) Train DecisionTree locally when possible

2017-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192166#comment-16192166
 ] 

Apache Spark commented on SPARK-3162:
-

User 'smurching' has created a pull request for this issue:
https://github.com/apache/spark/pull/19433

> Train DecisionTree locally when possible
> 
>
> Key: SPARK-3162
> URL: https://issues.apache.org/jira/browse/SPARK-3162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Improvement: communication
> Currently, every level of a DecisionTree is trained in a distributed manner.  
> However, at deeper levels in the tree, it is possible that a small set of 
> training data will be matched with any given node.  If the node’s training 
> data can fit on one machine’s memory, it may be more efficient to shuffle the 
> data and do local training for the rest of the subtree rooted at that node.
> Note: It is possible that local training would become possible at different 
> levels in different branches of the tree.  There are multiple options for 
> handling this case:
> (1) Train in a distributed fashion until all remaining nodes can be trained 
> locally.  This would entail training multiple levels at once (locally).
> (2) Train branches locally when possible, and interleave this with 
> distributed training of the other branches.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-10-04 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192135#comment-16192135
 ] 

Michael N edited comment on SPARK-21999 at 10/4/17 10:31 PM:
-

The posted questions were intended to show why this is a design issue. You have 
not been able to provide the answers. So before you respond further and claim 
otherwise, please re-read the posted questions and provide the answers to those 
questions first.

1. In the first place, why does Spark serialize the application objects 
***asynchronously*** while the streaming application is running continuously 
from batch to batch?

2. If Spark needs to do this type of serialization at all, why does it not do it 
at the end of the batch ***synchronously***?

Here is an analogy to give more guidance as to why this is a design flaw. The 
older Spark map framework has a major design flaw, where it makes a function 
call for every single object. Its code implementation matched its design, 
but its design flaw is that it has massive overhead when there are millions or 
billions of objects. On the other hand, the newer flatMap framework makes one 
function call for a list of objects via the Iterator.



was (Author: michaeln_apache):
The posted questions were intended to show why this is a design issue. You have 
not been able to provide the answers. So before you respond further and claim 
otherwise, please re-read the posted questions and provide the answers to those 
questions first.
1. In the first place, why does Spark serialize the application objects 
***asynchronously*** while the streaming application is running continuously 
from batch to batch ?

2. If Spark needs to do this type of serialization at all, why does it not do 
at the end of the batch ***synchronously*** ?

Here is the analogy to give more guidance as to why this is a design flaw. The 
older Spark's map framework has a major design flaw, where it makes a function 
call for every single object. its code implementation matched with its design, 
but its design flaw is that it has massive overhead when there are millions and 
billions of objects. On the other hand, the newer flatMap framework make one 
function call for a list of objects via the Iterator.


> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStr

[jira] [Assigned] (SPARK-22203) Add job description for file listing Spark jobs

2017-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22203:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Add job description for file listing Spark jobs
> ---
>
> Key: SPARK-22203
> URL: https://issues.apache.org/jira/browse/SPARK-22203
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> The user may be confused by some 1-task jobs. We can add a job 
> description for these jobs so that the user can figure out what they are.
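
A sketch of what that labeling could look like at the call site, using the 
existing SparkContext.setJobDescription API (the description string here is 
made up, not the wording used in the change):

{noformat}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val sc = spark.sparkContext

sc.setJobDescription("Listing leaf files and directories for the input paths")
// ... trigger the action that runs the otherwise-anonymous 1-task listing job ...
sc.setJobDescription(null)   // reset so later jobs are not mislabeled
{noformat}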



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22203) Add job description for file listing Spark jobs

2017-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22203:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Add job description for file listing Spark jobs
> ---
>
> Key: SPARK-22203
> URL: https://issues.apache.org/jira/browse/SPARK-22203
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> The user may be confused by some 1-task jobs. We can add a job 
> description for these jobs so that the user can figure out what they are.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22203) Add job description for file listing Spark jobs

2017-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192162#comment-16192162
 ] 

Apache Spark commented on SPARK-22203:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/19432

> Add job description for file listing Spark jobs
> ---
>
> Key: SPARK-22203
> URL: https://issues.apache.org/jira/browse/SPARK-22203
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> The user may be confused by some 1-task jobs. We can add a job 
> description for these jobs so that the user can figure out what they are.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-04 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192110#comment-16192110
 ] 

Michael N edited comment on SPARK-22163 at 10/4/17 10:12 PM:
-

Please distinguish between code bugs and design flaws.  That is why this ticket 
is separate from the other ticket.

The analogy is the design flaw in the older Spark map framework, where it 
makes a function call for every single object. Its code implementation is fine, 
but its design flaw is that it has massive overhead when there are millions or 
billions of objects.  On the other hand, the newer flatMap framework makes one 
function call for a list of objects via the Iterator. 

Here are the questions that provide insight into why this issue is a design 
flaw in Spark's framework serializing the application objects of a streaming 
application that runs continuously.  Please make sure you understand the 
difference between code bugs and design flaws first, and provide the answers 
to the questions below and resolve them before responding further, instead of 
arbitrarily closing this ticket.

1. In the first place, why does Spark serialize the application objects 
***asynchronously*** while the streaming application is running continuously 
from batch to batch?

2. If Spark needs to do this type of serialization at all, why does it not do it 
at the end of the batch ***synchronously***?



was (Author: michaeln_apache):
Please distinguish between code bug vs design flaws.  That is why this ticket 
is separate from the other ticket.

The analogy is the design flaw with the older Spark's map framework where it 
makes a function call for every single object. its code implementation is ok, 
but its design flaw is that it has massive overhead when there are millions and 
billions of objects.  On the other hand, the newer flatMap framework make one 
function call for a list of objects via the Iterator. 

Here are the questions to provide the insights as to why this issue is a design 
flaw of Spark's framework trying to serialize application objects of a 
Streaming application that runs continuously.  Please make sure you understand 
the differences between code bugs vs design flaws first, and provide the 
answers to the questions below and resolve them, before respond further, 
instead of arbitrarily closing this ticket.

1. In the first place, why does Spark serialize the application objects 
asynchronously while the streaming application is running continuously from 
batch to batch ?

2. If Spark needs to do this type of serialization at all, why does it not do 
at the end of the batch ?


> Design Issue of Spark Streaming that Causes Random Run-time Exception
> -
>
> Key: SPARK-22163
> URL: https://issues.apache.org/jira/browse/SPARK-22163
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Structured Streaming
>Affects Versions: 2.2.0
> Environment: Spark Streaming
> Kafka
> Linux
>Reporter: Michael N
>Priority: Critical
>
> The application objects can contain Lists and can be modified dynamically as 
> well. However, the Spark Streaming framework asynchronously serializes the 
> application's objects as the application runs. Therefore, it causes random 
> run-time exceptions on a List when the Spark Streaming framework happens to 
> serialize the application's objects while the application is modifying a List in 
> its own object.  
> In fact, there are multiple bugs reported about
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject
> that are permutations of the same root cause. The design issue is that the Spark 
> Streaming framework does this serialization asynchronously. Instead, it should 
> either
> 1. do this serialization synchronously. This is preferred because it eliminates 
> the issue completely.  Or
> 2. allow it to be configured per application whether to do this serialization 
> synchronously or asynchronously, depending on the nature of each application.
> Also, the Spark documentation should describe the conditions that trigger Spark 
> to do this type of serialization asynchronously, so that applications can work 
> around them until the fix is provided. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-10-04 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192135#comment-16192135
 ] 

Michael N edited comment on SPARK-21999 at 10/4/17 10:13 PM:
-

The posted questions were intended to show why this is a design issue. You have 
not been able to provide the answers. So before you respond further and claim 
otherwise, please re-read the posted questions and provide the answers to those 
questions first.

1. In the first place, why does Spark serialize the application objects 
***asynchronously*** while the streaming application is running continuously 
from batch to batch?

2. If Spark needs to do this type of serialization at all, why does it not do it 
at the end of the batch ***synchronously***?

Here is an analogy to give more guidance as to why this is a design flaw. The 
older Spark map framework has a major design flaw, where it makes a function 
call for every single object. Its code implementation matched its design, 
but its design flaw is that it has massive overhead when there are millions or 
billions of objects. On the other hand, the newer flatMap framework makes one 
function call for a list of objects via the Iterator.



was (Author: michaeln_apache):
The posted questions were intended to show why this is a design issue. You have 
not been able to provide the answers. So before you respond further and claim 
otherwise, please re-read the posted questions and provide the answers to those 
questions first.

1. In the first place, why does Spark serialize the application objects 
asynchronously while the streaming application is running continuously from 
batch to batch ?

2. If Spark needs to do this type of serialization at all, why does it not do 
at the end of the batch ?

Here is the analogy to give more guidance as to why this is a design flaw. The 
older Spark's map framework has a major design flaw, where it makes a function 
call for every single object. its code implementation matched with its design, 
but its design flaw is that it has massive overhead when there are millions and 
billions of objects. On the other hand, the newer flatMap framework make one 
function call for a list of objects via the Iterator.


> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.app

[jira] [Issue Comment Deleted] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-10-04 Thread Michael N (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael N updated SPARK-21999:
--
Comment: was deleted

(was: The posted questions were intended to show why this is a design issue. 
You have not been able to provide the answers. So before you respond further 
and claim otherwise, please re-read the posted questions and provide the 
answers to those questions first.

1. In the first place, why does Spark serialize the application objects 
asynchronously while the streaming application is running continuously from 
batch to batch ?

2. If Spark needs to do this type of serialization at all, why does it not do 
at the end of the batch ?

Here is the analogy to give more guidance as to why this is a design flaw.  The 
older Spark's map framework has a major design flaw, where it makes a function 
call for every single object. its code implementation matched with its design, 
but its design flaw is that it has massive overhead when there are millions and 
billions of objects. On the other hand, the newer flatMap framework make one 
function call for a list of objects via the Iterator. 

)

> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Op

[jira] [Commented] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-10-04 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192131#comment-16192131
 ] 

Michael N commented on SPARK-21999:
---

The posted questions were intended to show why this is a design issue. You have 
not been able to provide the answers. So before you respond further and claim 
otherwise, please re-read the posted questions and provide the answers to those 
questions first.

1. In the first place, why does Spark serialize the application objects 
asynchronously while the streaming application is running continuously from 
batch to batch?

2. If Spark needs to do this type of serialization at all, why does it not do it 
at the end of the batch?

Here is an analogy to give more guidance as to why this is a design flaw.  The 
older Spark map framework has a major design flaw, where it makes a function 
call for every single object. Its code implementation matched its design, 
but its design flaw is that it has massive overhead when there are millions or 
billions of objects. On the other hand, the newer flatMap framework makes one 
function call for a list of objects via the Iterator. 



> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   

[jira] [Reopened] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-10-04 Thread Michael N (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael N reopened SPARK-21999:
---

The posted questions were intended to show why this is a design issue. You have 
not been able to provide the answers. So before you respond further and claim 
otherwise, please re-read the posted questions and provide the answers to those 
questions first.

1. In the first place, why does Spark serialize the application objects 
asynchronously while the streaming application is running continuously from 
batch to batch?

2. If Spark needs to do this type of serialization at all, why does it not do it 
at the end of the batch?

Here is an analogy to give more guidance as to why this is a design flaw. The 
older Spark map framework has a major design flaw, where it makes a function 
call for every single object. Its code implementation matched its design, 
but its design flaw is that it has massive overhead when there are millions or 
billions of objects. On the other hand, the newer flatMap framework makes one 
function call for a list of objects via the Iterator.


> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org

[jira] [Comment Edited] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-04 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192110#comment-16192110
 ] 

Michael N edited comment on SPARK-22163 at 10/4/17 10:02 PM:
-

Please distinguish between code bugs and design flaws. That is why this ticket 
is separate from the other ticket.

The analogy is the design flaw in Spark's older map framework, where it makes a 
function call for every single object. Its code implementation is OK, but its 
design flaw is that it has massive overhead when there are millions or billions 
of objects. On the other hand, the newer flatMap framework makes one function 
call for a list of objects via the Iterator.

Here are the questions that provide the insight as to why this issue is a 
design flaw of Spark's framework trying to serialize the application objects of 
a Streaming application that runs continuously. Please make sure you understand 
the difference between code bugs and design flaws first, and provide the 
answers to the questions below and resolve them, before responding further, 
instead of arbitrarily closing this ticket.

1. In the first place, why does Spark serialize the application objects 
asynchronously while the streaming application is running continuously from 
batch to batch?

2. If Spark needs to do this type of serialization at all, why does it not do 
it at the end of the batch?



was (Author: michaeln_apache):
Please distinguish between code bugs and design flaws. That is why this ticket 
is separate from the other ticket.

The analogy is the design flaw in Spark's older map framework, where it makes a 
function call for every single object. Its code implementation is OK, but its 
design flaw is that it has massive overhead when there are millions or billions 
of objects. On the other hand, the newer flatMap framework makes one function 
call for a list of objects via the Iterator.

Here are the questions that provide the insight as to why this issue is a 
design flaw of Spark's framework trying to serialize the application objects of 
a Streaming application that runs continuously. Until you can provide the 
answers to the questions and resolve them, please do not close this ticket.

1. In the first place, why does Spark serialize the application objects 
asynchronously while the streaming application is running continuously from 
batch to batch?

2. If Spark needs to do this type of serialization at all, why does it not do 
it at the end of the batch?


> Design Issue of Spark Streaming that Causes Random Run-time Exception
> -
>
> Key: SPARK-22163
> URL: https://issues.apache.org/jira/browse/SPARK-22163
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Structured Streaming
>Affects Versions: 2.2.0
> Environment: Spark Streaming
> Kafka
> Linux
>Reporter: Michael N
>Priority: Critical
>
> The application objects can contain Lists and can be modified dynamically as 
> well. However, the Spark Streaming framework asynchronously serializes the 
> application's objects as the application runs. Therefore, it causes a random 
> run-time exception on a List when the Spark Streaming framework happens to 
> serialize the application's objects while the application is modifying a List 
> in its own object. 
> In fact, there are multiple bugs reported about
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject
> that are permutations of the same root cause. So the design issue of the Spark 
> Streaming framework is that it does this serialization asynchronously. 
> Instead, it should either
> 1. do this serialization synchronously. This is preferred to eliminate the 
> issue completely. Or
> 2. allow it to be configured per application whether to do this serialization 
> synchronously or asynchronously, depending on the nature of each application.
> Also, the Spark documentation should describe the conditions that trigger 
> Spark to do this type of serialization asynchronously, so applications can 
> work around them until the fix is provided. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-04 Thread Michael N (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael N reopened SPARK-22163:
---

Please distinguish between code bugs and design flaws. That is why this ticket 
is separate from the other ticket.

The analogy is the design flaw in Spark's older map framework, where it makes a 
function call for every single object. Its code implementation is OK, but its 
design flaw is that it has massive overhead when there are millions or billions 
of objects. On the other hand, the newer flatMap framework makes one function 
call for a list of objects via the Iterator.

Here are the questions that provide the insight as to why this issue is a 
design flaw of Spark's framework trying to serialize the application objects of 
a Streaming application that runs continuously. Until you can provide the 
answers to the questions and resolve them, please do not close this ticket.

1. In the first place, why does Spark serialize the application objects 
asynchronously while the streaming application is running continuously from 
batch to batch?

2. If Spark needs to do this type of serialization at all, why does it not do 
it at the end of the batch?
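
Independently of how the design question is resolved, a common application-side 
workaround for the ConcurrentModificationException described in this ticket is 
to never hand Spark a mutable collection that the driver keeps modifying, but a 
defensive, immutable snapshot instead. A minimal sketch (names are illustrative 
only and not from this ticket; dstream is assumed to be an existing 
DStream[String]):

{code}
import java.util.concurrent.CopyOnWriteArrayList
import scala.collection.JavaConverters._
import org.apache.spark.streaming.dstream.DStream

// Mutable, driver-side state the application keeps updating between batches.
val liveRules = new CopyOnWriteArrayList[String]()
liveRules.add("rule-1")

def wire(dstream: DStream[String]): Unit = {
  dstream.foreachRDD { rdd =>
    // Take an immutable snapshot once per batch; only the snapshot is captured
    // by the closure, so later mutations of liveRules cannot race with
    // serialization of the job's closures.
    val rulesSnapshot: List[String] = liveRules.asScala.toList
    rdd.filter(record => rulesSnapshot.exists(rule => record.contains(rule))).count()
  }
}
{code}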


> Design Issue of Spark Streaming that Causes Random Run-time Exception
> -
>
> Key: SPARK-22163
> URL: https://issues.apache.org/jira/browse/SPARK-22163
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Structured Streaming
>Affects Versions: 2.2.0
> Environment: Spark Streaming
> Kafka
> Linux
>Reporter: Michael N
>Priority: Critical
>
> The application objects can contain Lists and can be modified dynamically as 
> well. However, the Spark Streaming framework asynchronously serializes the 
> application's objects as the application runs. Therefore, it causes a random 
> run-time exception on a List when the Spark Streaming framework happens to 
> serialize the application's objects while the application is modifying a List 
> in its own object. 
> In fact, there are multiple bugs reported about
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject
> that are permutations of the same root cause. So the design issue of the Spark 
> Streaming framework is that it does this serialization asynchronously. 
> Instead, it should either
> 1. do this serialization synchronously. This is preferred to eliminate the 
> issue completely. Or
> 2. allow it to be configured per application whether to do this serialization 
> synchronously or asynchronously, depending on the nature of each application.
> Also, the Spark documentation should describe the conditions that trigger 
> Spark to do this type of serialization asynchronously, so applications can 
> work around them until the fix is provided. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22203) Add job description for file listing Spark jobs

2017-10-04 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-22203:


 Summary: Add job description for file listing Spark jobs
 Key: SPARK-22203
 URL: https://issues.apache.org/jira/browse/SPARK-22203
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


The user may be confused about some 1-task jobs. We can add a job 
description to these jobs so that the user can figure out what they are.
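
For context, SparkContext.setJobDescription is the existing hook for this kind 
of labelling; a minimal sketch of how an internal file-listing job could be 
labelled (illustrative only, not the actual patch):

{code}
val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()
val sc = spark.sparkContext

// Jobs triggered internally can set a description so the UI shows more than
// an anonymous 1-task job.
sc.setJobDescription("Listing leaf files and directories for query planning")
try {
  // ... the internal file-listing Spark job would run here ...
} finally {
  sc.setJobDescription(null) // clear it so later jobs are not mislabelled
}
{code}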





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22201) Dataframe describe includes string columns

2017-10-04 Thread cold gin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191973#comment-16191973
 ] 

cold gin commented on SPARK-22201:
--

Ok, I see what you mean - it is Dataset, not Dataframe. Thank you for pointing 
out the Scala doc. But I still don't see that it makes sense that they are 
co-mingled (that the describe() API selects *both* numeric and string columns). 
It seems like the numeric columns should be selected exclusively for the 
statistical values returned. I know that there are things like counts that are 
beneficial for strings, but "mean" and "stddev" are not coherent for strings, 
so this is my point about separating the APIs, or perhaps changing the default 
no-arg behavior and adding strings only if requested. Also, there does not seem 
to be a straightforward API to filter out just numeric or string *type* columns 
(i.e., filter by column type), which would make things a lot easier. I am 
having to use drop() for the string columns, but this is messy imo.
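
For what it's worth, a minimal sketch of selecting only the numeric columns 
before calling describe(), using dtypes; df is a stand-in for any DataFrame:

{code}
import org.apache.spark.sql.functions.col

// Keep only numeric columns, then describe() just those.
val numericTypes = Set("ByteType", "ShortType", "IntegerType", "LongType",
  "FloatType", "DoubleType")
val numericCols = df.dtypes.collect {
  case (name, dtype) if numericTypes.contains(dtype) || dtype.startsWith("DecimalType") => name
}
df.select(numericCols.map(col): _*).describe().show()
{code}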
 

> Dataframe describe includes string columns
> --
>
> Key: SPARK-22201
> URL: https://issues.apache.org/jira/browse/SPARK-22201
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: cold gin
>
> As per the api documentation, the default no-arg Dataframe describe() 
> function should only include numerical column types, but it is including 
> String types as well. This creates unusable statistical results (for example, 
> max returns "V8903" for one of the string columns in my dataset), and this 
> also leads to stacktraces when you run show() on the resulting dataframe 
> returned from describe().
> There also appear to be several issues related to this:
> https://issues.apache.org/jira/browse/SPARK-16468
> https://issues.apache.org/jira/browse/SPARK-16429
> But SPARK-16429 does not match what the default API documentation says; only 
> Int, Double, etc. (numeric) columns should be included when generating the 
> statistics. 
> Perhaps this reveals the need for a new function to produce stats that make 
> sense only for string columns, or else an additional parameter to describe() 
> to filter in/out certain column types? 
> In summary, the *default* describe api behavior (no arg behavior) should not 
> include string columns. Note that boolean columns are correctly excluded by 
> describe()



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15689) Data source API v2

2017-10-04 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-15689:
---

Assignee: Wenchen Fan

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
>  Labels: SPIP, releasenotes
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers from. The current data source API has a wide surface with a dependency 
> on DataFrame/SQLContext, making data source API compatibility depend on the 
> upper-level API. The current data source API is also only row-oriented and has 
> to go through an expensive conversion from external data types to internal 
> data types.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18580) Use spark.streaming.backpressure.initialRate in DirectKafkaInputDStream

2017-10-04 Thread Oleksandr Konopko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191776#comment-16191776
 ] 

Oleksandr Konopko commented on SPARK-18580:
---

https://github.com/apache/spark/pull/19431

This is a more elegant version. I have also added the fix for both Kafka 0.10 and 0.8.
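
For context, a minimal sketch of the configuration involved (values are 
illustrative only):

{code}
// The proposal here is that, with backpressure enabled, the direct Kafka stream
// should honor spark.streaming.backpressure.initialRate for the first batch
// instead of falling back to spark.streaming.kafka.maxRatePerPartition.
val conf = new org.apache.spark.SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.backpressure.initialRate", "1000")   // initial rate at startup
  .set("spark.streaming.kafka.maxRatePerPartition", "10000") // hard cap per partition
{code}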

> Use spark.streaming.backpressure.initialRate in DirectKafkaInputDStream
> ---
>
> Key: SPARK-18580
> URL: https://issues.apache.org/jira/browse/SPARK-18580
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: Oleg Muravskiy
>
> Currently the `spark.streaming.kafka.maxRatePerPartition` is used as the 
> initial rate when backpressure is enabled. This is too demanding for the 
> application while it is still warming up.
> This is similar to SPARK-11627, applying the solution provided there to 
> DirectKafkaInputDStream.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18580) Use spark.streaming.backpressure.initialRate in DirectKafkaInputDStream

2017-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191767#comment-16191767
 ] 

Apache Spark commented on SPARK-18580:
--

User 'akonopko' has created a pull request for this issue:
https://github.com/apache/spark/pull/19431

> Use spark.streaming.backpressure.initialRate in DirectKafkaInputDStream
> ---
>
> Key: SPARK-18580
> URL: https://issues.apache.org/jira/browse/SPARK-18580
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: Oleg Muravskiy
>
> Currently the `spark.streaming.kafka.maxRatePerPartition` is used as the 
> initial rate when backpressure is enabled. This is too demanding for the 
> application while it is still warming up.
> This is similar to SPARK-11627, applying the solution provided there to 
> DirectKafkaInputDStream.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22202) Release tgz content differences for python and R

2017-10-04 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22202:
-
Description: 
As a follow-up to SPARK-22167: currently we are running different 
profiles/steps in make-release.sh for hadoop2.7 vs hadoop2.6 (and others). We 
should consider whether these differences are significant and whether they 
should be addressed.

[will add more info on this soon]

> Release tgz content differences for python and R
> 
>
> Key: SPARK-22202
> URL: https://issues.apache.org/jira/browse/SPARK-22202
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SparkR
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>
> As a follow-up to SPARK-22167: currently we are running different 
> profiles/steps in make-release.sh for hadoop2.7 vs hadoop2.6 (and others). We 
> should consider whether these differences are significant and whether they 
> should be addressed.
> [will add more info on this soon]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22202) Release tgz content differences for python and R

2017-10-04 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-22202:


 Summary: Release tgz content differences for python and R
 Key: SPARK-22202
 URL: https://issues.apache.org/jira/browse/SPARK-22202
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SparkR
Affects Versions: 2.1.2, 2.2.1, 2.3.0
Reporter: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22187) Update unsaferow format for saved state such that we can set timeouts when state is null

2017-10-04 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-22187:
-
Labels: release-notes releasenotes  (was: release-notes)

> Update unsaferow format for saved state such that we can set timeouts when 
> state is null
> 
>
> Key: SPARK-22187
> URL: https://issues.apache.org/jira/browse/SPARK-22187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>  Labels: release-notes, releasenotes
> Fix For: 2.2.0
>
>
> Currently the group state of a user-defined type is encoded as top-level 
> columns in the unsafe rows stored in the state store. The timeout timestamp 
> is also saved (when needed) as the last top-level column. Since the 
> groupState is serialized to top-level columns, you cannot save "null" as the 
> value of the state (setting null in all the top-level columns is not 
> equivalent). So we don't let the user set the timeout without initializing 
> the state for a key. Based on user experience, this leads to confusion. 
> This JIRA is to change the row format such that the state is saved as nested 
> columns. This would allow the state to be set to null, and avoid these 
> confusing corner cases.
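
For context, a minimal sketch of the mapGroupsWithState pattern that runs into 
this corner case; Event, SessionCount, and the timeout value are illustrative 
only, not from this ticket:

{code}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

// Illustrative types only.
case class Event(userId: String, value: Long)
case class SessionCount(count: Long)

def updateSession(
    userId: String,
    events: Iterator[Event],
    state: GroupState[SessionCount]): (String, Long) = {
  if (state.hasTimedOut) {
    state.remove()
    (userId, -1L)
  } else {
    // The corner case described above: today the state must be update()d before
    // setTimeoutDuration() may be called, because a null state cannot be stored.
    state.update(SessionCount(state.getOption.map(_.count).getOrElse(0L) + events.size))
    state.setTimeoutDuration("30 seconds")
    (userId, state.get.count)
  }
}

// Used as:
//   events.groupByKey(_.userId)
//     .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(updateSession _)
{code}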



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22187) Update unsaferow format for saved state such that we can set timeouts when state is null

2017-10-04 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-22187:
-
Labels: release-notes  (was: )

> Update unsaferow format for saved state such that we can set timeouts when 
> state is null
> 
>
> Key: SPARK-22187
> URL: https://issues.apache.org/jira/browse/SPARK-22187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>  Labels: release-notes
> Fix For: 2.2.0
>
>
> Currently the group state of a user-defined type is encoded as top-level 
> columns in the unsafe rows stored in the state store. The timeout timestamp 
> is also saved (when needed) as the last top-level column. Since the 
> groupState is serialized to top-level columns, you cannot save "null" as the 
> value of the state (setting null in all the top-level columns is not 
> equivalent). So we don't let the user set the timeout without initializing 
> the state for a key. Based on user experience, this leads to confusion. 
> This JIRA is to change the row format such that the state is saved as nested 
> columns. This would allow the state to be set to null, and avoid these 
> confusing corner cases.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21871) Check actual bytecode size when compiling generated code

2017-10-04 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21871:

Priority: Critical  (was: Minor)

> Check actual bytecode size when compiling generated code
> 
>
> Key: SPARK-21871
> URL: https://issues.apache.org/jira/browse/SPARK-21871
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Critical
> Fix For: 2.3.0
>
>
> In SPARK-21603, we added code to give up code compilation and use interpreter 
> execution in SparkPlan if the number of lines in generated functions goes over 
> maxLinesPerFunction. But, we already have code to collect metrics for 
> compiled bytecode size in `CodeGenerator` object. So, I think we could easily 
> reuse the code for this purpose.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21871) Check actual bytecode size when compiling generated code

2017-10-04 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21871.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Check actual bytecode size when compiling generated code
> 
>
> Key: SPARK-21871
> URL: https://issues.apache.org/jira/browse/SPARK-21871
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.3.0
>
>
> In SPARK-21603, we added code to give up code compilation and use interpreter 
> execution in SparkPlan if the number of lines in generated functions goes over 
> maxLinesPerFunction. But, we already have code to collect metrics for 
> compiled bytecode size in `CodeGenerator` object. So, I think we could easily 
> reuse the code for this purpose.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21871) Check actual bytecode size when compiling generated code

2017-10-04 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-21871:
---

Assignee: Takeshi Yamamuro

> Check actual bytecode size when compiling generated code
> 
>
> Key: SPARK-21871
> URL: https://issues.apache.org/jira/browse/SPARK-21871
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.3.0
>
>
> In SPARK-21603, we added code to give up code compilation and use interpreter 
> execution in SparkPlan if the number of lines in generated functions goes over 
> maxLinesPerFunction. But, we already have code to collect metrics for 
> compiled bytecode size in `CodeGenerator` object. So, I think we could easily 
> reuse the code for this purpose.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22192) An RDD of nested POJO objects cannot be converted into a DataFrame using SQLContext.createDataFrame API

2017-10-04 Thread Asif Hussain Shahid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191526#comment-16191526
 ] 

Asif Hussain Shahid commented on SPARK-22192:
-

The PR opened for this bug is [https://github.com/SnappyDataInc/spark/pull/83]

> An RDD of nested POJO objects cannot be converted into a DataFrame using 
> SQLContext.createDataFrame API
> ---
>
> Key: SPARK-22192
> URL: https://issues.apache.org/jira/browse/SPARK-22192
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Independent of OS / platform
>Reporter: Asif Hussain Shahid
>Priority: Minor
>
> If an RDD contains nested POJO objects, then the SQLContext.createDataFrame(RDD, 
> Class) API only handles the top-level POJO object. It throws a scala.MatchError 
> when handling a nested POJO object, as the code does not recursively handle the 
> nested POJOs.
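
A minimal sketch that reproduces the shape of the problem; the bean classes and 
data are illustrative only, not from the reporter's code:

{code}
import scala.beans.BeanProperty
import org.apache.spark.sql.SparkSession

// Illustrative nested Java-bean-style classes.
class Address(@BeanProperty var city: String, @BeanProperty var zip: String)
  extends Serializable {
  def this() = this(null, null)
}

class Person(@BeanProperty var name: String, @BeanProperty var address: Address)
  extends Serializable {
  def this() = this(null, null)
}

val spark = SparkSession.builder().getOrCreate()
val rdd = spark.sparkContext.parallelize(
  Seq(new Person("alice", new Address("Oslo", "0150"))))

// Reported to fail with scala.MatchError, because the nested Address bean is
// not handled recursively by the bean-based schema inference.
val df = spark.createDataFrame(rdd, classOf[Person])
df.printSchema()
{code}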



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22169) support byte length literal as identifier

2017-10-04 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22169:

Summary: support byte length literal as identifier  (was: table name with 
numbers and characters should be able to be parsed)

> support byte length literal as identifier
> -
>
> Key: SPARK-22169
> URL: https://issues.apache.org/jira/browse/SPARK-22169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22201) Dataframe describe includes string columns

2017-10-04 Thread cold gin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cold gin updated SPARK-22201:
-
Description: 
As per the api documentation, the default no-arg Dataframe describe() function 
should only include numerical column types, but it is including String types as 
well. This creates unusable statistical results (for example, max returns 
"V8903" for one of the string columns in my dataset), and this also leads to 
stacktraces when you run show() on the resulting dataframe returned from 
describe().

There also appear to be several issues related to this:

https://issues.apache.org/jira/browse/SPARK-16468

https://issues.apache.org/jira/browse/SPARK-16429

But SPARK-16429 does not match what the default API documentation says; only 
Int, Double, etc. (numeric) columns should be included when generating the 
statistics. 

Perhaps this reveals the need for a new function to produce stats that make 
sense only for string columns, or else an additional parameter to describe() to 
filter in/out certain column types? 

In summary, the *default* describe api behavior (no arg behavior) should not 
include string columns. Note that boolean columns are correctly excluded by 
describe()

  was:
As per the api documentation, the default no-arg Dataframe describe() function 
should only include numerical column types, but it is including String types as 
well. This creates unusable statistical results (for example, max returns 
"V8903" for one of the string columns in my dataset).

There also appear to be several issues related to this:

https://issues.apache.org/jira/browse/SPARK-16468

https://issues.apache.org/jira/browse/SPARK-16429

But SPARK-16429 does not match what the default API documentation says; only 
Int, Double, etc. (numeric) columns should be included when generating the 
statistics. 

Perhaps this reveals the need for a new function to produce stats that make 
sense only for string columns, or else an additional parameter to describe() to 
filter in/out certain column types? 

In summary, the *default* describe api behavior (no arg behavior) should not 
include string columns.


> Dataframe describe includes string columns
> --
>
> Key: SPARK-22201
> URL: https://issues.apache.org/jira/browse/SPARK-22201
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: cold gin
>
> As per the api documentation, the default no-arg Dataframe describe() 
> function should only include numerical column types, but it is including 
> String types as well. This creates unusable statistical results (for example, 
> max returns "V8903" for one of the string columns in my dataset), and this 
> also leads to stacktraces when you run show() on the resulting dataframe 
> returned from describe().
> There also appear to be several issues related to this:
> https://issues.apache.org/jira/browse/SPARK-16468
> https://issues.apache.org/jira/browse/SPARK-16429
> But SPARK-16429 does not match what the default API documentation says; only 
> Int, Double, etc. (numeric) columns should be included when generating the 
> statistics. 
> Perhaps this reveals the need for a new function to produce stats that make 
> sense only for string columns, or else an additional parameter to describe() 
> to filter in/out certain column types? 
> In summary, the *default* describe api behavior (no arg behavior) should not 
> include string columns. Note that boolean columns are correctly excluded by 
> describe()



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22201) Dataframe describe includes string columns

2017-10-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191461#comment-16191461
 ] 

Sean Owen commented on SPARK-22201:
---

This is just a duplicate of SPARK-16468.
What API are you referring to? The scaladoc says it computes stats for numeric 
and string types.
It's supposed to return stats on strings. They have an ordering, so max/min is 
coherent.

> Dataframe describe includes string columns
> --
>
> Key: SPARK-22201
> URL: https://issues.apache.org/jira/browse/SPARK-22201
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: cold gin
>
> As per the api documentation, the default no-arg Dataframe describe() 
> function should only include numerical column types, but it is including 
> String types as well. This creates unusable statistical results (for example, 
> max returns "V8903" for one of the string columns in my dataset).
> There also appear to be several issues related to this:
> https://issues.apache.org/jira/browse/SPARK-16468
> https://issues.apache.org/jira/browse/SPARK-16429
> But SPARK-16429 does not match what the default API documentation says; only 
> Int, Double, etc. (numeric) columns should be included when generating the 
> statistics. 
> Perhaps this reveals the need for a new function to produce stats that make 
> sense only for string columns, or else an additional parameter to describe() 
> to filter in/out certain column types? 
> In summary, the *default* describe api behavior (no arg behavior) should not 
> include string columns.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22201) Dataframe describe includes string columns

2017-10-04 Thread cold gin (JIRA)
cold gin created SPARK-22201:


 Summary: Dataframe describe includes string columns
 Key: SPARK-22201
 URL: https://issues.apache.org/jira/browse/SPARK-22201
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: cold gin


As per the api documentation, the default no-arg Dataframe describe() function 
should only include numerical column types, but it is including String types as 
well. This creates unusable statistical results (for example, max returns 
"V8903" for one of the string columns in my dataset).

There also appear to be several issues related to this:

https://issues.apache.org/jira/browse/SPARK-16468

https://issues.apache.org/jira/browse/SPARK-16429

But SPARK-16429 does not match what the default API documentation says; only 
Int, Double, etc. (numeric) columns should be included when generating the 
statistics. 

Perhaps this reveals the need for a new function to produce stats that make 
sense only for string columns, or else an additional parameter to describe() to 
filter in/out certain column types? 

In summary, the *default* describe api behavior (no arg behavior) should not 
include string columns.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10884) Support prediction on single instance for regression and classification related models

2017-10-04 Thread Derek Kaknes (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191424#comment-16191424
 ] 

Derek Kaknes commented on SPARK-10884:
--

[~WeichenXu123] Thanks for picking this up, my team is very much looking 
forward to this PR getting pulled in.  Is there anything we can do to help move 
the ball forward? Thx. Derek

> Support prediction on single instance for regression and classification 
> related models
> --
>
> Key: SPARK-10884
> URL: https://issues.apache.org/jira/browse/SPARK-10884
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>  Labels: 2.2.0
>
> Support prediction on single instance for regression and classification 
> related models (i.e., PredictionModel, ClassificationModel and their sub 
> classes). 
> Add corresponding test cases.
> See parent issue for more details.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22200) Kinesis Receivers stops if Kinesis stream was re-sharded

2017-10-04 Thread Alex Mikhailau (JIRA)
Alex Mikhailau created SPARK-22200:
--

 Summary: Kinesis Receivers stops if Kinesis stream was re-sharded
 Key: SPARK-22200
 URL: https://issues.apache.org/jira/browse/SPARK-22200
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Alex Mikhailau
Priority: Critical


After a Kinesis stream re-sharding, we are seeing

Cannot find the shard given the shardId shardId-4454

Cannot get the shard for this ProcessTask, so duplicate KPL user records in the 
event of resharding will not be dropped during deaggregation of Amazon Kinesis 
records.

and the receivers stop working altogether.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20557) JdbcUtils doesn't support java.sql.Types.TIMESTAMP_WITH_TIMEZONE

2017-10-04 Thread Dan Stine (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191255#comment-16191255
 ] 

Dan Stine commented on SPARK-20557:
---

Got it. Thank you, Sean.

> JdbcUtils doesn't support java.sql.Types.TIMESTAMP_WITH_TIMEZONE
> 
>
> Key: SPARK-20557
> URL: https://issues.apache.org/jira/browse/SPARK-20557
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.3.0
>Reporter: Jannik Arndt
>Assignee: Xiao Li
>  Labels: easyfix, jdbc, oracle, sql, timestamp
> Fix For: 2.3.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Reading from an Oracle DB table with a column of type TIMESTAMP WITH TIME 
> ZONE via jdbc ({{spark.sqlContext.read.format("jdbc").option(...).load()}}) 
> results in an error:
> {{Unsupported type -101}}
> {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:209)}}
> {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$5.apply(JdbcUtils.scala:246)}}
> {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$5.apply(JdbcUtils.scala:246)}}
> That is because the type 
> {{[java.sql.Types.TIMESTAMP_WITH_TIMEZONE|https://docs.oracle.com/javase/8/docs/api/java/sql/Types.html#TIMESTAMP_WITH_TIMEZONE]}}
>  (in Java since 1.8) is missing in 
> {{[JdbcUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L225]}}
>  
> This is similar to SPARK-7039.
> I created a pull request with a fix.
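
Until a fix is merged, one possible application-side workaround is a registered 
custom JdbcDialect that maps the vendor type code itself; a minimal sketch 
(assuming the column is reported with type code -101, as in this report, and 
not the fix in the pull request):

{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, MetadataBuilder, TimestampType}

// Map Oracle's TIMESTAMP WITH TIME ZONE (reported here as type code -101, and
// as java.sql.Types.TIMESTAMP_WITH_TIMEZONE on JDBC 4.2 drivers) to TimestampType.
object OracleTimestampTzDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    if (sqlType == -101 || sqlType == Types.TIMESTAMP_WITH_TIMEZONE) Some(TimestampType)
    else None
  }
}

JdbcDialects.registerDialect(OracleTimestampTzDialect)
{code}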



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20055) Documentation for CSV datasets in SQL programming guide

2017-10-04 Thread Jorge Machado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191232#comment-16191232
 ] 

Jorge Machado commented on SPARK-20055:
---

[~hyukjin.kwon] just added some docs. 

> Documentation for CSV datasets in SQL programming guide
> ---
>
> Key: SPARK-20055
> URL: https://issues.apache.org/jira/browse/SPARK-20055
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> I guess things commonly used and important are documented there rather than 
> documenting everything and every option in the programming guide - 
> http://spark.apache.org/docs/latest/sql-programming-guide.html.
> It seems JSON datasets 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets 
> are documented whereas CSV datasets are not. 
> Nowadays, they are pretty similar in APIs and options. Some options are 
> notable for both; in particular, ones such as {{wholeFile}}. Moreover, 
> several options such as {{inferSchema}} and {{header}} are important in CSV 
> because they affect the types and column names of the data.
> In that sense, I think we might better document CSV datasets with some 
> examples too, because I believe reading CSV is a pretty common use case.
> Also, I think we could also leave some pointers for options of API 
> documentations for both (rather than duplicating the documentation).
> So, my suggestion is,
> - Add CSV Datasets section.
> - Add links for options for both JSON and CSV that point each API 
> documentation
> - Fix trivial minor fixes together in both sections.
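
A minimal example of the kind of snippet such a CSV Datasets section could 
show; the path and option values are illustrative only:

{code}
val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()

// Read a CSV file with a header row and schema inference.
val people = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("examples/src/main/resources/people.csv")

people.printSchema()
people.show()

// Write it back out as CSV with a header.
people.write.option("header", "true").csv("output/people_csv")
{code}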



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20055) Documentation for CSV datasets in SQL programming guide

2017-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191230#comment-16191230
 ] 

Apache Spark commented on SPARK-20055:
--

User 'jomach' has created a pull request for this issue:
https://github.com/apache/spark/pull/19429

> Documentation for CSV datasets in SQL programming guide
> ---
>
> Key: SPARK-20055
> URL: https://issues.apache.org/jira/browse/SPARK-20055
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> I guess things commonly used and important are documented there rather than 
> documenting everything and every option in the programming guide - 
> http://spark.apache.org/docs/latest/sql-programming-guide.html.
> It seems JSON datasets 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets 
> are documented whereas CSV datasets are not. 
> Nowadays, they are pretty similar in APIs and options. Some options are 
> notable for both; in particular, ones such as {{wholeFile}}. Moreover, 
> several options such as {{inferSchema}} and {{header}} are important in CSV 
> because they affect the types and column names of the data.
> In that sense, I think we might better document CSV datasets with some 
> examples too, because I believe reading CSV is a pretty common use case.
> Also, I think we could also leave some pointers for options of API 
> documentations for both (rather than duplicating the documentation).
> So, my suggestion is,
> - Add CSV Datasets section.
> - Add links for options for both JSON and CSV that point each API 
> documentation
> - Fix trivial minor fixes together in both sections.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20055) Documentation for CSV datasets in SQL programming guide

2017-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20055:


Assignee: (was: Apache Spark)

> Documentation for CSV datasets in SQL programming guide
> ---
>
> Key: SPARK-20055
> URL: https://issues.apache.org/jira/browse/SPARK-20055
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> I guess things commonly used and important are documented there rather than 
> documenting everything and every option in the programming guide - 
> http://spark.apache.org/docs/latest/sql-programming-guide.html.
> It seems JSON datasets 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets 
> are documented whereas CSV datasets are not. 
> Nowadays, they are pretty similar in APIs and options. Some options are 
> notable for both; in particular, ones such as {{wholeFile}}. Moreover, 
> several options such as {{inferSchema}} and {{header}} are important in CSV 
> because they affect the types and column names of the data.
> In that sense, I think we might better document CSV datasets with some 
> examples too, because I believe reading CSV is a pretty common use case.
> Also, I think we could also leave some pointers for options of API 
> documentations for both (rather than duplicating the documentation).
> So, my suggestion is,
> - Add CSV Datasets section.
> - Add links for options for both JSON and CSV that point each API 
> documentation
> - Fix trivial minor fixes together in both sections.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20055) Documentation for CSV datasets in SQL programming guide

2017-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20055:


Assignee: Apache Spark

> Documentation for CSV datasets in SQL programming guide
> ---
>
> Key: SPARK-20055
> URL: https://issues.apache.org/jira/browse/SPARK-20055
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>
> I guess things commonly used and important are documented there rather than 
> documenting everything and every option in the programming guide - 
> http://spark.apache.org/docs/latest/sql-programming-guide.html.
> It seems JSON datasets 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets 
> are documented whereas CSV datasets are not. 
> Nowadays, they are pretty similar in APIs and options. Some options are 
> notable for both; in particular, ones such as {{wholeFile}}. Moreover, 
> several options such as {{inferSchema}} and {{header}} are important in CSV 
> because they affect the types and column names of the data.
> In that sense, I think we might better document CSV datasets with some 
> examples too, because I believe reading CSV is a pretty common use case.
> Also, I think we could also leave some pointers for options of API 
> documentations for both (rather than duplicating the documentation).
> So, my suggestion is,
> - Add CSV Datasets section.
> - Add links for options for both JSON and CSV that point each API 
> documentation
> - Fix trivial minor fixes together in both sections.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22199) Spark Job on YARN fails with executors "Slave registration failed"

2017-10-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22199:
--
Priority: Minor  (was: Major)

> Spark Job on YARN fails with executors "Slave registration failed"
> --
>
> Key: SPARK-22199
> URL: https://issues.apache.org/jira/browse/SPARK-22199
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.3
>Reporter: Prabhu Joseph
>Priority: Minor
>
> Spark job on YARN failed after the maximum number of executor failures was reached.
> ApplicationMaster logs:
> {code}
> 17/09/28 04:18:27 INFO ApplicationMaster: Unregistering ApplicationMaster 
> with FAILED (diag message: Max number of executor failures (3) reached)
> {code}
> Checking the failed container logs shows "Slave registration failed: 
> Duplicate executor ID", whereas the driver logs show that it has removed those 
> executors because they were idle for spark.dynamicAllocation.executorIdleTimeout.
> Executor Logs:
> {code}
> 17/09/28 04:18:26 ERROR CoarseGrainedExecutorBackend: Slave registration 
> failed: Duplicate executor ID: 122
> {code}
> Driver logs:
> {code}
> 17/09/28 04:18:21 INFO ExecutorAllocationManager: Removing executor 122 
> because it has been idle for 60 seconds (new desired total will be 133)
> {code}
> There are two issues here:
> 1. The error message in the executor, "Slave registration failed: Duplicate 
> executor ID", is misleading, as the actual reason is that the executor was idle.
> 2. The job failed because executors were idle for longer than 
> spark.dynamicAllocation.executorIdleTimeout.
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22199) Spark Job on YARN fails with executors "Slave registration failed"

2017-10-04 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created SPARK-22199:
-

 Summary: Spark Job on YARN fails with executors "Slave 
registration failed"
 Key: SPARK-22199
 URL: https://issues.apache.org/jira/browse/SPARK-22199
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.6.3
Reporter: Prabhu Joseph


Spark job on YARN failed after the maximum number of executor failures was reached.

ApplicationMaster logs:
{code}
17/09/28 04:18:27 INFO ApplicationMaster: Unregistering ApplicationMaster with 
FAILED (diag message: Max number of executor failures (3) reached)
{code}

Checking the failed container logs shows "Slave registration failed: Duplicate 
executor ID", whereas the driver logs show that it has removed those executors 
because they were idle for spark.dynamicAllocation.executorIdleTimeout.

Executor Logs:
{code}
17/09/28 04:18:26 ERROR CoarseGrainedExecutorBackend: Slave registration 
failed: Duplicate executor ID: 122
{code}

Driver logs:
{code}
17/09/28 04:18:21 INFO ExecutorAllocationManager: Removing executor 122 because 
it has been idle for 60 seconds (new desired total will be 133)
{code}

There are two issues here:

1. The error message in the executor, "Slave registration failed: Duplicate 
executor ID", is misleading, as the actual reason is that the executor was idle.
2. The job failed because executors were idle for longer than 
spark.dynamicAllocation.executorIdleTimeout.
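
For reference, a minimal sketch of the dynamic-allocation settings involved in 
this report (values are illustrative only):

{code}
val conf = new org.apache.spark.SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")            // required for dynamic allocation on YARN
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "200")
{code}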
 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22131) Add Mesos Secrets Support to the Mesos Driver

2017-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22131:


Assignee: (was: Apache Spark)

> Add Mesos Secrets Support to the Mesos Driver
> -
>
> Key: SPARK-22131
> URL: https://issues.apache.org/jira/browse/SPARK-22131
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Arthur Rand
>
> We recently added Secrets support to the Dispatcher (SPARK-20812). In order 
> to have Driver-to-Executor TLS we need the same support in the Mesos Driver 
> so a secret can be disseminated to the executors. This JIRA is to move the 
> current secrets implementation to be used by both frameworks. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22131) Add Mesos Secrets Support to the Mesos Driver

2017-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22131:


Assignee: Apache Spark

> Add Mesos Secrets Support to the Mesos Driver
> -
>
> Key: SPARK-22131
> URL: https://issues.apache.org/jira/browse/SPARK-22131
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Arthur Rand
>Assignee: Apache Spark
>
> We recently added Secrets support to the Dispatcher (SPARK-20812). In order 
> to have Driver-to-Executor TLS we need the same support in the Mesos Driver 
> so a secret can be disseminated to the executors. This JIRA is to move the 
> current secrets implementation to be used by both frameworks. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22131) Add Mesos Secrets Support to the Mesos Driver

2017-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191155#comment-16191155
 ] 

Apache Spark commented on SPARK-22131:
--

User 'susanxhuynh' has created a pull request for this issue:
https://github.com/apache/spark/pull/19428

> Add Mesos Secrets Support to the Mesos Driver
> -
>
> Key: SPARK-22131
> URL: https://issues.apache.org/jira/browse/SPARK-22131
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Arthur Rand
>
> We recently added Secrets support to the Dispatcher (SPARK-20812). In order 
> to have Driver-to-Executor TLS we need the same support in the Mesos Driver 
> so a secret can be disseminated to the executors. This JIRA is to move the 
> current secrets implementation to be used by both frameworks. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22195) Add cosine similarity to org.apache.spark.ml.linalg.Vectors

2017-10-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191037#comment-16191037
 ] 

Sean Owen commented on SPARK-22195:
---

When I've heard this, it's been in the context of supporting this similarity 
in other implementations. For example, RowMatrix already does all-pairs cosine 
similarity. And here's an item for supporting it in k-means: 
https://issues.apache.org/jira/browse/SPARK-22119

I'm skeptical it's worth providing this as a stand-alone utility method, but I 
don't feel strongly about it.

> Add cosine similarity to org.apache.spark.ml.linalg.Vectors
> ---
>
> Key: SPARK-22195
> URL: https://issues.apache.org/jira/browse/SPARK-22195
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> https://en.wikipedia.org/wiki/Cosine_similarity:
> Cosine similarity is one of the most important measures of similarity; I have 
> found it quite useful in some image and NLP applications, based on personal 
> experience.
> I suggest adding a function for cosine similarity in 
> org.apache.spark.ml.linalg.Vectors.
> Interface:
>   def cosineSimilarity(v1: Vector, v2: Vector): Double = ...
>   def cosineSimilarity(v1: Vector, v2: Vector, norm1: Double, norm2: Double): 
> Double = ...
> I would appreciate suggestions and need a green light from committers.
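
A minimal sketch of what the proposed helper could look like, built only on the 
existing public Vectors.norm; this is a sketch of the proposal, not a committed 
API:

{code}
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Dot product over dense copies divided by the product of the L2 norms.
def cosineSimilarity(v1: Vector, v2: Vector): Double = {
  require(v1.size == v2.size, "Vectors must have the same dimension")
  val a = v1.toArray
  val b = v2.toArray
  var dot = 0.0
  var i = 0
  while (i < a.length) {
    dot += a(i) * b(i)
    i += 1
  }
  dot / (Vectors.norm(v1, 2.0) * Vectors.norm(v2, 2.0))
}

// Example: cosineSimilarity(Vectors.dense(1.0, 0.0), Vectors.dense(1.0, 1.0)) ~ 0.707
{code}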



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14172) Hive table partition predicate not passed down correctly

2017-10-04 Thread Saktheesh Balaraj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16189481#comment-16189481
 ] 

Saktheesh Balaraj edited comment on SPARK-14172 at 10/4/17 9:21 AM:


A similar problem is observed when joining 2 Hive tables on partition columns 
in Spark.

*Example*
Table A has 1000 partitions (date partition and hour sub-partition) and Table B 
has 2 partitions. When joining the 2 tables on the partition columns, Spark 
does a full table scan of Table A, i.e. it reads all 1000 partitions instead of 
taking the 2 matching partitions from Table A and joining them with Table B.

{noformat}
sqlContext.sql("select * from tableA a, tableB b where 
a.trans_date=b.trans_date and a.trans_hour=b.trans_hour")
{noformat}

(Here trans_date is the partition and trans_hour is the sub-partition on both 
the tables)

*Workaround*
selecting 2 partitions from table B and then do lookup on Table A 
step1: 
{noformat} select trans_date, a.trans_hour from table B {noformat}
step2: 
{noformat} select * from tableA where trans_date= and 
a.trans_hour = {noformat}




was (Author: saktheesh):
Similar problem is observed while joining 2 hive tables based on partition 
columns in Spark

*Example*
Table A having 1000 partitions (date partition and hour sub-partition)  and 
Table B having 2 partition. when joining 2 tables based on partition it's going 
for full table scan in table A i.e using all 1000 partitions instead of taking 
2 partitions from Table A and join with Table B.

{noformat}
sqlContext.sql("select * from tableA a, tableB b where 
a.trans_date=b.trans_date and a.trans_hour=b.trans_hour")
{noformat}

(Here trans_date is the partition and trans_hour is the sub-partition on both 
the tables)

*Workaround*
selecting 2 partitions from table B and then do lookup on Table A 
step1: 
{noformat} select trans_date, a.trans_hour from table B {noformat}
step2: 
{noformat} select * from tableA where trans_date= and 
a.trans_hour = {noformat}
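
A hedged sketch of the workaround described above, written for Spark 1.x 
DataFrames: collect Table B's few partition values first and turn them into a 
literal predicate on Table A, so the optimizer can prune Table A's partitions. The 
table and column names (tableA, tableB, trans_date, trans_hour) are taken from the 
example; the assumption that the partition values are strings and that tableB is 
non-empty is mine.

{code}
// Step 1: collect the small table's partition values (expected to be few).
val bPartitions = sqlContext
  .sql("select distinct trans_date, trans_hour from tableB")
  .collect()

// Step 2: build an explicit partition predicate for tableA from those values.
val pruningPredicate = bPartitions
  .map(r => s"(a.trans_date = '${r.getString(0)}' and a.trans_hour = '${r.getString(1)}')")
  .mkString(" or ")

// Step 3: run the join with the literal predicate so partition pruning can apply.
val joined = sqlContext.sql(
  s"""select * from tableA a, tableB b
     |where a.trans_date = b.trans_date and a.trans_hour = b.trans_hour
     |  and ($pruningPredicate)""".stripMargin)
{code}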



> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Priority: Critical
>
> When the Hive SQL contains nondeterministic expressions, the Spark plan will not 
> push down the partition predicate to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The Spark plan will not push down the partition predicate to the HiveTableScan, 
> which ends up scanning the data of all partitions of the table.
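
A quick way to observe the behavior described in this issue is to compare the 
physical plans with and without the nondeterministic filter. This is a hedged 
sketch assuming a Hive-partitioned table named table_a with partition column 
partition_col, as in the example above.

{code}
import org.apache.spark.sql.functions.rand

// Deterministic predicate only: the partition filter should appear in the HiveTableScan.
val pruned = sqlContext.sql("SELECT * FROM table_a WHERE partition_col = 'some_value'")
pruned.explain(true)

// Add the nondeterministic sampling filter on top: per this report, the
// partition predicate is no longer pushed down and all partitions are scanned.
val sampled = pruned.where(rand() < 0.01)
sampled.explain(true)
{code}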



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12606) Scala/Java compatibility issue Re: how to extend java transformer from Scala UnaryTransformer ?

2017-10-04 Thread Akos Tomasits (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16190904#comment-16190904
 ] 

Akos Tomasits edited comment on SPARK-12606 at 10/4/17 9:02 AM:


We have run into the same issue. We cannot create proper Java transformers 
derived from UnaryTransformer.

We would like to use these custom transformers through CrossValidator, which in 
the end requires a constructor with a String (uid) parameter. I guess the 
custom transformer is supposed to set the provided uid in this constructor; 
however, the object's uid() method is called before the constructor finishes. 
This leads to the above-mentioned "null__inputCol" error.

I have created a new JIRA issue for this problem: SPARK-22198


was (Author: akos.tomasits):
We have run into the same issue. We cannot create proper Java transformers 
derived from UnaryTransformer.

We would like to use these custom transformers through CrossValidator, that in 
the end requires a constructor with a string (uid) parameter. I guess the 
custom transformer is supposed to set the provided uid in this constructor, 
however, the object's uid() method is called before the constructor finishes. 
This leads to the above mentioned "null__inputCol" error.

> Scala/Java compatibility issue Re: how to extend java transformer from Scala 
> UnaryTransformer ?
> ---
>
> Key: SPARK-12606
> URL: https://issues.apache.org/jira/browse/SPARK-12606
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.2
> Environment: Java 8, Mac OS, Spark-1.5.2
>Reporter: Andrew Davidson
>  Labels: transformers
>
> Hi Andy,
> I suspect that you hit the Scala/Java compatibility issue, I can also 
> reproduce this issue, so could you file a JIRA to track this issue?
> Yanbo
> 2016-01-02 3:38 GMT+08:00 Andy Davidson :
> I am trying to write a trivial transformer to use in my pipeline. I am 
> using Java and Spark 1.5.2. It was suggested that I use the Tokenizer.scala 
> class as an example. This should be very easy; however, I do not understand 
> Scala and I am having trouble debugging the following exception.
> Any help would be greatly appreciated.
> Happy New Year
> Andy
> java.lang.IllegalArgumentException: requirement failed: Param null__inputCol 
> does not belong to Stemmer_2f3aa96d-7919-4eaa-ad54-f7c620b92d1c.
>   at scala.Predef$.require(Predef.scala:233)
>   at org.apache.spark.ml.param.Params$class.shouldOwn(params.scala:557)
>   at org.apache.spark.ml.param.Params$class.set(params.scala:436)
>   at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
>   at org.apache.spark.ml.param.Params$class.set(params.scala:422)
>   at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
>   at 
> org.apache.spark.ml.UnaryTransformer.setInputCol(Transformer.scala:83)
>   at com.pws.xxx.ml.StemmerTest.test(StemmerTest.java:30)
> public class StemmerTest extends AbstractSparkTest {
> @Test
> public void test() {
> Stemmer stemmer = new Stemmer()
> .setInputCol("raw") // line 30
> .setOutputCol("filtered");
> }
> }
> /**
>  * @ see 
> spark-1.5.1/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
>  * @ see 
> https://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
>  * @ see 
> http://www.tonytruong.net/movie-rating-prediction-with-apache-spark-and-hortonworks/
>  * 
>  * @author andrewdavidson
>  *
>  */
> public class Stemmer extends UnaryTransformer<List<String>, List<String>, 
> Stemmer> implements Serializable {
> static Logger logger = LoggerFactory.getLogger(Stemmer.class);
> private static final long serialVersionUID = 1L;
> private static final  ArrayType inputType = 
> DataTypes.createArrayType(DataTypes.StringType, true);
> private final String uid = Stemmer.class.getSimpleName() + "_" + 
> UUID.randomUUID().toString();
> @Override
> public String uid() {
> return uid;
> }
> /*
>override protected def validateInputType(inputType: DataType): Unit = {
> require(inputType == StringType, s"Input type must be string type but got 
> $inputType.")
>   }
>  */
> @Override
> public void validateInputType(DataType inputTypeArg) {
> String msg = "inputType must be " + inputType.simpleString() + " but 
> got " + inputTypeArg.simpleString();
> assert (inputType.equals(inputTypeArg)) : msg; 
> }
> 
> @Override
> public Function1<List<String>, List<String>> createTransformFunc() {
> // 
> http://stackoverflow.com/questions/6545066/using-scala-from-java-passing-functions-as-parameters
> Function1, List> f = n

[jira] [Updated] (SPARK-22198) Java incompatibility when extending UnaryTransformer

2017-10-04 Thread Akos Tomasits (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akos Tomasits updated SPARK-22198:
--
Description: 
It is not possible to create proper Java custom Transformer by extending 
UnaryTransformer.

It seems that the method 'uid()' is called during object creation before the 
provided 'uid' constructor parameter could be set.

This leads to the following error:

{quote}
 java.lang.IllegalArgumentException: requirement failed: Param 
_1563950936fa__inputCol does not belong to _d4105b75c4aa.
{quote}

If you extend UnaryTransformer and try to use it e.g. through CrossValidator, 
you will need to explicitly include a constructor, which receives a String 
parameter. As I saw in the source of built in transformers, this parameter is a 
'uid', which should be set in the object. However, it is not possible to do it 
in time, because the uid() method is invoked (and its result might be used) 
before this constructor finishes.

Sample class:

{quote}
public class TextCleaner extends UnaryTransformer
implements Serializable, DefaultParamsWritable, 
DefaultParamsReadable \{

private static final long serialVersionUID = 2658543236303100458L;

private static final String sparkUidPrefix = "TextCleaner";

private final String sparkUid;

public TextCleaner() \{
sparkUid = 
org.apache.spark.ml.util.Identifiable$.MODULE$.randomUID(sparkUidPrefix);
}

public TextCleaner(String uid) \{
 sparkUid = uid;
}

@Override
public String uid() \{ // This method is called by parent class, before 
object creation finishes
  return sparkUid;
}

...
{quote}


  was:
It is not possible to create proper Java custom Transformer by extending 
UnaryTransformer.

It seems that the method 'uid()' is called during object creation before the 
provided 'uid' constructor parameter could be set.

This leads to the following error:

{quote}
 java.lang.IllegalArgumentException: requirement failed: Param 
_1563950936fa__inputCol does not belong to _d4105b75c4aa.
{quote}

If you extend UnaryTransformer and try to use it e.g. through CrossValidator, 
you will need to explicitly include a constructor, which receives a String 
parameter. As I saw in the source of built in transformers, this parameter is a 
'uid', which should be set in the object. However, it is not possible to do it 
in time, because the uid() method is invoked (and its result might be used) 
before this constructor finishes.

Sample class:

{quote}
public class TextCleaner extends UnaryTransformer
implements Serializable, DefaultParamsWritable, 
DefaultParamsReadable \{

private static final long serialVersionUID = 2658543236303100458L;

private static final String sparkUidPrefix = "TextCleaner";

private final String sparkUid;

public TextCleaner() \{
sparkUid = 
org.apache.spark.ml.util.Identifiable$.MODULE$.randomUID(sparkUidPrefix);
}

public TextCleaner(String uid) \{
 sparkUid = uid;
}

@Override
public String uid() \{  
  return sparkUid; // This returns null, before object creation finishes
}

...
{quote}



> Java incompatibility when extending UnaryTransformer
> 
>
> Key: SPARK-22198
> URL: https://issues.apache.org/jira/browse/SPARK-22198
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML
>Affects Versions: 2.2.0
>Reporter: Akos Tomasits
>
> It is not possible to create proper Java custom Transformer by extending 
> UnaryTransformer.
> It seems that the method 'uid()' is called during object creation before the 
> provided 'uid' constructor parameter could be set.
> This leads to the following error:
> {quote}
>  java.lang.IllegalArgumentException: requirement failed: Param 
> _1563950936fa__inputCol does not belong to _d4105b75c4aa.
> {quote}
> If you extend UnaryTransformer and try to use it e.g. through CrossValidator, 
> you will need to explicitly include a constructor, which receives a String 
> parameter. As I saw in the source of built in transformers, this parameter is 
> a 'uid', which should be set in the object. However, it is not possible to do 
> it in time, because the uid() method is invoked (and its result might be 
> used) before this constructor finishes.
> Sample class:
> {quote}
> public class TextCleaner extends UnaryTransformer
> implements Serializable, DefaultParamsWritable, 
> DefaultParamsReadable \{
> private static final long serialVersionUID = 2658543236303100458L;
> 
> private static final String sparkUidPrefix = "TextCleaner";
> 
> private final String sparkUid;
> public TextCleaner() \{
>   sparkUid = 
> org.

[jira] [Updated] (SPARK-22198) Java incompatibility when extending UnaryTransformer

2017-10-04 Thread Akos Tomasits (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akos Tomasits updated SPARK-22198:
--
Description: 
It is not possible to create proper Java custom Transformer by extending 
UnaryTransformer.

It seems that the method 'uid()' is called during object creation before the 
provided 'uid' constructor parameter could be set.

This leads to the following error:

{quote}
 java.lang.IllegalArgumentException: requirement failed: Param 
_1563950936fa__inputCol does not belong to _d4105b75c4aa.
{quote}

If you extend UnaryTransformer and try to use it e.g. through CrossValidator, 
you will need to explicitly include a constructor, which receives a String 
parameter. As I saw in the source of built in transformers, this parameter is a 
'uid', which should be set in the object. However, it is not possible to do it 
in time, because the uid() method is invoked (and its result might be used) 
before this constructor finishes.

Sample class:

{quote}
public class TextCleaner extends UnaryTransformer
implements Serializable, DefaultParamsWritable, 
DefaultParamsReadable \{

private static final long serialVersionUID = 2658543236303100458L;

private static final String sparkUidPrefix = "TextCleaner";

private final String sparkUid;

public TextCleaner() \{
sparkUid = 
org.apache.spark.ml.util.Identifiable$.MODULE$.randomUID(sparkUidPrefix);
}

public TextCleaner(String uid) \{
 sparkUid = uid;
}

@Override
public String uid() \{  
  return sparkUid; // This returns null, before object creation finishes
}

...
{quote}


  was:
It is not possible to create proper Java custom Transformer by extending 
UnaryTransformer.

It seems that the method 'uid()' is called during object creation before the 
provided 'uid' constructor parameter could be set.

This leads to the following error:

{quote}
 java.lang.IllegalArgumentException: requirement failed: Param 
_1563950936fa__inputCol does not belong to _d4105b75c4aa.
{quote}

If you extend UnaryTransformer and try to use it e.g. through CrossValidator, 
you will need to explicitly include a constructor, which receives a String 
parameter. As I saw in the source of built in transformers, this parameter is a 
'uid', which should be set in the object. However, it is not possible to do it 
in time, because the uid() method is invoked (and its result might be used) 
before this constructor finishes.

Sample class:

{quote}
public class TextCleaner extends UnaryTransformer
implements Serializable, DefaultParamsWritable, 
DefaultParamsReadable \{

private static final long serialVersionUID = 2658543236303100458L;

private static final String sparkUidPrefix = "TextCleaner";

private final String sparkUid;

public TextCleaner() \{
sparkUid = 
org.apache.spark.ml.util.Identifiable$.MODULE$.randomUID(sparkUidPrefix);
}

public TextCleaner(String uid) \{
 sparkUid = uid;
}

@Override
public String uid() \{  
  return sparkUid;
}

...
{quote}



> Java incompatibility when extending UnaryTransformer
> 
>
> Key: SPARK-22198
> URL: https://issues.apache.org/jira/browse/SPARK-22198
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML
>Affects Versions: 2.2.0
>Reporter: Akos Tomasits
>
> It is not possible to create proper Java custom Transformer by extending 
> UnaryTransformer.
> It seems that the method 'uid()' is called during object creation before the 
> provided 'uid' constructor parameter could be set.
> This leads to the following error:
> {quote}
>  java.lang.IllegalArgumentException: requirement failed: Param 
> _1563950936fa__inputCol does not belong to _d4105b75c4aa.
> {quote}
> If you extend UnaryTransformer and try to use it e.g. through CrossValidator, 
> you will need to explicitly include a constructor, which receives a String 
> parameter. As I saw in the source of built in transformers, this parameter is 
> a 'uid', which should be set in the object. However, it is not possible to do 
> it in time, because the uid() method is invoked (and its result might be 
> used) before this constructor finishes.
> Sample class:
> {quote}
> public class TextCleaner extends UnaryTransformer
> implements Serializable, DefaultParamsWritable, 
> DefaultParamsReadable \{
> private static final long serialVersionUID = 2658543236303100458L;
> 
> private static final String sparkUidPrefix = "TextCleaner";
> 
> private final String sparkUid;
> public TextCleaner() \{
>   sparkUid = 
> org.apache.spark.ml.util.Identifiable$.MODULE$.randomUID(sparkUidPrefix);

[jira] [Updated] (SPARK-22198) Java incompatibility when extending UnaryTransformer

2017-10-04 Thread Akos Tomasits (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akos Tomasits updated SPARK-22198:
--
Component/s: Java API

> Java incompatibility when extending UnaryTransformer
> 
>
> Key: SPARK-22198
> URL: https://issues.apache.org/jira/browse/SPARK-22198
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML
>Affects Versions: 2.2.0
>Reporter: Akos Tomasits
>
> It is not possible to create proper Java custom Transformer by extending 
> UnaryTransformer.
> It seems that the method 'uid()' is called during object creation before the 
> provided 'uid' constructor parameter could be set.
> This leads to the following error:
> {quote}
>  java.lang.IllegalArgumentException: requirement failed: Param 
> _1563950936fa__inputCol does not belong to _d4105b75c4aa.
> {quote}
> If you extend UnaryTransformer and try to use it e.g. through CrossValidator, 
> you will need to explicitly include a constructor, which receives a String 
> parameter. As I saw in the source of built in transformers, this parameter is 
> a 'uid', which should be set in the object. However, it is not possible to do 
> it in time, because the uid() method is invoked (and its result might be 
> used) before this constructor finishes.
> Sample class:
> {quote}
> public class TextCleaner extends UnaryTransformer
> implements Serializable, DefaultParamsWritable, 
> DefaultParamsReadable \{
> private static final long serialVersionUID = 2658543236303100458L;
> 
> private static final String sparkUidPrefix = "TextCleaner";
> 
> private final String sparkUid;
> public TextCleaner() \{
>   sparkUid = 
> org.apache.spark.ml.util.Identifiable$.MODULE$.randomUID(sparkUidPrefix);
>   }
>   public TextCleaner(String uid) \{
>  sparkUid = uid;
> }
> 
> @Override
> public String uid() \{
>   return sparkUid;
> }
> ...
> {quote}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22198) Java incompatibility when extending UnaryTransformer

2017-10-04 Thread Akos Tomasits (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akos Tomasits updated SPARK-22198:
--
Description: 
It is not possible to create proper Java custom Transformer by extending 
UnaryTransformer.

It seems that the method 'uid()' is called during object creation before the 
provided 'uid' constructor parameter could be set.

This leads to the following error:

{quote}
 java.lang.IllegalArgumentException: requirement failed: Param 
_1563950936fa__inputCol does not belong to _d4105b75c4aa.
{quote}

If you extend UnaryTransformer and try to use it e.g. through CrossValidator, 
you will need to explicitly include a constructor, which receives a String 
parameter. As I saw in the source of built in transformers, this parameter is a 
'uid', which should be set in the object. However, it is not possible to do it 
in time, because the uid() method is invoked (and its result might be used) 
before this constructor finishes.

Sample class:

{quote}
public class TextCleaner extends UnaryTransformer
implements Serializable, DefaultParamsWritable, 
DefaultParamsReadable \{

private static final long serialVersionUID = 2658543236303100458L;

private static final String sparkUidPrefix = "TextCleaner";

private final String sparkUid;

public TextCleaner() \{
sparkUid = 
org.apache.spark.ml.util.Identifiable$.MODULE$.randomUID(sparkUidPrefix);
}

public TextCleaner(String uid) \{
 sparkUid = uid;
}

@Override
public String uid() \{  
  return sparkUid;
}

...
{quote}


  was:
It is not possible to create proper Java custom Transformer by extending 
UnaryTransformer.

It seems that the method 'uid()' is called during object creation before the 
provided 'uid' constructor parameter could be set.

This leads to the following error:

{quote}
 java.lang.IllegalArgumentException: requirement failed: Param 
_1563950936fa__inputCol does not belong to _d4105b75c4aa.
{quote}

If you extend UnaryTransformer and try to use it e.g. through CrossValidator, 
you will need to explicitly include a constructor, which receives a String 
parameter. As I saw in the source of built in transformers, this parameter is a 
'uid', which should be set in the object. However, this is not possible to do 
it in time, because the uid() method is invoked (and its result might be used) 
before this constructor finishes.

Sample class:

{quote}
public class TextCleaner extends UnaryTransformer
implements Serializable, DefaultParamsWritable, 
DefaultParamsReadable \{

private static final long serialVersionUID = 2658543236303100458L;

private static final String sparkUidPrefix = "TextCleaner";

private final String sparkUid;

public TextCleaner() \{
sparkUid = 
org.apache.spark.ml.util.Identifiable$.MODULE$.randomUID(sparkUidPrefix);
}

public TextCleaner(String uid) \{
 sparkUid = uid;
}

@Override
public String uid() \{  
  return sparkUid;
}

...
{quote}



> Java incompatibility when extending UnaryTransformer
> 
>
> Key: SPARK-22198
> URL: https://issues.apache.org/jira/browse/SPARK-22198
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Akos Tomasits
>
> It is not possible to create proper Java custom Transformer by extending 
> UnaryTransformer.
> It seems that the method 'uid()' is called during object creation before the 
> provided 'uid' constructor parameter could be set.
> This leads to the following error:
> {quote}
>  java.lang.IllegalArgumentException: requirement failed: Param 
> _1563950936fa__inputCol does not belong to _d4105b75c4aa.
> {quote}
> If you extend UnaryTransformer and try to use it e.g. through CrossValidator, 
> you will need to explicitly include a constructor, which receives a String 
> parameter. As I saw in the source of built in transformers, this parameter is 
> a 'uid', which should be set in the object. However, it is not possible to do 
> it in time, because the uid() method is invoked (and its result might be 
> used) before this constructor finishes.
> Sample class:
> {quote}
> public class TextCleaner extends UnaryTransformer
> implements Serializable, DefaultParamsWritable, 
> DefaultParamsReadable \{
> private static final long serialVersionUID = 2658543236303100458L;
> 
> private static final String sparkUidPrefix = "TextCleaner";
> 
> private final String sparkUid;
> public TextCleaner() \{
>   sparkUid = 
> org.apache.spark.ml.util.Identifiable$.MODULE$.randomUID(sparkUidPrefix);
>   }
>   public TextCleaner(String uid) \{
>

[jira] [Updated] (SPARK-22198) Java incompatibility when extending UnaryTransformer

2017-10-04 Thread Akos Tomasits (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akos Tomasits updated SPARK-22198:
--
Description: 
It is not possible to create proper Java custom Transformer by extending 
UnaryTransformer.

It seems that the method 'uid()' is called during object creation before the 
provided 'uid' constructor parameter could be set.

This leads to the following error:

{quote}
 java.lang.IllegalArgumentException: requirement failed: Param 
_1563950936fa__inputCol does not belong to _d4105b75c4aa.
{quote}

If you extend UnaryTransformer and try to use it e.g. through CrossValidator, 
you will need to explicitly include a constructor, which receives a String 
parameter. As I saw in the source of built in transformers, this parameter is a 
'uid', which should be set in the object. However, this is not possible to do 
it in time, because the uid() method is invoked (and its result might be used) 
before this constructor finishes.

Sample class:

{quote}
public class TextCleaner extends UnaryTransformer
implements Serializable, DefaultParamsWritable, 
DefaultParamsReadable \{

private static final long serialVersionUID = 2658543236303100458L;

private static final String sparkUidPrefix = "TextCleaner";

private final String sparkUid;

public TextCleaner() \{
sparkUid = 
org.apache.spark.ml.util.Identifiable$.MODULE$.randomUID(sparkUidPrefix);
}

public TextCleaner(String uid) \{
 sparkUid = uid;
}

@Override
public String uid() \{  
  return sparkUid;
}

...
{quote}


  was:
It is not possible to create proper Java custom Transformer by extending 
UnaryTransformer.

It seems that the method 'uid()' is called during object creation before the 
provided 'uid' constructor parameter could be set.

This leads to the following error:

{quote}
 java.lang.IllegalArgumentException: requirement failed: Param 
_1563950936fa__inputCol does not belong to _d4105b75c4aa.
{quote}

If you extend UnaryTransformer and try to use it e.g. through CrossValidator, 
you will need to explicitly include a constructor, which receives a String 
parameter. As I saw in the source of built in transformers, this parameter is a 
'uid', which should be set in the object. However, this is not possible, 
because the uid() method is invoked (and its result might be used) before this 
constructor finishes.

Sample class:

{quote}
public class TextCleaner extends UnaryTransformer
implements Serializable, DefaultParamsWritable, 
DefaultParamsReadable \{

private static final long serialVersionUID = 2658543236303100458L;

private static final String sparkUidPrefix = "TextCleaner";

private final String sparkUid;

public TextCleaner() \{
sparkUid = 
org.apache.spark.ml.util.Identifiable$.MODULE$.randomUID(sparkUidPrefix);
}

public TextCleaner(String uid) \{
 sparkUid = uid;
}

@Override
public String uid() \{  
  return sparkUid;
}

...
{quote}



> Java incompatibility when extending UnaryTransformer
> 
>
> Key: SPARK-22198
> URL: https://issues.apache.org/jira/browse/SPARK-22198
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Akos Tomasits
>
> It is not possible to create proper Java custom Transformer by extending 
> UnaryTransformer.
> It seems that the method 'uid()' is called during object creation before the 
> provided 'uid' constructor parameter could be set.
> This leads to the following error:
> {quote}
>  java.lang.IllegalArgumentException: requirement failed: Param 
> _1563950936fa__inputCol does not belong to _d4105b75c4aa.
> {quote}
> If you extend UnaryTransformer and try to use it e.g. through CrossValidator, 
> you will need to explicitly include a constructor, which receives a String 
> parameter. As I saw in the source of built in transformers, this parameter is 
> a 'uid', which should be set in the object. However, this is not possible to 
> do it in time, because the uid() method is invoked (and its result might be 
> used) before this constructor finishes.
> Sample class:
> {quote}
> public class TextCleaner extends UnaryTransformer
> implements Serializable, DefaultParamsWritable, 
> DefaultParamsReadable \{
> private static final long serialVersionUID = 2658543236303100458L;
> 
> private static final String sparkUidPrefix = "TextCleaner";
> 
> private final String sparkUid;
> public TextCleaner() \{
>   sparkUid = 
> org.apache.spark.ml.util.Identifiable$.MODULE$.randomUID(sparkUidPrefix);
>   }
>   public TextCleaner(String uid) \{
>  sparkUi

[jira] [Updated] (SPARK-22198) Java incompatibility when extending UnaryTransformer

2017-10-04 Thread Akos Tomasits (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akos Tomasits updated SPARK-22198:
--
Description: 
It is not possible to create proper Java custom Transformer by extending 
UnaryTransformer.

It seems that the method 'uid()' is called during object creation before the 
provided 'uid' constructor parameter could be set.

This leads to the following error:

{quote}
 java.lang.IllegalArgumentException: requirement failed: Param 
_1563950936fa__inputCol does not belong to _d4105b75c4aa.
{quote}

If you extend UnaryTransformer and try to use it e.g. through CrossValidator, 
you will need to explicitly include a constructor, which receives a String 
parameter. As I saw in the source of built in transformers, this parameter is a 
'uid', which should be set in the object. However, this is not possible, 
because the uid() method is invoked (and its result might be used) before this 
constructor finishes.

Sample class:

{quote}
public class TextCleaner extends UnaryTransformer
implements Serializable, DefaultParamsWritable, 
DefaultParamsReadable \{

private static final long serialVersionUID = 2658543236303100458L;

private static final String sparkUidPrefix = "TextCleaner";

private final String sparkUid;

public TextCleaner() \{
sparkUid = 
org.apache.spark.ml.util.Identifiable$.MODULE$.randomUID(sparkUidPrefix);
}

public TextCleaner(String uid) \{
 sparkUid = uid;
}

@Override
public String uid() \{  
  return sparkUid;
}

...
{quote}


  was:
It is not possible to create proper custom Transformer by extending 
UnaryTransformer.

It seems that the method 'uid()' is called during object creation before the 
provided 'uid' constructor parameter could be set.

This leads to the following error:

{quote}
 java.lang.IllegalArgumentException: requirement failed: Param 
_1563950936fa__inputCol does not belong to _d4105b75c4aa.
{quote}

If you extend UnaryTransformer and try to use it e.g. through CrossValidator, 
you will need to explicitly include a constructor, which receives a String 
parameter. As I saw in the source of built in transformers, this parameter is a 
'uid', which should be set in the object. However, this is not possible, 
because the uid() method is invoked (and its result might be used) before this 
constructor finishes.

Sample class:

{quote}
public class TextCleaner extends UnaryTransformer
implements Serializable, DefaultParamsWritable, 
DefaultParamsReadable \{

private static final long serialVersionUID = 2658543236303100458L;

private static final String sparkUidPrefix = "TextCleaner";

private final String sparkUid;

public TextCleaner() \{
sparkUid = 
org.apache.spark.ml.util.Identifiable$.MODULE$.randomUID(sparkUidPrefix);
}

public TextCleaner(String uid) \{
 sparkUid = uid;
}

@Override
public String uid() \{  
  return sparkUid;
}

...
{quote}



> Java incompatibility when extending UnaryTransformer
> 
>
> Key: SPARK-22198
> URL: https://issues.apache.org/jira/browse/SPARK-22198
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Akos Tomasits
>
> It is not possible to create proper Java custom Transformer by extending 
> UnaryTransformer.
> It seems that the method 'uid()' is called during object creation before the 
> provided 'uid' constructor parameter could be set.
> This leads to the following error:
> {quote}
>  java.lang.IllegalArgumentException: requirement failed: Param 
> _1563950936fa__inputCol does not belong to _d4105b75c4aa.
> {quote}
> If you extend UnaryTransformer and try to use it e.g. through CrossValidator, 
> you will need to explicitly include a constructor, which receives a String 
> parameter. As I saw in the source of built in transformers, this parameter is 
> a 'uid', which should be set in the object. However, this is not possible, 
> because the uid() method is invoked (and its result might be used) before 
> this constructor finishes.
> Sample class:
> {quote}
> public class TextCleaner extends UnaryTransformer
> implements Serializable, DefaultParamsWritable, 
> DefaultParamsReadable \{
> private static final long serialVersionUID = 2658543236303100458L;
> 
> private static final String sparkUidPrefix = "TextCleaner";
> 
> private final String sparkUid;
> public TextCleaner() \{
>   sparkUid = 
> org.apache.spark.ml.util.Identifiable$.MODULE$.randomUID(sparkUidPrefix);
>   }
>   public TextCleaner(String uid) \{
>  sparkUid = uid;
> }
> 
> @Over

[jira] [Created] (SPARK-22198) Java incompatibility when extending UnaryTransformer

2017-10-04 Thread Akos Tomasits (JIRA)
Akos Tomasits created SPARK-22198:
-

 Summary: Java incompatibility when extending UnaryTransformer
 Key: SPARK-22198
 URL: https://issues.apache.org/jira/browse/SPARK-22198
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.2.0
Reporter: Akos Tomasits


It is not possible to create proper custom Transformer by extending 
UnaryTransformer.

It seems that the method 'uid()' is called during object creation before the 
provided 'uid' constructor parameter could be set.

This leads to the following error:

{quote}
 java.lang.IllegalArgumentException: requirement failed: Param 
_1563950936fa__inputCol does not belong to _d4105b75c4aa.
{quote}

If you extend UnaryTransformer and try to use it e.g. through CrossValidator, 
you will need to explicitly include a constructor, which receives a String 
parameter. As I saw in the source of built in transformers, this parameter is a 
'uid', which should be set in the object. However, this is not possible, 
because the uid() method is invoked (and its result might be used) before this 
constructor finishes.

Sample class:

{quote}
public class TextCleaner extends UnaryTransformer
implements Serializable, DefaultParamsWritable, 
DefaultParamsReadable \{

private static final long serialVersionUID = 2658543236303100458L;

private static final String sparkUidPrefix = "TextCleaner";

private final String sparkUid;

public TextCleaner() \{
sparkUid = 
org.apache.spark.ml.util.Identifiable$.MODULE$.randomUID(sparkUidPrefix);
}

public TextCleaner(String uid) \{
 sparkUid = uid;
}

@Override
public String uid() \{  
  return sparkUid;
}

...
{quote}
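
For comparison, a hedged Scala sketch of the pattern Spark's own feature 
transformers follow (see Tokenizer.scala): the uid is a constructor value, so it 
is already available when the superclass and trait initializers that read uid() 
run. A Java subclass cannot reproduce this ordering, because its fields are 
assigned only after the superclass constructor returns, which is why uid() is 
observed to return null above. The class name and the trivial transform function 
below are illustrative only, not a proposed fix.

{code}
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

class ScalaTextCleaner(override val uid: String)
  extends UnaryTransformer[String, String, ScalaTextCleaner] {

  // Auxiliary constructor mirrors the built-in transformers: the uid is fixed
  // at construction time rather than assigned in a subclass constructor body.
  def this() = this(Identifiable.randomUID("scalaTextCleaner"))

  override protected def createTransformFunc: String => String = _.trim.toLowerCase

  override protected def outputDataType: DataType = StringType

  override def copy(extra: ParamMap): ScalaTextCleaner = defaultCopy(extra)
}
{code}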




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22190) Add Spark executor task metrics to Dropwizard metrics

2017-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16190920#comment-16190920
 ] 

Apache Spark commented on SPARK-22190:
--

User 'LucaCanali' has created a pull request for this issue:
https://github.com/apache/spark/pull/19426

> Add Spark executor task metrics to Dropwizard metrics
> -
>
> Key: SPARK-22190
> URL: https://issues.apache.org/jira/browse/SPARK-22190
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Luca Canali
>Priority: Minor
> Attachments: SparkTaskMetrics_Grafana_example.PNG
>
>
> I would like to propose exposing Spark executor task metrics through the 
> Dropwizard metrics system. I have developed a simple implementation and run a 
> few tests using the Graphite sink and Grafana visualization; this appears to be 
> a good source of information for monitoring and troubleshooting the progress of 
> Spark jobs. I attach a screenshot of an example graph generated with Grafana.
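
For context, a hedged illustration (not the actual patch for this ticket) of how a 
single value is typically exposed through a Dropwizard MetricRegistry gauge; the 
metric name and the constant value are placeholders.

{code}
import com.codahale.metrics.{Gauge, MetricRegistry}

val registry = new MetricRegistry()

// Register a gauge; a real source would return the aggregated executor task
// metric here instead of a constant.
registry.register("executor.taskMetrics.bytesRead", new Gauge[Long] {
  override def getValue: Long = 42L
})
{code}

A sink such as the Graphite sink configured in metrics.properties would then poll 
the registry periodically and ship the values on to Grafana.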



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22190) Add Spark executor task metrics to Dropwizard metrics

2017-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22190:


Assignee: (was: Apache Spark)

> Add Spark executor task metrics to Dropwizard metrics
> -
>
> Key: SPARK-22190
> URL: https://issues.apache.org/jira/browse/SPARK-22190
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Luca Canali
>Priority: Minor
> Attachments: SparkTaskMetrics_Grafana_example.PNG
>
>
> I would like to propose exposing Spark executor task metrics through the 
> Dropwizard metrics system. I have developed a simple implementation and run a 
> few tests using the Graphite sink and Grafana visualization; this appears to be 
> a good source of information for monitoring and troubleshooting the progress of 
> Spark jobs. I attach a screenshot of an example graph generated with Grafana.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22190) Add Spark executor task metrics to Dropwizard metrics

2017-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22190:


Assignee: Apache Spark

> Add Spark executor task metrics to Dropwizard metrics
> -
>
> Key: SPARK-22190
> URL: https://issues.apache.org/jira/browse/SPARK-22190
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Luca Canali
>Assignee: Apache Spark
>Priority: Minor
> Attachments: SparkTaskMetrics_Grafana_example.PNG
>
>
> I would like to propose exposing Spark executor task metrics through the 
> Dropwizard metrics system. I have developed a simple implementation and run a 
> few tests using the Graphite sink and Grafana visualization; this appears to be 
> a good source of information for monitoring and troubleshooting the progress of 
> Spark jobs. I attach a screenshot of an example graph generated with Grafana.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


