[jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results

2017-10-05 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194170#comment-16194170
 ] 

Takeshi Yamamuro commented on SPARK-22211:
--

The suggested solution probably does not work when both input tables are small 
and the limit value is high. IMHO, one option is simply to stop the pushdown for 
full-outer joins (because I couldn't find a smarter way to solve this...).
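To make that concern concrete, here is a hedged sketch of my own (not from the ticket), assuming a local SparkSession named {{spark}}: with small, disjoint inputs and a limit larger than either side, a "push the limit and convert to a left outer join" rewrite can never produce the (null, x) rows that the full outer join must still return.

{code:java}
// Hedged illustration only; `spark` is assumed to be an existing local SparkSession.
import org.apache.spark.sql.functions.broadcast
import spark.implicits._

val dl = (1 to 5).toDF("id")   // small left side
val dr = (6 to 10).toDF("id")  // small, disjoint right side

// Correct semantics: the full outer join has 10 rows, none of them matched,
// and limit(100) must be free to return any (in fact all) of them.
val full = dl.join(dr, dl("id") === dr("id"), "outer").limit(100)

// Proposed rewrite: limit the left side, then left-outer join it to the right.
// The five (null, x) rows can never be produced this way.
val limitedLeft = dl.limit(100)
val rewritten = limitedLeft.join(broadcast(dr), limitedLeft("id") === dr("id"), "left_outer")

println(full.count())       // 10
println(rewritten.count())  // 5
{code}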

> LimitPushDown optimization for FullOuterJoin generates wrong results
> 
>
> Key: SPARK-22211
> URL: https://issues.apache.org/jira/browse/SPARK-22211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: on community.cloud.databricks.com 
> Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11)
>Reporter: Benyi Wang
>
> LimitPushDown pushes LocalLimit to one side of a FullOuterJoin, but this may 
> generate a wrong result:
> Assume we use limit(1) and LocalLimit is pushed to the left side, and id=999 
> is selected, but on the right side we have 100K rows including 999. The result 
> will be
> - one row (999, 999)
> - the remaining rows (null, xxx)
> Once you call show(), the row (999, 999) has only a 1/10th chance of being 
> selected by CollectLimit.
> The actual optimization might be:
> - push down the limit
> - but convert the join to a broadcast LeftOuterJoin or RightOuterJoin.
> Here is my notebook:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html
> {code:java}
> import scala.util.Random._
> val dl = shuffle(1 to 10).toDF("id")
> val dr = shuffle(1 to 10).toDF("id")
> println("data frame dl:")
> dl.explain
> println("data frame dr:")
> dr.explain
> val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1)
> j.explain
> j.show(false)
> {code}
> {code}
> data frame dl:
> == Physical Plan ==
> LocalTableScan [id#10]
> data frame dr:
> == Physical Plan ==
> LocalTableScan [id#16]
> == Physical Plan ==
> CollectLimit 1
> +- SortMergeJoin [id#10], [id#16], FullOuter
>    :- *Sort [id#10 ASC NULLS FIRST], false, 0
>    :  +- Exchange hashpartitioning(id#10, 200)
>    :     +- *LocalLimit 1
>    :        +- LocalTableScan [id#10]
>    +- *Sort [id#16 ASC NULLS FIRST], false, 0
>       +- Exchange hashpartitioning(id#16, 200)
>          +- LocalTableScan [id#16]
> import scala.util.Random._
> dl: org.apache.spark.sql.DataFrame = [id: int]
> dr: org.apache.spark.sql.DataFrame = [id: int]
> j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int]
> +----+---+
> |id  |id |
> +----+---+
> |null|148|
> +----+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8515) Improve ML attribute API

2017-10-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194142#comment-16194142
 ] 

Apache Spark commented on SPARK-8515:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/19442

> Improve ML attribute API
> 
>
> Key: SPARK-8515
> URL: https://issues.apache.org/jira/browse/SPARK-8515
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>  Labels: advanced
> Attachments: SPARK-8515.pdf
>
>
> In 1.4.0, we introduced ML attribute API to embed feature/label attribute 
> info inside DataFrame's schema. However, the API is not very friendly to use. 
> We should re-visit this API and see how we can improve it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8515) Improve ML attribute API

2017-10-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8515:
---

Assignee: (was: Apache Spark)

> Improve ML attribute API
> 
>
> Key: SPARK-8515
> URL: https://issues.apache.org/jira/browse/SPARK-8515
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>  Labels: advanced
> Attachments: SPARK-8515.pdf
>
>
> In 1.4.0, we introduced ML attribute API to embed feature/label attribute 
> info inside DataFrame's schema. However, the API is not very friendly to use. 
> We should re-visit this API and see how we can improve it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8515) Improve ML attribute API

2017-10-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8515:
---

Assignee: Apache Spark

> Improve ML attribute API
> 
>
> Key: SPARK-8515
> URL: https://issues.apache.org/jira/browse/SPARK-8515
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>  Labels: advanced
> Attachments: SPARK-8515.pdf
>
>
> In 1.4.0, we introduced ML attribute API to embed feature/label attribute 
> info inside DataFrame's schema. However, the API is not very friendly to use. 
> We should re-visit this API and see how we can improve it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22202) Release tgz content differences for python and R

2017-10-05 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194141#comment-16194141
 ] 

Felix Cheung commented on SPARK-22202:
--

[~holden.ka...@gmail.com] actually, I think for R we would go the other way: we 
would want to include what is currently only in the hadoop2.6 build in all the 
other release profiles (i.e. run *this*, then create the tgz).
So I think the approaches are potentially opposite for R and Python.

> Release tgz content differences for python and R
> 
>
> Key: SPARK-22202
> URL: https://issues.apache.org/jira/browse/SPARK-22202
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SparkR
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
>
> As a follow-up to SPARK-22167: currently we are running different 
> profiles/steps in make-release.sh for hadoop2.7 vs hadoop2.6 (and others), so 
> we should consider whether these differences are significant and whether they 
> should be addressed.
> A couple of things:
> - the R.../doc directory is not in any release tgz except hadoop 2.6
> - python/dist and python.egg-info are not in any release tgz except hadoop 2.7
> - the R DESCRIPTION file has a few additions
> I've checked to confirm these are the same in the 2.1.1 release, so this isn't 
> a regression.
> {code}
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/doc:
> sparkr-vignettes.Rmd
> sparkr-vignettes.R
> sparkr-vignettes.html
> index.html
> Only in spark-2.1.2-bin-hadoop2.7/python: dist
> Only in spark-2.1.2-bin-hadoop2.7/python/pyspark: python
> Only in spark-2.1.2-bin-hadoop2.7/python: pyspark.egg-info
> diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/DESCRIPTION 
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/DESCRIPTION
> 25a26,27
> > NeedsCompilation: no
> > Packaged: 2017-10-03 00:42:30 UTC; holden
> 31c33
> < Built: R 3.4.1; ; 2017-10-02 23:18:21 UTC; unix
> ---
> > Built: R 3.4.1; ; 2017-10-03 00:45:27 UTC; unix
> Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR: doc
> diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/html/00Index.html 
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/html/00Index.html
> 16a17
> > User guides, package vignettes and other 
> > documentation.
> Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/Meta: vignette.rds
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22202) Release tgz content differences for python and R

2017-10-05 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22202:
-
Priority: Minor  (was: Major)

> Release tgz content differences for python and R
> 
>
> Key: SPARK-22202
> URL: https://issues.apache.org/jira/browse/SPARK-22202
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SparkR
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
>
> As a follow-up to SPARK-22167: currently we are running different 
> profiles/steps in make-release.sh for hadoop2.7 vs hadoop2.6 (and others), so 
> we should consider whether these differences are significant and whether they 
> should be addressed.
> A couple of things:
> - the R.../doc directory is not in any release tgz except hadoop 2.6
> - python/dist and python.egg-info are not in any release tgz except hadoop 2.7
> - the R DESCRIPTION file has a few additions
> I've checked to confirm these are the same in the 2.1.1 release, so this isn't 
> a regression.
> {code}
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/doc:
> sparkr-vignettes.Rmd
> sparkr-vignettes.R
> sparkr-vignettes.html
> index.html
> Only in spark-2.1.2-bin-hadoop2.7/python: dist
> Only in spark-2.1.2-bin-hadoop2.7/python/pyspark: python
> Only in spark-2.1.2-bin-hadoop2.7/python: pyspark.egg-info
> diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/DESCRIPTION 
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/DESCRIPTION
> 25a26,27
> > NeedsCompilation: no
> > Packaged: 2017-10-03 00:42:30 UTC; holden
> 31c33
> < Built: R 3.4.1; ; 2017-10-02 23:18:21 UTC; unix
> ---
> > Built: R 3.4.1; ; 2017-10-03 00:45:27 UTC; unix
> Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR: doc
> diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/html/00Index.html 
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/html/00Index.html
> 16a17
> > User guides, package vignettes and other 
> > documentation.
> Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/Meta: vignette.rds
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8515) Improve ML attribute API

2017-10-05 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194074#comment-16194074
 ] 

Liang-Chi Hsieh edited comment on SPARK-8515 at 10/6/17 4:52 AM:
-

I'm working on a new ML attribute API which is supposed to solve issues like 
SPARK-19141 and SPARK-21926, and maybe SPARK-12886.

The basic attribute design is similar to the previous API, but the new API 
supports sparse Vector attributes, so we don't need to keep every attribute in a 
Vector column. The new API also trims down some unnecessary parts of the 
previous API.

The design doc is attached in this JIRA.

cc [~mlnick]


was (Author: viirya):
I'm working on a new ML attribute API which is supposed to solve the issues 
like SPARK-19141 and SPARK-21926 and maybe SPARK-12886.

The basic attribute design is similar to previous API. But new API supports 
sparse Vector attributes. So we don't need to keep every attributes in a Vector 
column. The new API is also trimming down some unnecessary parts in previous 
API.

cc [~mlnick]

> Improve ML attribute API
> 
>
> Key: SPARK-8515
> URL: https://issues.apache.org/jira/browse/SPARK-8515
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>  Labels: advanced
> Attachments: SPARK-8515.pdf
>
>
> In 1.4.0, we introduced ML attribute API to embed feature/label attribute 
> info inside DataFrame's schema. However, the API is not very friendly to use. 
> We should re-visit this API and see how we can improve it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8515) Improve ML attribute API

2017-10-05 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-8515:
---
Attachment: SPARK-8515.pdf

> Improve ML attribute API
> 
>
> Key: SPARK-8515
> URL: https://issues.apache.org/jira/browse/SPARK-8515
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>  Labels: advanced
> Attachments: SPARK-8515.pdf
>
>
> In 1.4.0, we introduced ML attribute API to embed feature/label attribute 
> info inside DataFrame's schema. However, the API is not very friendly to use. 
> We should re-visit this API and see how we can improve it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20055) Documentation for CSV datasets in SQL programming guide

2017-10-05 Thread Jorge Machado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194129#comment-16194129
 ] 

Jorge Machado commented on SPARK-20055:
---

[~aash] Should I copy-paste those options? There are already some docs:


{code:java}
def csv(paths: String*): DataFrame
Loads CSV files and returns the result as a DataFrame.

This function will go through the input once to determine the input schema if 
inferSchema is enabled. To avoid going through the entire data once, disable 
inferSchema option or specify the schema explicitly using schema.

You can set the following CSV-specific options to deal with CSV files:

sep (default ,): sets the single character as a separator for each field and 
value.
encoding (default UTF-8): decodes the CSV files by the given encoding type.
quote (default "): sets the single character used for escaping quoted values 
where the separator can be part of the value. If you would like to turn off 
quotations, you need to set not null but an empty string. This behaviour is 
different from com.databricks.spark.csv.
escape (default \): sets the single character used for escaping quotes inside 
an already quoted value.
comment (default empty string): sets the single character used for skipping 
lines beginning with this character. By default, it is disabled.
header (default false): uses the first line as names of columns.
inferSchema (default false): infers the input schema automatically from data. 
It requires one extra pass over the data.
ignoreLeadingWhiteSpace (default false): a flag indicating whether or not 
leading whitespaces from values being read should be skipped.
ignoreTrailingWhiteSpace (default false): a flag indicating whether or not 
trailing whitespaces from values being read should be skipped.
nullValue (default empty string): sets the string representation of a null 
value. Since 2.0.1, this applies to all supported types including the string 
type.
nanValue (default NaN): sets the string representation of a non-number value.
positiveInf (default Inf): sets the string representation of a positive 
infinity value.
negativeInf (default -Inf): sets the string representation of a negative 
infinity value.
dateFormat (default yyyy-MM-dd): sets the string that indicates a date format. 
Custom date formats follow the formats at java.text.SimpleDateFormat. This 
applies to date type.
timestampFormat (default yyyy-MM-dd'T'HH:mm:ss.SSSXXX): sets the string that 
indicates a timestamp format. Custom date formats follow the formats at 
java.text.SimpleDateFormat. This applies to timestamp type.
maxColumns (default 20480): defines a hard limit of how many columns a record 
can have.
maxCharsPerColumn (default -1): defines the maximum number of characters 
allowed for any given value being read. By default, it is -1 meaning unlimited 
length
mode (default PERMISSIVE): allows a mode for dealing with corrupt records 
during parsing. It supports the following case-insensitive modes.
PERMISSIVE : sets other fields to null when it meets a corrupted record, and 
puts the malformed string into a field configured by columnNameOfCorruptRecord. 
To keep corrupt records, a user can set a string type field named 
columnNameOfCorruptRecord in a user-defined schema. If a schema does not have 
the field, it drops corrupt records during parsing. When a length of parsed CSV 
tokens is shorter than an expected length of a schema, it sets null for extra 
fields.
DROPMALFORMED : ignores the whole corrupted records.
FAILFAST : throws an exception when it meets corrupted records.
columnNameOfCorruptRecord (default is the value specified in 
spark.sql.columnNameOfCorruptRecord): allows renaming the new field having 
malformed string created by PERMISSIVE mode. This overrides 
spark.sql.columnNameOfCorruptRecord.
multiLine (default false): parse one record, which may span multiple lines.
{code}
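For context, here is a minimal usage sketch of a few of the options listed above (my own example; the SparkSession {{spark}} and the file {{people.csv}} are assumptions, not from this ticket):

{code:java}
// Hedged example of the CSV reader options documented above.
val people = spark.read
  .option("header", "true")        // first line holds the column names
  .option("inferSchema", "true")   // one extra pass over the data to infer types
  .option("sep", ",")              // field separator
  .option("nullValue", "NA")       // string to read back as null
  .option("mode", "PERMISSIVE")    // keep corrupt records instead of failing
  .csv("people.csv")

people.printSchema()
people.show()
{code}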

> Documentation for CSV datasets in SQL programming guide
> ---
>
> Key: SPARK-20055
> URL: https://issues.apache.org/jira/browse/SPARK-20055
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> I guess things commonly used and important are documented there rather than 
> documenting everything and every option in the programming guide - 
> http://spark.apache.org/docs/latest/sql-programming-guide.html.
> It seems JSON datasets 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets 
> are documented whereas CSV datasets are not. 
> Nowadays, they are pretty similar in APIs and options. Some options are 
> notable for both, in particular ones such as {{wholeFile}}. Moreover, 
> several options such as {{inferSchema}} and {{header}} are important in CSV 
> that affect the 

[jira] [Resolved] (SPARK-22159) spark.sql.execution.arrow.enable and spark.sql.codegen.aggregate.map.twolevel.enable -> enabled

2017-10-05 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-22159.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

> spark.sql.execution.arrow.enable and 
> spark.sql.codegen.aggregate.map.twolevel.enable -> enabled
> ---
>
> Key: SPARK-22159
> URL: https://issues.apache.org/jira/browse/SPARK-22159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.3.0
>
>
> We should make the config names consistent. They are supposed to end with 
> ".enabled", rather than "enable".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22153) Rename ShuffleExchange -> ShuffleExchangeExec

2017-10-05 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-22153.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

> Rename ShuffleExchange -> ShuffleExchangeExec
> -
>
> Key: SPARK-22153
> URL: https://issues.apache.org/jira/browse/SPARK-22153
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.3.0
>
>
> For some reason when we added the Exec suffix to all physical operators, we 
> missed this one. I was looking for this physical operator today and couldn't 
> find it, because I was looking for ExchangeExec.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19141) VectorAssembler metadata causing memory issues

2017-10-05 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194075#comment-16194075
 ] 

Liang-Chi Hsieh commented on SPARK-19141:
-

I'm working on a new ML attribute API (SPARK-8515) which is supposed to solve 
issues like SPARK-19141 and SPARK-21926, and maybe SPARK-12886.

The basic attribute design is similar to the previous API, but the new API 
supports sparse Vector attributes, so we don't need to keep every attribute in a 
Vector column. The new API also trims down some unnecessary parts of the 
previous API.
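In the meantime, a commonly used interim workaround (my own hedged sketch, not part of the proposed API) is to drop the per-slot ML attribute metadata from the assembled column. This avoids the huge schema at the cost of losing attribute information for downstream stages that rely on it; {{processedData}} and the "Features" column refer to the reporter's repro below.

{code:java}
// Hedged workaround sketch: strip the ml_attr metadata from the assembled column.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.Metadata

val slimmed = processedData.withColumn(
  "Features",
  col("Features").as("Features", Metadata.empty))
{code}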

> VectorAssembler metadata causing memory issues
> --
>
> Key: SPARK-19141
> URL: https://issues.apache.org/jira/browse/SPARK-19141
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.0, 2.0.0, 2.1.0
> Environment: Windows 10, Ubuntu 16.04.1, Scala 2.11.8, Spark 1.6.0, 
> 2.0.0, 2.1.0
>Reporter: Antonia Oprescu
>
> VectorAssembler produces unnecessary metadata that overflows the Java heap in 
> the case of sparse vectors. In the example below, the logical length of the 
> vector is 10^6, but the number of non-zero values is only 2.
> The problem arises when the vector assembler creates metadata (ML attributes) 
> for each of the 10^6 slots, even if this metadata didn't exist upstream (i.e. 
> HashingTF doesn't produce metadata per slot). Here is a chunk of metadata it 
> produces:
> {noformat}
> {"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"HashedFeat_0"},{"idx":1,"name":"HashedFeat_1"},{"idx":2,"name":"HashedFeat_2"},{"idx":3,"name":"HashedFeat_3"},{"idx":4,"name":"HashedFeat_4"},{"idx":5,"name":"HashedFeat_5"},{"idx":6,"name":"HashedFeat_6"},{"idx":7,"name":"HashedFeat_7"},{"idx":8,"name":"HashedFeat_8"},{"idx":9,"name":"HashedFeat_9"},...,{"idx":100,"name":"Feat01"}]},"num_attrs":101}}
> {noformat}
> In this lightweight example, the feature size limit seems to be 1,000,000 
> when run locally, but this scales poorly with more complicated routines. With 
> a larger dataset and a learner (say LogisticRegression), it maxes out 
> anywhere between 10k and 100k hash size even on a decent sized cluster.
> I did some digging, and it seems that the only metadata necessary for 
> downstream learners is the one indicating categorical columns. Thus, I 
> thought of the following possible solutions:
> 1. Compact representation of ml attributes metadata (but this seems to be a 
> bigger change)
> 2. Removal of non-categorical tags from the metadata created by the 
> VectorAssembler
> 3. An option on the existent VectorAssembler to skip unnecessary ml 
> attributes or create another transformer altogether
> I would be happy to take a stab at any of these solutions, but I need some 
> direction from the Spark community.
> {code:title=VABug.scala |borderStyle=solid}
> import org.apache.spark.SparkConf
> import org.apache.spark.ml.feature.{HashingTF, VectorAssembler}
> import org.apache.spark.sql.SparkSession
> object VARepro {
>   case class Record(Label: Double, Feat01: Double, Feat02: Array[String])
>   def main(args: Array[String]) {
> val conf = new SparkConf()
>   .setAppName("Vector assembler bug")
>   .setMaster("local[*]")
> val spark = SparkSession.builder.config(conf).getOrCreate()
> import spark.implicits._
> val df = Seq(Record(1.0, 2.0, Array("4daf")), Record(0.0, 3.0, 
> Array("a9ee"))).toDS()
> val numFeatures = 1000
> val hashingScheme = new 
> HashingTF().setInputCol("Feat02").setOutputCol("HashedFeat").setNumFeatures(numFeatures)
> val hashedData = hashingScheme.transform(df)
> val vectorAssembler = new 
> VectorAssembler().setInputCols(Array("HashedFeat","Feat01")).setOutputCol("Features")
> val processedData = vectorAssembler.transform(hashedData).select("Label", 
> "Features")
> processedData.show()
>   }
> }
> {code}
> *Stacktrace from the example above:*
> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit 
> exceeded
>   at 
> org.apache.spark.ml.attribute.NumericAttribute.copy(attributes.scala:272)
>   at 
> org.apache.spark.ml.attribute.NumericAttribute.withIndex(attributes.scala:215)
>   at 
> org.apache.spark.ml.attribute.NumericAttribute.withIndex(attributes.scala:195)
>   at 
> org.apache.spark.ml.attribute.AttributeGroup$$anonfun$3$$anonfun$apply$1.apply(AttributeGroup.scala:71)
>   at 
> org.apache.spark.ml.attribute.AttributeGroup$$anonfun$3$$anonfun$apply$1.apply(AttributeGroup.scala:70)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> scala.collection.IterableLike$class.copyToArray(IterableLike.scala:254)
>   at 
> scala.collection.SeqViewLike$AbstractTransformed.copyToArray(SeqViewLike.scala:37)
>   at 
> 

[jira] [Commented] (SPARK-8515) Improve ML attribute API

2017-10-05 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194074#comment-16194074
 ] 

Liang-Chi Hsieh commented on SPARK-8515:


I'm working on a new ML attribute API which is supposed to solve issues like 
SPARK-19141 and SPARK-21926, and maybe SPARK-12886.

The basic attribute design is similar to the previous API, but the new API 
supports sparse Vector attributes, so we don't need to keep every attribute in a 
Vector column. The new API also trims down some unnecessary parts of the 
previous API.

cc [~mlnick]

> Improve ML attribute API
> 
>
> Key: SPARK-8515
> URL: https://issues.apache.org/jira/browse/SPARK-8515
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>  Labels: advanced
>
> In 1.4.0, we introduced ML attribute API to embed feature/label attribute 
> info inside DataFrame's schema. However, the API is not very friendly to use. 
> We should re-visit this API and see how we can improve it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results

2017-10-05 Thread Benyi Wang (JIRA)
Benyi Wang created SPARK-22211:
--

 Summary: LimitPushDown optimization for FullOuterJoin generates 
wrong results
 Key: SPARK-22211
 URL: https://issues.apache.org/jira/browse/SPARK-22211
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
 Environment: on community.cloud.databricks.com 
Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11)
Reporter: Benyi Wang


LimitPushDown pushes LocalLimit to one side of a FullOuterJoin, but this may 
generate a wrong result:

Assume we use limit(1) and LocalLimit is pushed to the left side, and id=999 
is selected, but on the right side we have 100K rows including 999. The result 
will be
- one row (999, 999)
- the remaining rows (null, xxx)

Once you call show(), the row (999, 999) has only a 1/10th chance of being 
selected by CollectLimit.

The actual optimization might be:
- push down the limit
- but convert the join to a broadcast LeftOuterJoin or RightOuterJoin.

Here is my notebook:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html
{code:java}
import scala.util.Random._

val dl = shuffle(1 to 10).toDF("id")
val dr = shuffle(1 to 10).toDF("id")

println("data frame dl:")
dl.explain

println("data frame dr:")
dr.explain

val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1)

j.explain

j.show(false)
{code}

{code}
data frame dl:
== Physical Plan ==
LocalTableScan [id#10]
data frame dr:
== Physical Plan ==
LocalTableScan [id#16]
== Physical Plan ==
CollectLimit 1
+- SortMergeJoin [id#10], [id#16], FullOuter
   :- *Sort [id#10 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(id#10, 200)
   :     +- *LocalLimit 1
   :        +- LocalTableScan [id#10]
   +- *Sort [id#16 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(id#16, 200)
         +- LocalTableScan [id#16]
import scala.util.Random._
dl: org.apache.spark.sql.DataFrame = [id: int]
dr: org.apache.spark.sql.DataFrame = [id: int]
j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int]

+----+---+
|id  |id |
+----+---+
|null|148|
+----+---+
{code}
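For what it's worth, here is a hedged sketch of my own (not from the ticket) of the proposed rewrite expressed at the DataFrame level, reusing {{dl}} and {{dr}} from above: take the limited rows from one side, then left-outer join them against the broadcast other side, so every returned row is a genuine row of the full outer join (as long as the limit is smaller than both sides).

{code:java}
// Hedged illustration of "push down the limit, but convert to a broadcast left outer join".
import org.apache.spark.sql.functions.broadcast

val oneLeft = dl.limit(1)
val suggested = oneLeft.join(broadcast(dr), oneLeft("id") === dr("id"), "left_outer")

// Whatever row limit(1) picks on the left, the output is either (v, v) or (v, null),
// and both are rows of the original full outer join.
suggested.show(false)
{code}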



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-05 Thread Michael N (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael N updated SPARK-22163:
--
Description: 
The application objects can contain Lists and can be modified dynamically as 
well. However, the Spark Streaming framework asynchronously serializes the 
application's objects as the application runs. Therefore, it causes a random 
run-time exception on a List when the Spark Streaming framework happens to 
serialize the application's objects while the application is modifying a List in 
its own object.

In fact, there are multiple bugs reported about

Caused by: java.util.ConcurrentModificationException
at java.util.ArrayList.writeObject

that are permutations of the same root cause. The design issue of the Spark 
Streaming framework is that it does this serialization asynchronously. Instead, 
it should either

1. do this serialization synchronously (preferred, because it eliminates the 
issue completely), or

2. allow it to be configured per application whether to do this serialization 
synchronously or asynchronously, depending on the nature of each application.

Also, the Spark documentation should describe the conditions that trigger Spark 
to do this type of serialization asynchronously, so applications can work 
around them until a fix is provided.

===

Vadim Semenov and Steve Loughran, per your inquiries in ticket 
https://issues.apache.org/jira/browse/SPARK-21999, I am posting the reply here 
because this issue involves Spark's design and not necessarily its code 
implementation.

—

My application does not spin up its own threads. All the threads are controlled 
by Spark.

Batch interval = 5 seconds

Batch #3
1. Driver - Spark Thread #1 - starts batch #3 and blocks until all slave 
threads are done with this batch
2. Slave A - Spark Thread #2. Say it takes 10 seconds to complete
3. Slave B - Spark Thread #3. Say it takes 1 minute to complete

4. Both thread #1 for the driver and thread #2 for Slave A do not jump ahead 
and process batch #4. Instead, they wait for thread #3 until it is done. => So 
there is already synchronization among the threads within the same batch. Also, 
batch to batch is synchronous.

5. After Spark Thread #3 is done, the driver does other processing to finish 
the current batch. In my case, it updates a list of objects.

The above steps repeat for the next batch #4 and subsequent batches.

Based on the exception stack trace, it looks like in step 5 Spark has another 
thread #4 that serializes application objects asynchronously. So it causes 
random occurrences of ConcurrentModificationException, because the list of 
objects is being changed by Spark's own thread #1 for the driver.

So the issue is not that my application "is modifying a collection 
asynchronously w.r.t. Spark" as Sean kept claiming. Instead, it is Spark's 
asynchronous operations among its own different threads within the same batch 
that cause this issue.

I understand Spark needs to serialize objects for checkpointing purposes. 
However, since Spark controls all the threads and their synchronization, it is 
a Spark design issue that there is no synchronization between threads #1 and 
#4, which triggers the ConcurrentModificationException. That is the root cause 
of this issue.

Further, even if the application does not modify its list of objects, in step 5 
the driver could be modifying multiple native objects, say two integers. In 
thread #1 the driver could have updated integer X but not yet integer Y when 
Spark's thread #4 asynchronously serializes the application objects. So the 
persisted serialized data does not match the actual data. This results in a 
permutation of this issue with a false-positive condition where the serialized 
checkpoint data is only partially correct.

One solution for both issues is to modify Spark's design and allow the 
serialization of application objects by Spark's thread #4 to be configurable 
per application as either asynchronous or synchronous with Spark's thread #1. 
That way, it is up to individual applications to decide based on the nature of 
their business requirements and needed throughput.


  was:
The application objects can contain List and can be modified dynamically as 
well.   However, Spark Streaming framework asynchronously serializes the 
application's objects as the application runs.  Therefore, it causes random 
run-time exception on the List when Spark Streaming framework happens to 
serializes the application's objects while the application modifies a List in 
its own object.  

In fact, there are multiple bugs reported about

Caused by: java.util.ConcurrentModificationException
at java.util.ArrayList.writeObject

that are permutation of the same root cause. So the design issue of Spark 
streaming framework is that it should do this serialization asynchronously.  
Instead, it should either

1. do this serialization 

[jira] [Comment Edited] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-05 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193907#comment-16193907
 ] 

Michael N edited comment on SPARK-22163 at 10/6/17 12:24 AM:
-

Vadim Semenov and Steve Loughran, per your inquiries in ticket 
https://issues.apache.org/jira/browse/SPARK-21999, I am posting the reply here 
because this issue involves Spark's design and not necessarily its code 
implementation.

---

My application does not spin up its own threads. All the threads are controlled 
by Spark.

Batch interval = 5 seconds

Batch #3
1. Driver - Spark Thread #1 - starts batch #3 and blocks until all slave 
threads are done with this batch
2. Slave A - Spark Thread #2. Say it takes 10 seconds to complete
3. Slave B - Spark Thread #3. Say it takes 1 minute to complete

4. Both thread #1 for the driver and thread #2 for Slave A do not jump ahead 
and process batch #4. Instead, they wait for thread #3 until it is done. => So 
there is already synchronization among the threads within the same batch. Also, 
batch to batch is synchronous.

5. After Spark Thread #3 is done, the driver does other processing to finish 
the current batch. In my case, it updates a list of objects.

The above steps repeat for the next batch #4 and subsequent batches.

Based on the exception stack trace, it looks like in step 5 Spark has another 
thread #4 that serializes application objects asynchronously. So it causes 
random occurrences of ConcurrentModificationException, because the list of 
objects is being changed by Spark's own thread #1 for the driver.

So the issue is not that my application "is modifying a collection 
asynchronously w.r.t. Spark" as Sean kept claiming. Instead, it is Spark's 
asynchronous operations among its own different threads within the same batch 
that cause this issue.

I understand Spark needs to serialize objects for checkpointing purposes. 
However, since Spark controls all the threads and their synchronization, it is 
a Spark design issue that there is no synchronization between threads #1 and 
#4, which triggers the ConcurrentModificationException. That is the root cause 
of this issue.

Further, even if the application does not modify its list of objects, in step 5 
the driver could be modifying multiple native objects, say two integers. In 
thread #1 the driver could have updated integer X but not yet integer Y when 
Spark's thread #4 asynchronously serializes the application objects. So the 
persisted serialized data does not match the actual data. This results in a 
permutation of this issue with a false-positive condition where the serialized 
checkpoint data is only partially correct.

One solution for both issues is to modify Spark's design and allow the 
serialization of application objects by Spark's thread #4 to be configurable 
per application as either asynchronous or synchronous with Spark's thread #1. 
That way, it is up to individual applications to decide based on the nature of 
their business requirements and needed throughput.



was (Author: michaeln_apache):
Vadim Semenov and Steve Loughran, per your inquiries in ticket 
https://issues.apache.org/jira/browse/SPARK-21999, I am posting the reply here 
because this issue involves Spark's design and not necessarily its code 
implementation.

---

My application does not spin up its own thread. All the threads are controlled 
by Spark.

Batch interval = 5 seconds

Batch #3
1. Driver - Spark Thread #1 - starts batch #3 and blocks until all slave 
threads are done with this batch
2. Slave A - Spark Thread #2. Says it takes 10 seconds to complete
3. Slave B - Spark Thread #3. Says it takes 1 minutes to complete

4. Both thread #1 for the driver and thread 2# for Slave A do not jump ahead 
and process batch #4. Instead, they wait for  thread #3 until it is done.  => 
So there is already synchronization among the threads within the same batch. 
Also, batch to batch is synchronous.

5. After Spark Thread #3 is done, the driver does other processing to finish 
the current batch.  In my case, it updates a list of objects.

The above steps repeat for the next batch #4 and subsequent batches.

Based on the exception stack trace, it looks like in step 5, Spark has another 
thread #4 that serializes application objects asynchronously. So it causes 
random occurrences of ConcurrentModificationException, because the list of 
objects is being changed by Spark own thread #1 for the driver.

So the issue is not that my application "is modifying a collection 
asynchronously w.r.t. Spark" as Sean kept claiming. Instead, it is Spark's 
asynchronous operations among its own different threads within the same batch 
that causes this issue.

I understand Spark needs to serializes objects for check point purposes. 
However, since Spark controls all the threads and their synchronization, it is 
a Spark design's issue 

[jira] [Comment Edited] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-05 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193907#comment-16193907
 ] 

Michael N edited comment on SPARK-22163 at 10/6/17 12:21 AM:
-

Vadim Semenov and Steve Loughran, per your inquiries in ticket 
https://issues.apache.org/jira/browse/SPARK-21999, I am posting the reply here 
because this issue involves Spark's design and not necessarily its code 
implementation.

---

My application does not spin up its own thread. All the threads are controlled 
by Spark.

Batch interval = 5 seconds

Batch #3
1. Driver - Spark Thread #1 - starts batch #3 and blocks until all slave 
threads are done with this batch
2. Slave A - Spark Thread #2. Says it takes 10 seconds to complete
3. Slave B - Spark Thread #3. Says it takes 1 minutes to complete

4. Both thread #1 for the driver and thread 2# for Slave A do not jump ahead 
and process batch #4. Instead, they wait for  thread #3 until it is done.  => 
So there is already synchronization among the threads within the same batch. 
Also, batch to batch is synchronous.

5. After Spark Thread #3 is done, the driver does other processing to finish 
the current batch.  In my case, it updates a list of objects.

The above steps repeat for the next batch #4 and subsequent batches.

Based on the exception stack trace, it looks like in step 5, Spark has another 
thread #4 that serializes application objects asynchronously. So it causes 
random occurrences of ConcurrentModificationException, because the list of 
objects is being changed by Spark own thread #1 for the driver.

So the issue is not that my application "is modifying a collection 
asynchronously w.r.t. Spark" as Sean kept claiming. Instead, it is Spark's 
asynchronous operations among its own different threads within the same batch 
that causes this issue.

I understand Spark needs to serializes objects for check point purposes. 
However, since Spark controls all the threads and their synchronization, it is 
a Spark design's issue for the lack of synchronization between threads #1 and 
#4, that triggers ConcurrentModificationException that is the root cause of 
this issue.

Further, even if the application does not modify its list of objects, in step 5 
the driver could be modifying multiple native objects say two integers.  In 
thread #1 the driver could have updated integer X and before it could update 
integer Y,  when Spark's  thread #4  asynchronous serializes the application 
objects. So the persisted serialized data does not match with the actual data.  
This resulted in a permutation of this issue with a false positive condition 
where the serialized checkpoint data has partially correct data.

One solution for both issues is to modify Spark's design and allow the 
serialization of application objects by Spark's  thread #4 to be configurable 
per application to be either asynchronous or synchronous with Spark's thread 
#1.  That way, it is up to individual applications to decide based on the 
nature of their business requirements and needed throughput.



was (Author: michaeln_apache):
Vadim Semenov and Steve Loughran, per your inquiries in ticket 
https://issues.apache.org/jira/browse/SPARK-21999, I am posting the reply here 
because this issue involves Spark's design and not necessarily its code 
implementation.

---

My application does not spin up its own thread. All the threads are controlled 
by Spark.

Batch interval = 5 seconds

Batch #3
1. Driver - Spark Thread #1 - starts batch #3 and blocks until all slave 
threads are done with this batch
2. Slave A - Spark Thread #2 takes 10 seconds to complete
3. Slave B - Spark Thread #3 takes 1 minutes to complete

4. Both thread 1 for the driver and thread 2 for Slave A do not jump ahead and 
process batch #4. Instead, they synchronize with  thread B until it is done.  
=> So there is synchronization among the threads within the same batch, and 
also batch to batch is synchronous.

5. After Spark Thread #3 is done, the driver does other processing to finish 
the current batch.  In my case, it updates a list of objects.

The above steps repeat for the next batch #4 and subsequent batches.

Based on the exception stack trace, it looks like in step 5, Spark has another 
thread #4 that serializes application objects asynchronously, so it causes 
random occurrences of ConcurrentModificationException, because the list of 
objects is being changed by Spark Thread #1 for the driver.

So the issue is not that my application "is modifying a collection 
asynchronously w.r.t. Spark" as Sean kept claiming. It is Spark's asynchronous 
operations among its own different threads within the same batch.

I understand Spark needs to serializes objects for check point purposes. 
However, since Spark controls all the threads and their synchronization, it is 
a Spark design's issue for the lack of synchronization between threads #1 and 

[jira] [Comment Edited] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-05 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193907#comment-16193907
 ] 

Michael N edited comment on SPARK-22163 at 10/6/17 12:18 AM:
-

Vadim Semenov and Steve Loughran, per your inquiries in ticket 
https://issues.apache.org/jira/browse/SPARK-21999, I am posting the reply here 
because this issue involves Spark's design and not necessarily its code 
implementation.

---

My application does not spin up its own thread. All the threads are controlled 
by Spark.

Batch interval = 5 seconds

Batch #3
1. Driver - Spark Thread #1 - starts batch #3 and blocks until all slave 
threads are done with this batch
2. Slave A - Spark Thread #2 takes 10 seconds to complete
3. Slave B - Spark Thread #3 takes 1 minutes to complete

4. Both thread 1 for the driver and thread 2 for Slave A do not jump ahead and 
process batch #4. Instead, they synchronize with  thread B until it is done.  
=> So there is synchronization among the threads within the same batch, and 
also batch to batch is synchronous.

5. After Spark Thread #3 is done, the driver does other processing to finish 
the current batch.  In my case, it updates a list of objects.

The above steps repeat for the next batch #4 and subsequent batches.

Based on the exception stack trace, it looks like in step 5, Spark has another 
thread #4 that serializes application objects asynchronously, so it causes 
random occurrences of ConcurrentModificationException, because the list of 
objects is being changed by Spark Thread #1 for the driver.

So the issue is not that my application "is modifying a collection 
asynchronously w.r.t. Spark" as Sean kept claiming. It is Spark's asynchronous 
operations among its own different threads within the same batch.

I understand Spark needs to serializes objects for check point purposes. 
However, since Spark controls all the threads and their synchronization, it is 
a Spark design's issue for the lack of synchronization between threads #1 and 
#4, that triggers ConcurrentModificationException. 

Further, even if the application does not modify its list of objects, in step 5 
the driver could be modifying multiple native objects say two integers.  In 
thread #1 the driver could have updated integer X and before it could update 
integer Y,  when Spark's  thread #4  asynchronous serializes the application 
objects. So the persisted serialized data does not match with the actual data.  
This resulted in a permutation of this issue with a false positive condition 
where the serialized checkpoint data has partially correct data.

One solution for both issues is to modify Spark's design and allow the 
serialization of application objects by Spark's  thread #4 to be configurable 
per application to be either asynchronous or synchronous with Spark's thread 
#1.  That way, it is up to individual applications to decide based on the 
nature of their business requirements and needed throughput.



was (Author: michaeln_apache):
Vadim Semenov and Steve Loughran, per your inquiries in ticket 
https://issues.apache.org/jira/browse/SPARK-21999, I am posting the reply here 
because this issue involves Spark's design and not necessarily its code 
implementation.

---

My application does not spin up its own thread. All the threads are controlled 
by Spark.

Batch interval = 5 seconds

Batch #3
1. Driver - Spark Thread #1 - starts batch #3 and blocks until all slave 
threads are done with this batch
2. Slave A - Spark Thread #2 takes 10 seconds to complete
3. Slave B - Spark Thread #3 takes 1 minutes to complete

4. Both thread 1 for the driver and thread 2 for Slave A do not jump ahead and 
process batch #4. Instead, they synchronize with  thread B until it is done.  
=> So there is synchronization among the threads within the same batch, and 
also batch to batch is synchronous.

5. After Spark Thread #3 is done, the driver does other processing to finish 
the current batch.  In my case, it updates a list of objects.

The above steps repeat for the next batch #4 and subsequent batches.

Based on the exception stack trace, it looks like in step 5, Spark has another 
thread #4 that serializes application objects asynchronously, so it causes 
random occurrences of ConcurrentModificationException, because the list of 
objects is being changed by Spark Thread #1 for the driver.

So the issue is not that my application "is modifying a collection 
asynchronously w.r.t. Spark" as Sean kept claiming. It is Spark's asynchronous 
operations among its own different threads within the same batch.

I understand Spark needs to serializes objects for check point purposes. 
However, since Spark controls all the threads and their synchronization, it is 
a Spark design's issue for the lack of synchronization between threads #1 and 
#4, that triggers ConcurrentModificationException. 

Further, even if the application does not 

[jira] [Updated] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-05 Thread Michael N (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael N updated SPARK-22163:
--
Description: 
The application objects can contain Lists and can be modified dynamically as 
well. However, the Spark Streaming framework asynchronously serializes the 
application's objects as the application runs. Therefore, it causes a random 
run-time exception on a List when the Spark Streaming framework happens to 
serialize the application's objects while the application is modifying a List in 
its own object.

In fact, there are multiple bugs reported about

Caused by: java.util.ConcurrentModificationException
at java.util.ArrayList.writeObject

that are permutations of the same root cause. The design issue of the Spark 
Streaming framework is that it does this serialization asynchronously. Instead, 
it should either

1. do this serialization synchronously (preferred, because it eliminates the 
issue completely), or

2. allow it to be configured per application whether to do this serialization 
synchronously or asynchronously, depending on the nature of each application.

Also, the Spark documentation should describe the conditions that trigger Spark 
to do this type of serialization asynchronously, so applications can work 
around them until a fix is provided.

===

Vadim Semenov and Steve Loughran, per your inquiries in ticket 
https://issues.apache.org/jira/browse/SPARK-21999, I am posting the reply here 
because this issue involves Spark's design and not necessarily its code 
implementation.

—

My application does not spin up its own threads. All the threads are controlled 
by Spark.

Batch interval = 5 seconds

Batch #3
1. Driver - Spark Thread #1 - starts batch #3 and blocks until all slave 
threads are done with this batch
2. Slave A - Spark Thread #2. Say it takes 10 seconds to complete
3. Slave B - Spark Thread #3. Say it takes 1 minute to complete

4. Both thread #1 for the driver and thread #2 for Slave A do not jump ahead 
and process batch #4. Instead, they wait for thread #3 until it is done. => So 
there is already synchronization among the threads within the same batch. Also, 
batch to batch is synchronous.

5. After Spark Thread #3 is done, the driver does other processing to finish 
the current batch. In my case, it updates a list of objects.

The above steps repeat for the next batch #4 and subsequent batches.

Based on the exception stack trace, it looks like in step 5 Spark has another 
thread #4 that serializes application objects asynchronously. So it causes 
random occurrences of ConcurrentModificationException, because the list of 
objects is being changed by Spark's own thread #1 for the driver.

So the issue is not that my application "is modifying a collection 
asynchronously w.r.t. Spark" as Sean kept claiming. Instead, it is Spark's 
asynchronous operations among its own different threads within the same batch 
that cause this issue.

I understand Spark needs to serialize objects for checkpointing purposes. 
However, since Spark controls all the threads and their synchronization, it is 
a Spark design issue that there is no synchronization between threads #1 and 
#4, which triggers the ConcurrentModificationException. That is the root cause 
of this issue.

Further, even if the application does not modify its list of objects, in step 5 
the driver could be modifying multiple native objects, say two integers. In 
thread #1 the driver could have updated integer X but not yet integer Y when 
Spark's thread #4 asynchronously serializes the application objects. So the 
persisted serialized data does not match the actual data. This results in a 
permutation of this issue with a false-positive condition where the serialized 
checkpoint data is only partially correct.

One solution for both issues is to modify Spark's design and allow the 
serialization of application objects by Spark's thread #4 to be configurable 
per application as either asynchronous or synchronous with Spark's thread #1. 
That way, it is up to individual applications to decide based on the nature of 
their business requirements and needed throughput.



  was:
The application objects can contain List and can be modified dynamically as 
well.   However, Spark Streaming framework asynchronously serializes the 
application's objects as the application runs.  Therefore, it causes random 
run-time exception on the List when Spark Streaming framework happens to 
serializes the application's objects while the application modifies a List in 
its own object.  

In fact, there are multiple bugs reported about

Caused by: java.util.ConcurrentModificationException
at java.util.ArrayList.writeObject

that are permutation of the same root cause. So the design issue of Spark 
streaming framework is that it should do this serialization asynchronously.  
Instead, it should either

1. do this serialization 

[jira] [Comment Edited] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-05 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193907#comment-16193907
 ] 

Michael N edited comment on SPARK-22163 at 10/5/17 11:47 PM:
-

Vadim Semenov and Steve Loughran, per your inquiries in ticket 
https://issues.apache.org/jira/browse/SPARK-21999, I am posting the reply here 
because this issue involves Spark's design and not necessarily its code 
implementation.

---

My application does not spin up its own thread. All the threads are controlled 
by Spark.

Batch interval = 5 seconds

Batch #3
1. Driver - Spark Thread #1 - starts batch #3 and blocks until all slave 
threads are done with this batch
2. Slave A - Spark Thread #2 takes 10 seconds to complete
3. Slave B - Spark Thread #3 takes 1 minutes to complete

4. Both thread 1 for the driver and thread 2 for Slave A do not jump ahead and 
process batch #4. Instead, they synchronize with  thread B until it is done.  
=> So there is synchronization among the threads within the same batch, and 
also batch to batch is synchronous.

5. After Spark Thread #3 is done, the driver does other processing to finish 
the current batch.  In my case, it updates a list of objects.

The above steps repeat for the next batch #4 and subsequent batches.

Based on the exception stack trace, it looks like in step 5, Spark has another 
thread #4 that serializes application objects asynchronously, so it causes 
random occurrences of ConcurrentModificationException, because the list of 
objects is being changed by Spark Thread #1 for the driver.

So the issue is not that my application "is modifying a collection 
asynchronously w.r.t. Spark" as Sean kept claiming. It is Spark's asynchronous 
operations among its own different threads within the same batch.

I understand Spark needs to serializes objects for check point purposes. 
However, since Spark controls all the threads and their synchronization, it is 
a Spark design's issue for the lack of synchronization between threads #1 and 
#4, that triggers ConcurrentModificationException. 

Further, even if the application does not modify its list of objects, in step 5 
the driver could be modifying multiple native objects, say two integers. In 
thread #1 the driver could have updated integer X, but not yet integer Y, when 
Spark's thread #4 asynchronously serializes the application objects. So the 
persisted serialized data does not match the actual data. This results in 
another permutation of this issue: a false-positive condition where the 
serialized checkpoint data is only partially correct.

One solution for both issues is to modify Spark's design to allow the 
serialization of application objects by Spark's thread #4 to be configurable 
per application as either asynchronous or synchronous with Spark's thread 
#1. That way, it is up to individual applications to decide based on the 
nature of their business requirements and needed throughput.
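
For readers following this thread, the failure mode described above can be reproduced outside Spark. Below is a minimal, self-contained sketch (an illustration only, not Spark code) in which one thread mutates a java.util.ArrayList while another thread serializes it; the two threads are hypothetical stand-ins for the "thread #1" and "thread #4" roles discussed in this comment, and the sizes and iteration counts are arbitrary.

{code:scala}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.util.{ArrayList, ConcurrentModificationException}

object ConcurrentSerializationDemo {
  def main(args: Array[String]): Unit = {
    val sharedList = new ArrayList[Integer]()
    var i = 0
    while (i < 100000) { sharedList.add(i); i += 1 }

    // Stand-in for "thread #1": keeps mutating the shared list, like a driver
    // updating its per-batch state.
    val mutator = new Thread(new Runnable {
      override def run(): Unit = {
        var j = 0
        while (j < 1000000) {
          sharedList.add(j)
          sharedList.remove(sharedList.size() - 1)
          j += 1
        }
      }
    })
    mutator.start()

    // Stand-in for "thread #4": serializes the same list concurrently, like an
    // asynchronous checkpoint. ArrayList.writeObject compares its modCount
    // before and after writing and throws if the list changed in between.
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(sharedList)
      out.close()
      println("No exception on this run (the race is timing dependent)")
    } catch {
      case e: ConcurrentModificationException => println(s"Reproduced: $e")
    } finally {
      mutator.join()
    }
  }
}
{code}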



was (Author: michaeln_apache):
Vadim Semenov and Steve Loughran, per your inquiries in ticket 
https://issues.apache.org/jira/browse/SPARK-21999, I am posting the reply here 
because this issue involves Spark's design and not necessarily its code 
implementation.

---

My application does not spin up its own thread. All the threads are controlled 
by Spark.

Batch interval = 5 seconds

Batch #3
1. Driver - Spark Thread #1 - starts batch #3 and blocks until all slave 
threads are done with this batch
2. Slave A - Spark Thread #2 takes 10 seconds to complete
3. Slave B - Spark Thread #3 takes 1 minutes to complete

4. Both thread 1 for the driver and thread 2 for Slave A do not jump ahead and 
process batch #4. Instead, they synchronize with  thread B until it is done.  
=> So there is synchronization among the threads within the same batch, and 
also batch to batch is synchronous.

5. After Spark Thread #3 is done, the driver does other processing to finish 
the current batch.  In my case, it updates a list of objects.

The above steps repeat for the next batch #4 and subsequent batches.

Based on the exception stack trace, it looks like in step 5, Spark has another 
thread #4 that serializes application objects asynchronously, so it causes 
random occurrences of ConcurrentModificationException, because the list of 
objects is being changed by Spark Thread #1 the driver.

So the issue is not that my application "is modifying a collection 
asynchronously w.r.t. Spark" as Sean kept claiming. It is Spark's asynchronous 
operations among its own different threads. 

I understand Spark needs to serializes objects for check point purposes. 
However, since Spark controls all the threads and their synchronization, it is 
a Spark design's issue for the lack of synchronization between threads #1 and 
#4, that triggers ConcurrentModificationException. 

Further, even if the application does not modify its list, in step 5 the 

[jira] [Comment Edited] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-05 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193907#comment-16193907
 ] 

Michael N edited comment on SPARK-22163 at 10/5/17 11:42 PM:
-

Vadim Semenov and Steve Loughran, per your inquiries in ticket 
https://issues.apache.org/jira/browse/SPARK-21999, I am posting the reply here 
because this issue involves Spark's design and not necessarily its code 
implementation.

---

My application does not spin up its own thread. All the threads are controlled 
by Spark.

Batch interval = 5 seconds

Batch #3
1. Driver - Spark Thread #1 - starts batch #3 and blocks until all slave 
threads are done with this batch
2. Slave A - Spark Thread #2 takes 10 seconds to complete
3. Slave B - Spark Thread #3 takes 1 minute to complete

4. Both thread 1 for the driver and thread 2 for Slave A do not jump ahead and 
process batch #4. Instead, they synchronize with  thread B until it is done.  
=> So there is synchronization among the threads within the same batch, and 
also batch to batch is synchronous.

5. After Spark Thread #3 is done, the driver does other processing to finish 
the current batch.  In my case, it updates a list of objects.

The above steps repeat for the next batch #4 and subsequent batches.

Based on the exception stack trace, it looks like in step 5, Spark has another 
thread #4 that serializes application objects asynchronously, so it causes 
random occurrences of ConcurrentModificationException, because the list of 
objects is being changed by Spark Thread #1 the driver.

So the issue is not that my application "is modifying a collection 
asynchronously w.r.t. Spark" as Sean kept claiming. It is Spark's asynchronous 
operations among its own different threads. 

I understand Spark needs to serialize objects for checkpointing purposes. 
However, since Spark controls all the threads and their synchronization, it is 
a Spark design issue that the lack of synchronization between threads #1 and 
#4 triggers ConcurrentModificationException. 

Further, even if the application does not modify its list, in step 5 the driver 
could be modifying multiple native objects, say two integers. The driver could 
have updated integer X, but not yet integer Y, when Spark's thread #4 
asynchronously serializes the application objects. So the persisted serialized 
data does not match the actual data, and a permutation of this issue is a 
false-positive condition with partially correct data.

One solution for both issues is to modify Spark's design to allow the 
serialization of application objects by Spark's thread #4 to be configurable 
per application as either asynchronous or synchronous. That way, it is up to 
individual applications to decide based on the nature of their business 
requirements and needed throughput.



was (Author: michaeln_apache):
Vadim Semenov and Steve Loughran, per your inquiries in ticket 
https://issues.apache.org/jira/browse/SPARK-21999, I am posting the reply here 
because this issue involves Spark's design and not necessarily its code 
implementation.

---

My application does not spin up its own thread. All the threads are controlled 
by Spark.

Batch interval = 5 seconds

Batch #3
1. Driver - Spark Thread #1 - starts batch #3 and blocks until all slave 
threads are done with this batch
2. Slave A - Spark Thread #2 takes 10 seconds to complete
3. Slave B - Spark Thread #3 takes 1 minutes to complete

4. Both thread 1 for the driver and thread 2 for Slave A do not jump ahead and 
process batch #4. Instead, they synchronize with  thread B until it is done.  
=> So there is synchronization among the threads within the same batch, and 
also batch to batch is synchronous.

5. After Spark Thread #3 is done, the driver does other processing to finish 
the current batch.  In my case, it updates a list of objects.

The above steps repeat for the next batch #4 and subsequent batches.

Based on the exception stack trace, it looks like in step 5, Spark has another 
thread #4 that serializes application objects asynchronously, so it causes 
random occurrences of ConcurrentModificationException, because the list of 
objects is being changed by Spark Thread #1 the driver.

So the issue is not that my application "is modifying a collection 
asynchronously w.r.t. Spark" as Sean kept claiming. It is Spark's asynchronous 
operations among its own different threads. 

I understand Spark needs to serializes objects for check point purposes. 
However, since Spark controls all the threads and their synchronization, it is 
a Spark design's issue for the lack of synchronization between threads #1 and 
#4, that triggers ConcurrentModificationException. 

Further, even if the application does not modify its list, in step 5 the driver 
could be modifying multiple native objects say two integers.  The driver could 
have updated integer X and before it 

[jira] [Commented] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-10-05 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193923#comment-16193923
 ] 

Michael N commented on SPARK-21999:
---

Vadim, it relates to serialization. I confirmed that it is due to a list of 
objects being modified and also serialized by Spark's own threads within the 
same batch. The details are posted in ticket 
https://issues.apache.org/jira/browse/SPARK-22163, because it involves Spark's 
design and not its code implementation.

Steve, I understand your points about checkpointing the application. This is why, 
in ticket https://issues.apache.org/jira/browse/SPARK-22163, I propose that we 
make Spark's serialization of application objects configurable to be either 
asynchronous or synchronous. That way, individual applications can decide based 
on the nature of their business requirements and needed throughput.
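
Until (and unless) such a configuration option exists, one application-side mitigation is to hold the mutable state in a java.util.concurrent.CopyOnWriteArrayList, whose serialization walks an immutable snapshot of its backing array and therefore cannot throw ConcurrentModificationException even if a concurrent checkpoint serializes it mid-update. The sketch below is illustrative only; the class and field names are hypothetical and not part of Spark's API.

{code:scala}
import java.util.concurrent.CopyOnWriteArrayList

// Hypothetical driver-side state holder.
class BatchState extends Serializable {
  // Copy-on-write: each mutation swaps in a new backing array, so a concurrent
  // serialization (e.g. an asynchronous checkpoint) sees a consistent snapshot.
  val recentIds = new CopyOnWriteArrayList[java.lang.Long]()

  def update(ids: Seq[Long]): Unit = {
    ids.foreach(id => recentIds.add(java.lang.Long.valueOf(id)))
    // Trim old entries; note each remove copies the array, so this is only
    // reasonable when the list is small or mutated infrequently.
    while (recentIds.size() > 10000) recentIds.remove(0)
  }
}
{code}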

> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch  and fetch the next batch, so it results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> 

[jira] [Commented] (SPARK-18131) Support returning Vector/Dense Vector from backend

2017-10-05 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193916#comment-16193916
 ] 

Xiao Li commented on SPARK-18131:
-

cc [~WeichenXu123] 

> Support returning Vector/Dense Vector from backend
> --
>
> Key: SPARK-18131
> URL: https://issues.apache.org/jira/browse/SPARK-18131
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Miao Wang
>
> For `spark.logit`, there is a `probabilityCol`, which is a vector in the 
> backend (Scala side). When we do collect(select(df, "probabilityCol")), the 
> backend returns the Java object handle (memory address). We need to implement 
> a method to convert a Vector/DenseVector column to an R vector, which can be 
> read in SparkR. It is a follow-up JIRA of adding `spark.logit`.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-05 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193907#comment-16193907
 ] 

Michael N commented on SPARK-22163:
---

Vadim Semenov and Steve Loughran, per your inquiries in ticket 
https://issues.apache.org/jira/browse/SPARK-21999, I am posting the reply here 
because this issue involves Spark's design and not necessarily its code 
implementation.

---

My application does not spin up its own thread. All the threads are controlled 
by Spark.

Batch interval = 5 seconds

Batch #3
1. Driver - Spark Thread #1 - starts batch #3 and blocks until all slave 
threads are done with this batch
2. Slave A - Spark Thread #2 takes 10 seconds to complete
3. Slave B - Spark Thread #3 takes 1 minute to complete

4. Both thread 1 for the driver and thread 2 for Slave A do not jump ahead and 
process batch #4. Instead, they synchronize with  thread B until it is done.  
=> So there is synchronization among the threads within the same batch, and 
also batch to batch is synchronous.

5. After Spark Thread #3 is done, the driver does other processing to finish 
the current batch.  In my case, it updates a list of objects.

The above steps repeat for the next batch #4 and subsequent batches.

Based on the exception stack trace, it looks like in step 5, Spark has another 
thread #4 that serializes application objects asynchronously, so it causes 
random occurrences of ConcurrentModificationException, because the list of 
objects is being changed by Spark Thread #1 the driver.

So the issue is not that my application "is modifying a collection 
asynchronously w.r.t. Spark" as Sean kept claiming. It is Spark's asynchronous 
operations among its own different threads. 

I understand Spark needs to serialize objects for checkpointing purposes. 
However, since Spark controls all the threads and their synchronization, it is 
a Spark design issue that the lack of synchronization between threads #1 and 
#4 triggers ConcurrentModificationException. 

Further, even if the application does not modify its list, in step 5 the driver 
could be modifying multiple native objects, say two integers. The driver could 
have updated integer X, but not yet integer Y, when Spark's thread #4 
asynchronously serializes the application objects. So the persisted serialized 
data does not match the actual data, and a permutation of this issue is a 
false-positive condition with partially correct data.

One solution is to modify Spark's design to allow the serialization of 
application objects by Spark's thread #4 to be configurable per application as 
either asynchronous or synchronous. That way, it is up to individual 
applications to decide based on the nature of their business requirements.


> Design Issue of Spark Streaming that Causes Random Run-time Exception
> -
>
> Key: SPARK-22163
> URL: https://issues.apache.org/jira/browse/SPARK-22163
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Structured Streaming
>Affects Versions: 2.2.0
> Environment: Spark Streaming
> Kafka
> Linux
>Reporter: Michael N
>
> The application objects can contain Lists and can be modified dynamically as 
> well. However, the Spark Streaming framework asynchronously serializes the 
> application's objects as the application runs. Therefore, it causes a random 
> run-time exception on the List when the Spark Streaming framework happens to 
> serialize the application's objects while the application modifies a List in 
> its own object.  
> In fact, there are multiple bugs reported about
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject
> that are permutations of the same root cause. So the design issue of the Spark 
> Streaming framework is that it does this serialization asynchronously. 
> Instead, it should either
> 1. do this serialization synchronously. This is preferred to eliminate the 
> issue completely.  Or
> 2. Allow it to be configured per application whether to do this serialization 
> synchronously or asynchronously, depending on the nature of each application.
> Also, Spark documentation should describe the conditions that trigger Spark 
> to do this type of serialization asynchronously, so the applications can work 
> around them until the fix is provided. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18131) Support returning Vector/Dense Vector from backend

2017-10-05 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193891#comment-16193891
 ] 

Miao Wang commented on SPARK-18131:
---

[~felixcheung] We got stuck at the data type definitions. There is a cycle in 
the dependencies for ML data types. A long time ago, I talked with [~smilegator] 
about this issue and he suggested moving the data types from the ML module to 
the parent module. But we didn't discuss details and there was no design 
document either. We need some updates on this issue to add this support in R.
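
As a stopgap while the R-side conversion is being designed, one way to make such a column consumable from SparkR today is to flatten the vector into a plain array column on the Scala side, which collect() already knows how to transfer. The snippet below is only a sketch; the `predictions` DataFrame and the column names are hypothetical.

{code:scala}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Turn a VectorUDT column into Array[Double] so that it round-trips to R
// as numeric vectors instead of an opaque Java object handle.
val vectorToArray = udf((v: Vector) => v.toArray)

val flattened = predictions.withColumn("probability_arr", vectorToArray(col("probability")))
{code}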

> Support returning Vector/Dense Vector from backend
> --
>
> Key: SPARK-18131
> URL: https://issues.apache.org/jira/browse/SPARK-18131
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Miao Wang
>
> For `spark.logit`, there is a `probabilityCol`, which is a vector in the 
> backend (Scala side). When we do collect(select(df, "probabilityCol")), the 
> backend returns the Java object handle (memory address). We need to implement 
> a method to convert a Vector/DenseVector column to an R vector, which can be 
> read in SparkR. It is a follow-up JIRA of adding `spark.logit`.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-10-05 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193840#comment-16193840
 ] 

Michael N edited comment on SPARK-21999 at 10/5/17 10:33 PM:
-

Steve, I'd keep personal opinions of another person separate from assessment of 
his professional work, even for open source projects. I understand that 
contributors to Apache open source projects are not paid. However, they should 
still follow best practices and do due diligence on tickets, as they would for 
their paid jobs.

The issue is that Sean used his position with Spark to keep closing tickets 
that he either does not understand fully or does not have the answers for. 
Note that there are many other closed tickets with different permutations of 
ConcurrentModificationException that other people submitted. So this ticket is 
not an isolated instance. Imagine he also did the same thing for tickets at 
his paid job with an employer. Would they let him do that?

For instance, for your paid project with your employer, you open a ticket for 
an issue you encounter, and he keeps closing it without understanding the issue 
or providing the answers to the questions in the ticket. Then he tries to block 
you by asking JIRA to stop you from posting the ticket. Would you be Ok with 
that ?

About your other points, I already modified my code to get around this issue. 
However, for the longer term, I think the Spark design that asynchronously 
serializes application objects for streaming applications, which run continuously 
from batch to batch, should be changed. That was why I created ticket 
https://issues.apache.org/jira/browse/SPARK-22163, so it could be discussed 
there. I am open to hearing explanations as to why the current design was done 
the way it is, which was why I posted these questions:

1. In the first place, why does Spark serialize the application objects 
asynchronously while the streaming application is running continuously from 
batch to batch ?

2. If Spark needs to do this type of serialization at all, why does it not do 
at the end of the batch synchronously ?

But Sean did not provide the answers and instead just kept closing that ticket. 
If he does not know the answers or relevant information for a ticket, he should 
let someone else who has that information answer it.


was (Author: michaeln_apache):
Steve, I'd keep personal opinions of another person separate from assessment of 
his professional work even for open source projects. I understand that 
contributors to Apache open source projects are not paid. However, they should 
still follow best practices and do due diligence on tickets as if they do for 
their paid jobs.

The issue is that Sean used his position with Spark to keep closing tickets 
that he either does not understand fully or does not have the answers for.  
Note that there are many other closed tickets with different permutation of 
ConcurrentModificationException that other people submitted.  So this ticket is 
not an isolated instance.  Image he also does the same thing for tickets with 
his paid job at an employer. Would they let him do that ? 

For instance, for your paid project with your employer, you open a ticket for 
an issue you encounter, and he keeps closing it without understanding the issue 
or providing the answers to the questions in the ticket. Then he tries to block 
you by asking JIRA to stop you from posting the ticket. Would you be Ok with 
that ?

About your other points, I already modified my code to get around this issue. 
However, for the longer term, I think for the Spark design that asynchronously 
serializes application objects for stream applications that runs continuously 
from batch to batch, that design should be changed. That was why I created 
ticket https://issues.apache.org/jira/browse/SPARK-22163 so it could be 
discussed there.  I am open to hearing explanations as to why the current 
design was done the way it is, which was why I posted questions about

1. In the first place, why does Spark serialize the application objects 
asynchronously while the streaming application is running continuously from 
batch to batch ?

2. If Spark needs to do this type of serialization at all, why does it not do 
at the end of the batch synchronously ?

But Sean did not provide the answers and instead just closed that ticket.  

> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it 

[jira] [Commented] (SPARK-22195) Add cosine similarity to org.apache.spark.ml.linalg.Vectors

2017-10-05 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193844#comment-16193844
 ] 

yuhao yang commented on SPARK-22195:


Thanks for the feedback.

I don't see how the existing implementations (RowMatrix or Word2Vec) can fulfill 
the two scenarios:
1. Compute cosine similarity between two arbitrary vectors.
2. Compute cosine similarity between one vector and a group of other vectors 
(usually candidates).

And again, not everyone using Spark ML knows how to implement cosine similarity.
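
For reference while the proposal is being discussed, here is a minimal sketch of what such helpers could look like on top of the existing public org.apache.spark.ml.linalg API. The method names mirror the interface proposed in the description below, but this is not an actual implementation, and the dense element-by-element dot product is only for brevity.

{code:scala}
import org.apache.spark.ml.linalg.{Vector, Vectors}

object CosineSimilarity {

  /** Cosine similarity of two vectors; defined as 0.0 if either norm is zero. */
  def cosineSimilarity(v1: Vector, v2: Vector): Double = {
    cosineSimilarity(v1, v2, Vectors.norm(v1, 2.0), Vectors.norm(v2, 2.0))
  }

  /** Variant that reuses precomputed L2 norms, e.g. for one-vs-many scoring. */
  def cosineSimilarity(v1: Vector, v2: Vector, norm1: Double, norm2: Double): Double = {
    require(v1.size == v2.size, s"Vector sizes differ: ${v1.size} vs ${v2.size}")
    // A sparse-aware dot product (iterating only active entries) would be
    // faster; a plain dense walk keeps the sketch short.
    var dot = 0.0
    var i = 0
    while (i < v1.size) { dot += v1(i) * v2(i); i += 1 }
    if (norm1 == 0.0 || norm2 == 0.0) 0.0 else dot / (norm1 * norm2)
  }
}
{code}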


> Add cosine similarity to org.apache.spark.ml.linalg.Vectors
> ---
>
> Key: SPARK-22195
> URL: https://issues.apache.org/jira/browse/SPARK-22195
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> https://en.wikipedia.org/wiki/Cosine_similarity:
> As the most important measure of similarity, I found it quite useful in some 
> image and NLP applications according to personal experience.
> Suggest to add function for cosine similarity in 
> org.apache.spark.ml.linalg.Vectors.
> Interface:
>   def cosineSimilarity(v1: Vector, v2: Vector): Double = ...
>   def cosineSimilarity(v1: Vector, v2: Vector, norm1: Double, norm2: Double): 
> Double = ...
> Appreciate suggestions and need green light from committers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-10-05 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193840#comment-16193840
 ] 

Michael N commented on SPARK-21999:
---

Steve, I'd keep personal opinions of another person separate from assessment of 
his professional work, even for open source projects. I understand that 
contributors to Apache open source projects are not paid. However, they should 
still follow best practices and do due diligence on tickets, as they would for 
their paid jobs.

The issue is that Sean used his position with Spark to keep closing tickets 
that he either does not understand fully or does not have the answers for. 
Note that there are many other closed tickets with different permutations of 
ConcurrentModificationException that other people submitted. So this ticket is 
not an isolated instance. Imagine he also did the same thing for tickets at 
his paid job with an employer. Would they let him do that?

For instance, for your paid project with your employer, you open a ticket for 
an issue you encounter, and he keeps closing it without understanding the issue 
or providing the answers to the questions in the ticket. Then he tries to block 
you by asking JIRA to stop you from posting the ticket. Would you be Ok with 
that ?

About your other points, I already modified my code to get around this issue. 
However, for the longer term, I think the Spark design that asynchronously 
serializes application objects for streaming applications, which run continuously 
from batch to batch, should be changed. That was why I created ticket 
https://issues.apache.org/jira/browse/SPARK-22163, so it could be discussed 
there. I am open to hearing explanations as to why the current design was done 
the way it is, which was why I posted these questions:

1. In the first place, why does Spark serialize the application objects 
asynchronously while the streaming application is running continuously from 
batch to batch ?

2. If Spark needs to do this type of serialization at all, why does it not do 
at the end of the batch synchronously ?

But Sean did not provide the answers and instead just closed that ticket.  

> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch  and fetch the next batch, so it results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at scala.Option.map(Option.scala:146)
>   at 
> 

[jira] [Comment Edited] (SPARK-21742) BisectingKMeans generate different models with/without caching

2017-10-05 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193691#comment-16193691
 ] 

Ilya Matiach edited comment on SPARK-21742 at 10/5/17 9:12 PM:
---

[~podongfeng] The test was just validating that the edge case was hit; even if 
it fails, the algorithm may be fine. For bisecting k-means, generating 2 or 3 
clusters is fine; please see the documentation here:

https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/BisectingKMeans.html

Specifically:
param: k the desired number of leaf clusters (default: 4). The actual number 
could be smaller if there are no divisible leaf clusters.

The fact that the test is failing means that caching the dataset is slightly 
changing the data representation, either the ordering of the rows or the exact 
values, in which case k-means may not be hitting the edge case in the test 
where there are no divisible leaf clusters.  This is totally fine, it just 
means that you shouldn't be writing such a test, or you should find a slightly 
different cached dataset that does hit the issue to validate that the bug is 
indeed fixed and bisecting k-means returns fewer than k clusters but does not 
error out (which it was incorrectly doing previously - failing with a cryptic 
error message).


was (Author: imatiach):
[~podongfeng] The test was just validating that the edge case was hit, even if 
it fails the algorithm may be fine.  For bisecting k-means generating 2 or 3 
clusters is fine, please see documentation here:

https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/BisectingKMeans.html

Specifically:
param: k the desired number of leaf clusters (default: 4). The actual number 
could be smaller if there are no divisible leaf clusters.

The fact that the test is failing means that caching the dataset is slightly 
changing the data representation, either the ordering of the rows or the exact 
values, in which case k-means may not be hitting the edge case in the test 
where there are no divisible leaf clusters.  This is totally fine, it just 
means that you shouldn't be writing such a test, or you should find a slightly 
different cached dataset that does hit the issue to validating that the bug is 
indeed fixed and bisecting k-means returns fewer than k clusters but does not 
error out (which it was incorrectly doing previously - failing with a cryptic 
error message).

> BisectingKMeans generate different models with/without caching
> --
>
> Key: SPARK-21742
> URL: https://issues.apache.org/jira/browse/SPARK-21742
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>
> I found that {{BisectingKMeans}} will generate different models depending on 
> whether the input is cached or not.
> Using the same dataset as in {{BisectingKMeansSuite}}, we can find that if we 
> cache the input, the number of centers changes from 2 to 3.
> So it looks like a potential bug.
> {code}
> import org.apache.spark.ml.param.ParamMap
> import org.apache.spark.sql.Dataset
> import org.apache.spark.ml.clustering._
> import org.apache.spark.ml.linalg._
> import scala.util.Random
> case class TestRow(features: org.apache.spark.ml.linalg.Vector)
> val rows = 10
> val dim = 1000
> val seed = 42
> val nnz = 130
> val bkm = new 
> BisectingKMeans().setK(5).setMinDivisibleClusterSize(4).setMaxIter(4).setSeed(123)
> val random = new Random(seed)
> val rdd = sc.parallelize(1 to rows).map(i => Vectors.sparse(dim, 
> random.shuffle(0 to dim - 1).slice(0, nnz).sorted.toArray, 
> Array.fill(nnz)(random.nextDouble()))).map(v => new TestRow(v))
> val sparseDataset = spark.createDataFrame(rdd)
> scala> bkm.fit(sparseDataset).clusterCenters
> 17/08/16 17:12:28 WARN BisectingKMeans: The input RDD 579 is not directly 
> cached, which may hurt performance if its parent RDDs are also not cached.
> res22: Array[org.apache.spark.ml.linalg.Vector] = 
> Array([0.0,0.0,0.0,0.0,0.0,0.0,0.3081569145071915,0.0,0.0,0.0,0.0,0.1875176493190393,0.0,0.0,0.0,0.33856517726920116,0.0,0.15290274761955236,0.0,0.10820818064086901,0.0,0.0,0.5987249128746422,0.0,0.0,0.3563390364518392,0.0,0.5019914247361699,0.0,0.08711412551574785,0.09199053071837167,0.05749771404790841,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5209441786832834,0.0,0.2350595158678447,0.0,0.0,0.0,0.0,0.0,0.0,0.3041334669892575,0.0,0.0,0.32422664760898434,0.0,0.2454271812974,0.0,0.0,0.06846136418797384,0.0,0.0,0.19556839035017104,0.0,0.0,0.08436120694800427,0.0,0.0,0.0,0.30542501045554465,0.0,0.0,0.0,0.16185204843664616,0.2800921624973247,0.0,0.45459861318444555,0.0,0.0,0.0,0.26222502250076374,0.5235099131919367,0.0,0.0,0
> scala> bkm.fit(sparseDataset).clusterCenters.length
> 17/08/16 17:12:36 WARN 

[jira] [Commented] (SPARK-21742) BisectingKMeans generate different models with/without caching

2017-10-05 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193691#comment-16193691
 ] 

Ilya Matiach commented on SPARK-21742:
--

[~podongfeng] The test was just validating that the edge case was hit; even if 
it fails, the algorithm may be fine. For bisecting k-means, generating 2 or 3 
clusters is fine; please see the documentation here:

https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/BisectingKMeans.html

Specifically:
param: k the desired number of leaf clusters (default: 4). The actual number 
could be smaller if there are no divisible leaf clusters.

The fact that the test is failing means that caching the dataset is slightly 
changing the data representation, either the ordering of the rows or the exact 
values, in which case k-means may not be hitting the edge case in the test 
where there are no divisible leaf clusters. This is totally fine; it just 
means that you shouldn't be writing such a test, or you should find a slightly 
different cached dataset that does hit the issue, to validate that the bug is 
indeed fixed and bisecting k-means returns fewer than k clusters but does not 
error out (which it was incorrectly doing previously - failing with a cryptic 
error message).

> BisectingKMeans generate different models with/without caching
> --
>
> Key: SPARK-21742
> URL: https://issues.apache.org/jira/browse/SPARK-21742
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>
> I found that {{BisectingKMeans}} will generate different models depending on 
> whether the input is cached or not.
> Using the same dataset as in {{BisectingKMeansSuite}}, we can find that if we 
> cache the input, the number of centers changes from 2 to 3.
> So it looks like a potential bug.
> {code}
> import org.apache.spark.ml.param.ParamMap
> import org.apache.spark.sql.Dataset
> import org.apache.spark.ml.clustering._
> import org.apache.spark.ml.linalg._
> import scala.util.Random
> case class TestRow(features: org.apache.spark.ml.linalg.Vector)
> val rows = 10
> val dim = 1000
> val seed = 42
> val nnz = 130
> val bkm = new 
> BisectingKMeans().setK(5).setMinDivisibleClusterSize(4).setMaxIter(4).setSeed(123)
> val random = new Random(seed)
> val rdd = sc.parallelize(1 to rows).map(i => Vectors.sparse(dim, 
> random.shuffle(0 to dim - 1).slice(0, nnz).sorted.toArray, 
> Array.fill(nnz)(random.nextDouble()))).map(v => new TestRow(v))
> val sparseDataset = spark.createDataFrame(rdd)
> scala> bkm.fit(sparseDataset).clusterCenters
> 17/08/16 17:12:28 WARN BisectingKMeans: The input RDD 579 is not directly 
> cached, which may hurt performance if its parent RDDs are also not cached.
> res22: Array[org.apache.spark.ml.linalg.Vector] = 
> Array([0.0,0.0,0.0,0.0,0.0,0.0,0.3081569145071915,0.0,0.0,0.0,0.0,0.1875176493190393,0.0,0.0,0.0,0.33856517726920116,0.0,0.15290274761955236,0.0,0.10820818064086901,0.0,0.0,0.5987249128746422,0.0,0.0,0.3563390364518392,0.0,0.5019914247361699,0.0,0.08711412551574785,0.09199053071837167,0.05749771404790841,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5209441786832834,0.0,0.2350595158678447,0.0,0.0,0.0,0.0,0.0,0.0,0.3041334669892575,0.0,0.0,0.32422664760898434,0.0,0.2454271812974,0.0,0.0,0.06846136418797384,0.0,0.0,0.19556839035017104,0.0,0.0,0.08436120694800427,0.0,0.0,0.0,0.30542501045554465,0.0,0.0,0.0,0.16185204843664616,0.2800921624973247,0.0,0.45459861318444555,0.0,0.0,0.0,0.26222502250076374,0.5235099131919367,0.0,0.0,0
> scala> bkm.fit(sparseDataset).clusterCenters.length
> 17/08/16 17:12:36 WARN BisectingKMeans: The input RDD 667 is not directly 
> cached, which may hurt performance if its parent RDDs are also not cached.
> res23: Int = 2
> scala> sparseDataset.persist()
> res24: sparseDataset.type = [features: vector]
> scala> bkm.fit(sparseDataset).clusterCenters
> 17/08/16 17:14:35 WARN BisectingKMeans: The input RDD 806 is not directly 
> cached, which may hurt performance if its parent RDDs are also not cached.
> res26: Array[org.apache.spark.ml.linalg.Vector] = 
> 

[jira] [Commented] (SPARK-16473) BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key not found

2017-10-05 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193675#comment-16193675
 ] 

Ilya Matiach commented on SPARK-16473:
--

[~podongfeng] interesting - it looks like the dataset representation is somehow 
changing when it is cached?  My guess is that the row order may be changing or 
the numeric values may be changing?  The test failure itself is ok if the 
number of clusters is equal to k (which is actually perfectly fine for the 
algorithm), it just means that the dataset was not generated correctly to hit 
the very special edge case I was looking for, where one cluster is empty after 
a split in bisecting k-means.  I can't seem to see the test failure error 
message in your PR, could you run another build and post it here?  We may need 
to add some debugging/print statements everywhere to determine how the data is 
changing when you cache it - this doesn't mean there is any bug in the 
algorithm, it just means the test needs to be changed so that the test data, 
even after caching, is the same as the original one.
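
Before instrumenting the algorithm itself, a quick way to check whether caching changes the row ordering or the values is to compare the collected rows of the cached and uncached DataFrame directly. A minimal sketch, assuming the `sparseDataset` from the SPARK-21742 reproduction quoted elsewhere in this digest is in scope:

{code:scala}
// Collect the same dataset before and after persisting and compare row by row.
// A difference in ordering or in the exact Double values would explain why the
// cached run no longer hits the "no divisible leaf clusters" edge case.
val before = sparseDataset.collect()
sparseDataset.persist()
sparseDataset.count()   // force materialization of the cache
val after = sparseDataset.collect()

val sameLength = before.length == after.length
val sameOrderAndValues = sameLength && before.zip(after).forall { case (a, b) => a == b }
println(s"same length: $sameLength, same order and values: $sameOrderAndValues")
{code}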

> BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key 
> not found
> --
>
> Key: SPARK-16473
> URL: https://issues.apache.org/jira/browse/SPARK-16473
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.1, 2.0.0
> Environment: AWS EC2 linux instance. 
>Reporter: Alok Bhandari
>Assignee: Ilya Matiach
> Fix For: 2.1.1, 2.2.0
>
>
> Hello,
> I am using Apache Spark 1.6.1. 
> I am executing the bisecting k-means algorithm on a specific dataset.
> Dataset details: 
> K=100,
> input vector = 100K*100K,
> memory assigned: 16GB per node,
> number of nodes = 2.
> Till K=75 it is working fine, but when I set K=100, it fails with 
> java.util.NoSuchElementException: key not found. 
> *I suspect it is failing because of a lack of resources, but somehow the 
> exception does not convey why this Spark job failed.* 
> Please can someone point me to the root cause of this exception and why it is 
> failing. 
> This is the exception stack-trace:- 
> {code}
> java.util.NoSuchElementException: key not found: 166 
> at scala.collection.MapLike$class.default(MapLike.scala:228) 
> at scala.collection.AbstractMap.default(Map.scala:58) 
> at scala.collection.MapLike$class.apply(MapLike.scala:141) 
> at scala.collection.AbstractMap.apply(Map.scala:58) 
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply$mcDJ$sp(BisectingKMeans.scala:338)
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
> at 
> scala.collection.TraversableOnce$$anonfun$minBy$1.apply(TraversableOnce.scala:231)
>  
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>  
> at scala.collection.immutable.List.foldLeft(List.scala:84) 
> at 
> scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:125)
>  
> at scala.collection.immutable.List.reduceLeft(List.scala:84) 
> at 
> scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:231) 
> at scala.collection.AbstractTraversable.minBy(Traversable.scala:105) 
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:337)
>  
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:334)
>  
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389) 
> {code}
> The issue is that it is failing but not giving any explicit message as to why 
> it failed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22210) Online LDA variationalTopicInference should use random seed to have stable behavior

2017-10-05 Thread yuhao yang (JIRA)
yuhao yang created SPARK-22210:
--

 Summary: Online LDA variationalTopicInference  should use random 
seed to have stable behavior
 Key: SPARK-22210
 URL: https://issues.apache.org/jira/browse/SPARK-22210
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang
Priority: Minor


https://github.com/apache/spark/blob/16fab6b0ef3dcb33f92df30e17680922ad5fb672/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L582

The Gamma distribution sampling should use the random seed to have consistent behavior.
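
As an illustration of the kind of change being proposed (not the actual patch), Breeze's Gamma sampler can be tied to a fixed seed through a RandBasis, so repeated runs draw the same values. This assumes a Breeze version that provides RandBasis.withSeed; the shape and scale values are placeholders.

{code:scala}
import breeze.stats.distributions.{Gamma, RandBasis}

// A seeded RandBasis makes every draw below reproducible from run to run.
val seed = 12345
val randBasis: RandBasis = RandBasis.withSeed(seed)
val gamma = new Gamma(shape = 100.0, scale = 1.0 / 100.0)(randBasis)

val draws = gamma.sample(5)
println(draws)   // identical sequence on every run with the same seed
{code}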



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22188) Add defense against Cross-Site Scripting, MIME-sniffing and MitM attack

2017-10-05 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-22188:
--
Affects Version/s: 2.0.2
   2.1.1

> Add defense against Cross-Site Scripting, MIME-sniffing and MitM attack
> ---
>
> Key: SPARK-22188
> URL: https://issues.apache.org/jira/browse/SPARK-22188
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Krishna Pandey
>Priority: Minor
>  Labels: security
>
> Below HTTP Response headers can be added to improve security.
> The HTTP *Strict-Transport-Security* response header (often abbreviated as 
> HSTS) is a security feature that lets a web site tell browsers that it should 
> only be communicated with using HTTPS, instead of using HTTP.
> *Note:* The Strict-Transport-Security header is ignored by the browser when 
> your site is accessed using HTTP; this is because an attacker may intercept 
> HTTP connections and inject the header or remove it. When your site is 
> accessed over HTTPS with no certificate errors, the browser knows your site 
> is HTTPS capable and will honor the Strict-Transport-Security header.
> *An example scenario*
> You log into a free WiFi access point at an airport and start surfing the 
> web, visiting your online banking service to check your balance and pay a 
> couple of bills. Unfortunately, the access point you're using is actually a 
> hacker's laptop, and they're intercepting your original HTTP request and 
> redirecting you to a clone of your bank's site instead of the real thing. Now 
> your private data is exposed to the hacker.
> Strict Transport Security resolves this problem; as long as you've accessed 
> your bank's web site once using HTTPS, and the bank's web site uses Strict 
> Transport Security, your browser will know to automatically use only HTTPS, 
> which prevents hackers from performing this sort of man-in-the-middle attack.
> *Syntax:*
> Strict-Transport-Security: max-age=<expire-time>
> Strict-Transport-Security: max-age=<expire-time>; includeSubDomains
> Strict-Transport-Security: max-age=<expire-time>; preload
> Read more at 
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Strict-Transport-Security
> The HTTP *X-XSS-Protection* response header is a feature of Internet 
> Explorer, Chrome and Safari that stops pages from loading when they detect 
> reflected cross-site scripting (XSS) attacks.
> *Syntax:*
> X-XSS-Protection: 0
> X-XSS-Protection: 1
> X-XSS-Protection: 1; mode=block
> X-XSS-Protection: 1; report=<reporting-uri>
> Read more at 
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-XSS-Protection
> The HTTP *X-Content-Type-Options* response header is used to protect against 
> MIME sniffing vulnerabilities. These vulnerabilities can occur when a website 
> allows users to upload content to a website however the user disguises a 
> particular file type as something else. This can give them the opportunity to 
> perform cross-site scripting and compromise the website. Read more at 
> https://www.keycdn.com/support/x-content-type-options/ and 
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Content-Type-Options
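
To make the proposal concrete, here is a minimal sketch of how such response headers could be attached through a standard servlet filter, since Spark's web UI is servlet based. The filter name, the max-age value, and the wiring into the UI are illustrative assumptions, not the actual patch.

{code:scala}
import javax.servlet._
import javax.servlet.http.HttpServletResponse

// Illustrative filter that adds the three defensive headers described above.
class SecurityHeadersFilter extends Filter {
  override def init(config: FilterConfig): Unit = {}

  override def doFilter(req: ServletRequest, res: ServletResponse, chain: FilterChain): Unit = {
    res match {
      case http: HttpServletResponse =>
        // Only honored over HTTPS; one year, covering subdomains.
        http.setHeader("Strict-Transport-Security", "max-age=31536000; includeSubDomains")
        // Ask browsers to block the page when reflected XSS is detected.
        http.setHeader("X-XSS-Protection", "1; mode=block")
        // Disable MIME sniffing of responses.
        http.setHeader("X-Content-Type-Options", "nosniff")
      case _ => // non-HTTP response: nothing to add
    }
    chain.doFilter(req, res)
  }

  override def destroy(): Unit = {}
}
{code}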



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22209) PySpark does not recognize imports from submodules

2017-10-05 Thread Joel Croteau (JIRA)
Joel Croteau created SPARK-22209:


 Summary: PySpark does not recognize imports from submodules
 Key: SPARK-22209
 URL: https://issues.apache.org/jira/browse/SPARK-22209
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.0
 Environment: Anaconda 4.4.0, Python 3.6, Hadoop 2.7, CDH 5.3.3, JDK 
1.8, Centos 6
Reporter: Joel Croteau
Priority: Minor


Using submodule syntax inside a PySpark job seems to create issues. For 
example, the following:

{code:python}
import scipy.sparse
from pyspark import SparkContext, SparkConf


def do_stuff(x):
y = scipy.sparse.dok_matrix((1, 1))
y[0, 0] = x
return y[0, 0]


def init_context():
conf = SparkConf().setAppName("Spark Test")
sc = SparkContext(conf=conf)
return sc


def main():
sc = init_context()
data = sc.parallelize([1, 2, 3, 4])
output_data = data.map(do_stuff)
print(output_data.collect())


__name__ == '__main__' and main()
{code}
produces this error:
{noformat}
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
at 
org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:458)
at 
org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent 
call last):
  File 
"/home/matt/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py",
 line 177, in main
process()
  File 
"/home/matt/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py",
 line 172, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/home/matt/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py",
 line 268, in dump_stream
vs = list(itertools.islice(iterator, batch))
  File "/home/jcroteau/is/pel_selection/test_sparse.py", line 6, in dostuff
y = scipy.sparse.dok_matrix((1, 1))
AttributeError: module 'scipy' has no attribute 'sparse'

at 

[jira] [Updated] (SPARK-22200) Kinesis Receivers stops if Kinesis stream was re-sharded

2017-10-05 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-22200:
-
Component/s: (was: Spark Core)
 DStreams

> Kinesis Receivers stops if Kinesis stream was re-sharded
> 
>
> Key: SPARK-22200
> URL: https://issues.apache.org/jira/browse/SPARK-22200
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Alex Mikhailau
>Priority: Critical
>
> Seeing
> {noformat}
> Cannot find the shard given the shardId shardId-4454
> Cannot get the shard for this ProcessTask, so duplicate KPL user records in 
> the event of resharding will not be dropped during deaggregation of Amazon 
> Kinesis records.
> {noformat}
> after Kinesis stream re-sharding, and the receivers stop working altogether.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21926) Some transformers in spark.ml.feature fail when trying to transform streaming dataframes

2017-10-05 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193392#comment-16193392
 ] 

Bago Amirbekian commented on SPARK-21926:
-

[~mslipper] The trickiest thing about 1 (b) is knowing how to test that it 
won't change behaviour. I'd like to run this past some folks with more MLlib 
experience to see if there are any obvious issues with this approach that we 
haven't considered.

> Some transformers in spark.ml.feature fail when trying to transform streaming 
> dataframes
> 
>
> Key: SPARK-21926
> URL: https://issues.apache.org/jira/browse/SPARK-21926
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>
> We've run into a few cases where ML components don't play nice with streaming 
> dataframes (for prediction). This ticket is meant to help aggregate these 
> known cases in one place and provide a place to discuss possible fixes.
> Failing cases:
> 1) VectorAssembler where one of the inputs is a VectorUDT column with no 
> metadata.
> Possible fixes:
> a) Re-design vectorUDT metadata to support missing metadata for some 
> elements. (This might be a good thing to do anyways SPARK-19141)
> b) drop metadata in streaming context.
> 2) OneHotEncoder where the input is a column with no metadata.
> Possible fixes:
> a) Make OneHotEncoder an estimator (SPARK-13030).
> b) Allow user to set the cardinality of OneHotEncoder.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22077) RpcEndpointAddress fails to parse spark URL if it is an ipv6 address.

2017-10-05 Thread Sayat Satybaldiyev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193391#comment-16193391
 ] 

Sayat Satybaldiyev commented on SPARK-22077:


Yes, when I enclose the IPv6 address in square brackets, the URL parses without 
any errors.
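
For illustration, here is a quick Scala check with plain java.net.URI, which is 
what the Spark URL parsing builds on (treat the exact failure mode inside 
RpcEndpointAddress as an assumption on my part); it shows the same host/port 
behaviour:

{code}
import java.net.URI

// Unbracketed IPv6 literal: the authority cannot be split into host and port
val bad = new URI("spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243")
println(s"host=${bad.getHost} port=${bad.getPort}")    // host=null port=-1

// Bracketed IPv6 literal: host and port are recognised
val good = new URI("spark://HeartbeatReceiver@[2401:db00:2111:40a1:face:0:21:0]:35243")
println(s"host=${good.getHost} port=${good.getPort}")  // host=[2401:db00:2111:40a1:face:0:21:0] port=35243
{code}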

> RpcEndpointAddress fails to parse spark URL if it is an ipv6 address.
> -
>
> Key: SPARK-22077
> URL: https://issues.apache.org/jira/browse/SPARK-22077
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.0
>Reporter: Eric Vandenberg
>Priority: Minor
>
> RpcEndpointAddress fails to parse spark URL if it is an ipv6 address.
> For example, 
> sparkUrl = "spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243"
> is parsed as:
> host = null
> port = -1
> name = null
> While sparkUrl = spark://HeartbeatReceiver@localhost:55691 is parsed properly.
> This is happening on our production machines and causing spark to not start 
> up.
> org.apache.spark.SparkException: Invalid Spark URL: 
> spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243
>   at 
> org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:65)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:133)
>   at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
>   at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
>   at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32)
>   at org.apache.spark.executor.Executor.(Executor.scala:121)
>   at 
> org.apache.spark.scheduler.local.LocalEndpoint.(LocalSchedulerBackend.scala:59)
>   at 
> org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)
>   at org.apache.spark.SparkContext.(SparkContext.scala:507)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2283)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:833)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:825)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:825)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21871) Check actual bytecode size when compiling generated code

2017-10-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193390#comment-16193390
 ] 

Apache Spark commented on SPARK-21871:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/19440

> Check actual bytecode size when compiling generated code
> 
>
> Key: SPARK-21871
> URL: https://issues.apache.org/jira/browse/SPARK-21871
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Critical
> Fix For: 2.3.0
>
>
> In SPARK-21603, we added code to give up code compilation and use interpreted 
> execution in SparkPlan if the number of lines of the generated functions goes 
> over maxLinesPerFunction. But we already have code to collect metrics for the 
> compiled bytecode size in the `CodeGenerator` object, so I think we could 
> easily reuse that code for this purpose.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator

2017-10-05 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193389#comment-16193389
 ] 

Bago Amirbekian commented on SPARK-13030:
-

Just so I'm clear, does multi-column in this context mean applying the 
one-hot encoder to each column and then joining the resulting vectors?

How do you all feel about giving the new OneHotEncoder the same `handleInvalid` 
semantics as StringIndexer?
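
For reference, a toy spark-shell sketch of the handleInvalid behaviour 
StringIndexer already offers (the data is made up; the question above is 
whether the new OneHotEncoder estimator should mirror these semantics):

{code}
import org.apache.spark.ml.feature.StringIndexer

val train = Seq("a", "b", "c").toDF("cat")
val test  = Seq("a", "d").toDF("cat")        // "d" never appears in the training data

val model = new StringIndexer()
  .setInputCol("cat")
  .setOutputCol("catIdx")
  .setHandleInvalid("keep")                  // also supports "error" (default) and "skip"
  .fit(train)

model.transform(test).show()                 // "d" is kept and assigned the extra index 3.0
{code}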

> Change OneHotEncoder to Estimator
> -
>
> Key: SPARK-13030
> URL: https://issues.apache.org/jira/browse/SPARK-13030
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Wojciech Jurczyk
>
> OneHotEncoder should be an Estimator, just like in scikit-learn 
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
> In its current form, it is impossible to use when number of categories is 
> different between training dataset and test dataset.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20055) Documentation for CSV datasets in SQL programming guide

2017-10-05 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193355#comment-16193355
 ] 

Andrew Ash commented on SPARK-20055:


What I would find most useful is a list of available options and their 
behaviors, like https://github.com/databricks/spark-csv#features

Even though that project has been inlined into Apache Spark, that GitHub page 
is still the best reference for the CSV options.
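
For example, a short snippet along these lines (placeholder path; shown in 
Scala), together with a table of the reader options, is the kind of thing the 
CSV section could carry:

{code}
val people = spark.read
  .option("header", "true")        // use the first line as column names
  .option("inferSchema", "true")   // infer column types instead of reading everything as strings
  .option("sep", ",")              // field delimiter
  .csv("/path/to/people.csv")      // placeholder path

people.printSchema()
{code}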

> Documentation for CSV datasets in SQL programming guide
> ---
>
> Key: SPARK-20055
> URL: https://issues.apache.org/jira/browse/SPARK-20055
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> I guess things commonly used and important are documented there rather than 
> documenting everything and every option in the programming guide - 
> http://spark.apache.org/docs/latest/sql-programming-guide.html.
> It seems JSON datasets 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets 
> are documented whereas CSV datasets are not. 
> Nowadays, they are pretty similar in APIs and options. Some options, such as 
> {{wholeFile}}, are notable for both. Moreover, several options such as 
> {{inferSchema}} and {{header}} are important in CSV because they affect the 
> types and column names of the data.
> In that sense, I think we had better document CSV datasets with some examples 
> too, because reading CSV is a pretty common use case.
> Also, I think we could also leave some pointers for options of API 
> documentations for both (rather than duplicating the documentation).
> So, my suggestion is,
> - Add CSV Datasets section.
> - Add links for options for both JSON and CSV that point each API 
> documentation
> - Fix trivial minor fixes together in both sections.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15682) Hive partition write looks for root hdfs folder for existence

2017-10-05 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-15682:
--
Summary: Hive partition write looks for root hdfs folder for existence  
(was: Hive ORC partition write looks for root hdfs folder for existence)

> Hive partition write looks for root hdfs folder for existence
> -
>
> Key: SPARK-15682
> URL: https://issues.apache.org/jira/browse/SPARK-15682
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Dipankar
>
> Scenario:
> I am using the program below to create a new partition based on the current 
> date, which signifies the run date.
> However, it fails saying that the HDFS folder already exists; it checks the 
> root folder, not the new partition value.
> Does the partitionBy clause actually not check the Hive metastore or the 
> folder down to proc_date=<value>? Is it just a way to create folders based on 
> the partition key, not related to Hive partitions in any way?
> Alternatively, should I use
> result.write.format("orc").save("test.sms_outbound_view_orc/proc_date=2016-05-30")
> to achieve the result?
> But this will not update the Hive metastore with the new partition details.
> Is Spark's ORC support not equivalent to the HCatStorer API?
> My Hive table is built with proc_date as the partition column.
> Source code :
> result.registerTempTable("result_tab")
> val result_partition = sqlContext.sql("FROM result_tab select 
> *,'"+curr_date+"' as proc_date")
> result_partition.write.format("orc").partitionBy("proc_date").save("test.sms_outbound_view_orc")
> Exception
> 16/05/31 15:57:34 INFO ParseDriver: Parsing command: FROM result_tab select 
> *,'2016-05-31' as proc_date
> 16/05/31 15:57:34 INFO ParseDriver: Parse Completed
> Exception in thread "main" org.apache.spark.sql.AnalysisException: path 
> hdfs://hdpprod/user/dipankar.ghosal/test.sms_outbound_view_orc already 
> exists.;
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
>   at SampleApp$.main(SampleApp.scala:31)
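
A minimal sketch of the usual workaround for the quoted report, reusing its 
names and paths (the default SaveMode is ErrorIfExists, which is why the 
existing root directory aborts the write):

{code}
// Append so the pre-existing root directory does not abort the write
result_partition.write
  .format("orc")
  .mode("append")
  .partitionBy("proc_date")
  .save("test.sms_outbound_view_orc")

// Note: a path-based save still does not register the new partition in the Hive
// metastore; for that, write into the Hive table (e.g. insertInto with dynamic
// partitioning enabled) or add the partition with ALTER TABLE ... ADD PARTITION.
{code}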



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14286) Empty ORC table join throws exception

2017-10-05 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-14286:
--
Component/s: SQL

> Empty ORC table join throws exception
> -
>
> Key: SPARK-14286
> URL: https://issues.apache.org/jira/browse/SPARK-14286
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>Priority: Minor
>
> When joining with an empty ORC table, Spark throws the following exception. 
> {noformat}
> java.sql.SQLException: java.lang.IllegalArgumentException: orcFileOperator: 
> path /apps/hive/warehouse/test.db/table does not have valid orc files 
> matching the pattern
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14286) Empty ORC table join throws exception

2017-10-05 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193344#comment-16193344
 ] 

Dongjoon Hyun commented on SPARK-14286:
---

Hi, [~rajesh.balamohan].
How can I reproduce this? When I tested the following in Spark 1.6.3, 2.0.2, 
2.1.1, and 2.2.0, it passed in all of them.

{code}
scala> sql("create table x(a int) stored as orc")
res4: org.apache.spark.sql.DataFrame = [result: string]

scala> sql("create table y(a int) stored as orc")
res5: org.apache.spark.sql.DataFrame = [result: string]

scala> sql("select * from x, y where x.a=y.a").show
+---+---+
|  a|  a|
+---+---+
+---+---+
{code}

> Empty ORC table join throws exception
> -
>
> Key: SPARK-14286
> URL: https://issues.apache.org/jira/browse/SPARK-14286
> Project: Spark
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Priority: Minor
>
> When joining with an empty ORC table, Spark throws the following exception. 
> {noformat}
> java.sql.SQLException: java.lang.IllegalArgumentException: orcFileOperator: 
> path /apps/hive/warehouse/test.db/table does not have valid orc files 
> matching the pattern
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17047) Spark 2 cannot create table when CLUSTERED.

2017-10-05 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-17047.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

This is resolved by SPARK-17729.

{code}
scala> sql("""
 | CREATE TABLE t2(
 | ID INT
 | , CLUSTERED INT
 | , SCATTERED INT
 | , RANDOMISED INT
 | , RANDOM_STRING VARCHAR(50)
 | , SMALL_VC VARCHAR(10)
 | , PADDING VARCHAR(10)
 | )
 | CLUSTERED BY (ID) INTO 256 BUCKETS
 | STORED AS ORC
 | TBLPROPERTIES ( "orc.compress"="SNAPPY",
 | "orc.create.index"="true",
 | "orc.bloom.filter.columns"="ID",
 | "orc.bloom.filter.fpp"="0.05",
 | "orc.stripe.size"="268435456",
 | "orc.row.index.stride"="1" )
 | """)

scala> spark.version
res3: String = 2.3.0-SNAPSHOT
{code}

> Spark 2 cannot create table when CLUSTERED.
> ---
>
> Key: SPARK-17047
> URL: https://issues.apache.org/jira/browse/SPARK-17047
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.1, 2.2.0
>Reporter: Dr Mich Talebzadeh
> Fix For: 2.3.0
>
>
> This does not work with CLUSTERED BY clause in Spark 2 now!
> CREATE TABLE test.dummy2
>  (
>  ID INT
>, CLUSTERED INT
>, SCATTERED INT
>, RANDOMISED INT
>, RANDOM_STRING VARCHAR(50)
>, SMALL_VC VARCHAR(10)
>, PADDING  VARCHAR(10)
> )
> CLUSTERED BY (ID) INTO 256 BUCKETS
> STORED AS ORC
> TBLPROPERTIES ( "orc.compress"="SNAPPY",
> "orc.create.index"="true",
> "orc.bloom.filter.columns"="ID",
> "orc.bloom.filter.fpp"="0.05",
> "orc.stripe.size"="268435456",
> "orc.row.index.stride"="1" )
> scala> HiveContext.sql(sqltext)
> org.apache.spark.sql.catalyst.parser.ParseException:
> Operation not allowed: CREATE TABLE ... CLUSTERED BY(line 2, pos 0)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17047) Spark 2 cannot create table when CLUSTERED.

2017-10-05 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-17047:
--
Component/s: SQL

> Spark 2 cannot create table when CLUSTERED.
> ---
>
> Key: SPARK-17047
> URL: https://issues.apache.org/jira/browse/SPARK-17047
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.1, 2.2.0
>Reporter: Dr Mich Talebzadeh
> Fix For: 2.3.0
>
>
> This does not work with CLUSTERED BY clause in Spark 2 now!
> CREATE TABLE test.dummy2
>  (
>  ID INT
>, CLUSTERED INT
>, SCATTERED INT
>, RANDOMISED INT
>, RANDOM_STRING VARCHAR(50)
>, SMALL_VC VARCHAR(10)
>, PADDING  VARCHAR(10)
> )
> CLUSTERED BY (ID) INTO 256 BUCKETS
> STORED AS ORC
> TBLPROPERTIES ( "orc.compress"="SNAPPY",
> "orc.create.index"="true",
> "orc.bloom.filter.columns"="ID",
> "orc.bloom.filter.fpp"="0.05",
> "orc.stripe.size"="268435456",
> "orc.row.index.stride"="1" )
> scala> HiveContext.sql(sqltext)
> org.apache.spark.sql.catalyst.parser.ParseException:
> Operation not allowed: CREATE TABLE ... CLUSTERED BY(line 2, pos 0)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17047) Spark 2 cannot create table when CLUSTERED.

2017-10-05 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-17047:
--
Affects Version/s: 2.1.1
   2.2.0

> Spark 2 cannot create table when CLUSTERED.
> ---
>
> Key: SPARK-17047
> URL: https://issues.apache.org/jira/browse/SPARK-17047
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.1, 2.2.0
>Reporter: Dr Mich Talebzadeh
> Fix For: 2.3.0
>
>
> This does not work with CLUSTERED BY clause in Spark 2 now!
> CREATE TABLE test.dummy2
>  (
>  ID INT
>, CLUSTERED INT
>, SCATTERED INT
>, RANDOMISED INT
>, RANDOM_STRING VARCHAR(50)
>, SMALL_VC VARCHAR(10)
>, PADDING  VARCHAR(10)
> )
> CLUSTERED BY (ID) INTO 256 BUCKETS
> STORED AS ORC
> TBLPROPERTIES ( "orc.compress"="SNAPPY",
> "orc.create.index"="true",
> "orc.bloom.filter.columns"="ID",
> "orc.bloom.filter.fpp"="0.05",
> "orc.stripe.size"="268435456",
> "orc.row.index.stride"="1" )
> scala> HiveContext.sql(sqltext)
> org.apache.spark.sql.catalyst.parser.ParseException:
> Operation not allowed: CREATE TABLE ... CLUSTERED BY(line 2, pos 0)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17047) Spark 2 cannot create table when CLUSTERED.

2017-10-05 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-17047:
--
Summary: Spark 2 cannot create table when CLUSTERED.  (was: Spark 2 cannot 
create ORC table when CLUSTERED.)

> Spark 2 cannot create table when CLUSTERED.
> ---
>
> Key: SPARK-17047
> URL: https://issues.apache.org/jira/browse/SPARK-17047
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Dr Mich Talebzadeh
>
> This does not work with CLUSTERED BY clause in Spark 2 now!
> CREATE TABLE test.dummy2
>  (
>  ID INT
>, CLUSTERED INT
>, SCATTERED INT
>, RANDOMISED INT
>, RANDOM_STRING VARCHAR(50)
>, SMALL_VC VARCHAR(10)
>, PADDING  VARCHAR(10)
> )
> CLUSTERED BY (ID) INTO 256 BUCKETS
> STORED AS ORC
> TBLPROPERTIES ( "orc.compress"="SNAPPY",
> "orc.create.index"="true",
> "orc.bloom.filter.columns"="ID",
> "orc.bloom.filter.fpp"="0.05",
> "orc.stripe.size"="268435456",
> "orc.row.index.stride"="1" )
> scala> HiveContext.sql(sqltext)
> org.apache.spark.sql.catalyst.parser.ParseException:
> Operation not allowed: CREATE TABLE ... CLUSTERED BY(line 2, pos 0)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22202) Release tgz content differences for python and R

2017-10-05 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193247#comment-16193247
 ] 

holdenk commented on SPARK-22202:
-

[~felixcheung] For Python I think it would not be bad to be consistent, but I'd 
personally put it at a trivial rather than major level. The fix could be the 
same for both (e.g. create the tgz files _then_ run the Python/R packaging), so 
I think keeping it together is fine.

> Release tgz content differences for python and R
> 
>
> Key: SPARK-22202
> URL: https://issues.apache.org/jira/browse/SPARK-22202
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SparkR
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>
> As a follow-up to SPARK-22167: currently we are running different 
> profiles/steps in make-release.sh for hadoop2.7 vs hadoop2.6 (and others), so 
> we should consider whether these differences are significant and whether they 
> should be addressed.
> A couple of things:
> - R.../doc directory is not in any release jar except hadoop 2.6
> - python/dist, python.egg-info are not in any release jar except hadoop 2.7
> - R DESCRIPTION has a few additions
> I've checked to confirm these are the same in 2.1.1 release so this isn't a 
> regression.
> {code}
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/doc:
> sparkr-vignettes.Rmd
> sparkr-vignettes.R
> sparkr-vignettes.html
> index.html
> Only in spark-2.1.2-bin-hadoop2.7/python: dist
> Only in spark-2.1.2-bin-hadoop2.7/python/pyspark: python
> Only in spark-2.1.2-bin-hadoop2.7/python: pyspark.egg-info
> diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/DESCRIPTION 
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/DESCRIPTION
> 25a26,27
> > NeedsCompilation: no
> > Packaged: 2017-10-03 00:42:30 UTC; holden
> 31c33
> < Built: R 3.4.1; ; 2017-10-02 23:18:21 UTC; unix
> ---
> > Built: R 3.4.1; ; 2017-10-03 00:45:27 UTC; unix
> Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR: doc
> diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/html/00Index.html 
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/html/00Index.html
> 16a17
> > User guides, package vignettes and other 
> > documentation.
> Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/Meta: vignette.rds
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21785) Support create table from a file schema

2017-10-05 Thread Jacky Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193209#comment-16193209
 ] 

Jacky Shen edited comment on SPARK-21785 at 10/5/17 5:28 PM:
-

Yes, we can create a table from an existing parquet file, and that is a good solution.

But for the request itself, the user wants to create a table with the schema from 
the given parquet file and specify a different location for the table, and maybe 
other table properties as well.
That is what "create table ... using parquet" cannot achieve.


{code:java}
spark-sql> create table table2 using parquet options(path 
'test-data/dec-in-fixed-len.parquet') location '/tmp/table1';
Error in query: 
LOCATION and 'path' in OPTIONS are both used to indicate the custom table path, 
you can only specify one of them.(line 1, pos 0)

{code}



was (Author: jacshen):
Yes, we can create a table from an existing parquet file, and that is a good solution.

But for the request itself, the user wants to create a table with the schema from 
the given parquet file and specify a different location for the table, and maybe 
other table properties as well.
That is what "create table ... using parquet" cannot achieve.


{code:java}
spark-sql> create table table2 using parquet options(path 
'/Users/jacshen/project/git-rep/spark/sql/core/src/test/resources/test-data/dec-in-fixed-len.parquet')
 location '/tmp/table1';
Error in query: 
LOCATION and 'path' in OPTIONS are both used to indicate the custom table path, 
you can only specify one of them.(line 1, pos 0)

{code}


> Support create table from a file schema
> ---
>
> Key: SPARK-21785
> URL: https://issues.apache.org/jira/browse/SPARK-21785
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: liupengcheng
>  Labels: sql
>
> Currently Spark does not support creating a table from a file schema, 
> for example:
> {code:java}
> CREATE EXTERNAL TABLE IF NOT EXISTS test LIKE PARQUET 
> '/user/test/abc/a.snappy.parquet' STORED AS PARQUET LOCATION
> '/user/test/def/';
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22202) Release tgz content differences for python and R

2017-10-05 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193239#comment-16193239
 ] 

Felix Cheung commented on SPARK-22202:
--

[~holden.ka...@gmail.com] Would you be concerned about the Python differences?
If not, I'll turn this into an R-only issue.

> Release tgz content differences for python and R
> 
>
> Key: SPARK-22202
> URL: https://issues.apache.org/jira/browse/SPARK-22202
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SparkR
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>
> As a follow-up to SPARK-22167: currently we are running different 
> profiles/steps in make-release.sh for hadoop2.7 vs hadoop2.6 (and others), so 
> we should consider whether these differences are significant and whether they 
> should be addressed.
> A couple of things:
> - R.../doc directory is not in any release jar except hadoop 2.6
> - python/dist, python.egg-info are not in any release jar except hadoop 2.7
> - R DESCRIPTION has a few additions
> I've checked to confirm these are the same in 2.1.1 release so this isn't a 
> regression.
> {code}
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/doc:
> sparkr-vignettes.Rmd
> sparkr-vignettes.R
> sparkr-vignettes.html
> index.html
> Only in spark-2.1.2-bin-hadoop2.7/python: dist
> Only in spark-2.1.2-bin-hadoop2.7/python/pyspark: python
> Only in spark-2.1.2-bin-hadoop2.7/python: pyspark.egg-info
> diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/DESCRIPTION 
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/DESCRIPTION
> 25a26,27
> > NeedsCompilation: no
> > Packaged: 2017-10-03 00:42:30 UTC; holden
> 31c33
> < Built: R 3.4.1; ; 2017-10-02 23:18:21 UTC; unix
> ---
> > Built: R 3.4.1; ; 2017-10-03 00:45:27 UTC; unix
> Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR: doc
> diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/html/00Index.html 
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/html/00Index.html
> 16a17
> > User guides, package vignettes and other 
> > documentation.
> Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/Meta: vignette.rds
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22202) Release tgz content differences for python and R

2017-10-05 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193238#comment-16193238
 ] 

Felix Cheung commented on SPARK-22202:
--

Yes, exactly.

> Release tgz content differences for python and R
> 
>
> Key: SPARK-22202
> URL: https://issues.apache.org/jira/browse/SPARK-22202
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SparkR
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>
> As a follow-up to SPARK-22167: currently we are running different 
> profiles/steps in make-release.sh for hadoop2.7 vs hadoop2.6 (and others), so 
> we should consider whether these differences are significant and whether they 
> should be addressed.
> A couple of things:
> - R.../doc directory is not in any release jar except hadoop 2.6
> - python/dist, python.egg-info are not in any release jar except hadoop 2.7
> - R DESCRIPTION has a few additions
> I've checked to confirm these are the same in 2.1.1 release so this isn't a 
> regression.
> {code}
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/doc:
> sparkr-vignettes.Rmd
> sparkr-vignettes.R
> sparkr-vignettes.html
> index.html
> Only in spark-2.1.2-bin-hadoop2.7/python: dist
> Only in spark-2.1.2-bin-hadoop2.7/python/pyspark: python
> Only in spark-2.1.2-bin-hadoop2.7/python: pyspark.egg-info
> diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/DESCRIPTION 
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/DESCRIPTION
> 25a26,27
> > NeedsCompilation: no
> > Packaged: 2017-10-03 00:42:30 UTC; holden
> 31c33
> < Built: R 3.4.1; ; 2017-10-02 23:18:21 UTC; unix
> ---
> > Built: R 3.4.1; ; 2017-10-03 00:45:27 UTC; unix
> Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR: doc
> diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/html/00Index.html 
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/html/00Index.html
> 16a17
> > User guides, package vignettes and other 
> > documentation.
> Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/Meta: vignette.rds
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21785) Support create table from a file schema

2017-10-05 Thread Jacky Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193209#comment-16193209
 ] 

Jacky Shen edited comment on SPARK-21785 at 10/5/17 5:12 PM:
-

Yes, we can create a table from an existing parquet file, and that is a good solution.

But for the request itself, the user wants to create a table with the schema from 
the given parquet file and specify a different location for the table, and maybe 
other table properties as well.
That is what "create table ... using parquet" cannot achieve.


{code:java}
spark-sql> create table table2 using parquet options(path 
'/Users/jacshen/project/git-rep/spark/sql/core/src/test/resources/test-data/dec-in-fixed-len.parquet')
 location '/tmp/table1';
Error in query: 
LOCATION and 'path' in OPTIONS are both used to indicate the custom table path, 
you can only specify one of them.(line 1, pos 0)

{code}



was (Author: jacshen):
Yes, we can create a table from an existing parquet file, and that is a good solution.

But for the request itself, the user wants to create a table with the schema from 
the given parquet file and specify a different location for the table, and maybe 
other table properties as well.
That is what "create table ... using parquet" cannot achieve.


{code:sql}
spark-sql> create table table2 using parquet options(path 
'/Users/jacshen/project/git-rep/spark/sql/core/src/test/resources/test-data/dec-in-fixed-len.parquet')
 location '/tmp/table1';
Error in query: 
LOCATION and 'path' in OPTIONS are both used to indicate the custom table path, 
you can only specify one of them.(line 1, pos 0)

{code}


> Support create table from a file schema
> ---
>
> Key: SPARK-21785
> URL: https://issues.apache.org/jira/browse/SPARK-21785
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: liupengcheng
>  Labels: sql
>
> Currently Spark does not support creating a table from a file schema, 
> for example:
> {code:java}
> CREATE EXTERNAL TABLE IF NOT EXISTS test LIKE PARQUET 
> '/user/test/abc/a.snappy.parquet' STORED AS PARQUET LOCATION
> '/user/test/def/';
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21785) Support create table from a file schema

2017-10-05 Thread Jacky Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193209#comment-16193209
 ] 

Jacky Shen commented on SPARK-21785:


Yes, we can create a table from an existing parquet file, and that is a good solution.

But for the request itself, the user wants to create a table with the schema from 
the given parquet file and specify a different location for the table, and maybe 
other table properties as well.
That is what "create table ... using parquet" cannot achieve.


{code:sql}
spark-sql> create table table2 using parquet options(path 
'/Users/jacshen/project/git-rep/spark/sql/core/src/test/resources/test-data/dec-in-fixed-len.parquet')
 location '/tmp/table1';
Error in query: 
LOCATION and 'path' in OPTIONS are both used to indicate the custom table path, 
you can only specify one of them.(line 1, pos 0)

{code}
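
One way to get close to the requested behaviour today is to read only the 
schema from the source file and then create the table at a separate location 
with it. A sketch, reusing the paths from the quoted request 
(spark.catalog.createTable is the Spark 2.2 name; earlier releases have the 
equivalent createExternalTable):

{code}
import org.apache.spark.sql.types.StructType

// Borrow the schema from the existing parquet file (only its footer is inspected)
val schema: StructType = spark.read.parquet("/user/test/abc/a.snappy.parquet").schema

// Create a parquet table with that schema at a different location
spark.catalog.createTable(
  "test",
  "parquet",
  schema,
  Map("path" -> "/user/test/def/"))
{code}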


> Support create table from a file schema
> ---
>
> Key: SPARK-21785
> URL: https://issues.apache.org/jira/browse/SPARK-21785
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: liupengcheng
>  Labels: sql
>
> Currently Spark does not support creating a table from a file schema, 
> for example:
> {code:java}
> CREATE EXTERNAL TABLE IF NOT EXISTS test LIKE PARQUET 
> '/user/test/abc/a.snappy.parquet' STORED AS PARQUET LOCATION
> '/user/test/def/';
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21866) SPIP: Image support in Spark

2017-10-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21866:


Assignee: Apache Spark

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>Assignee: Apache Spark
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
> ** The exact channel ordering and meaning of each channel is dictated by 
> convention. By default, the order is RGB (3 channels) and BGRA (4 channels).
> If the image failed to load, the value is the empty string "".
> * StructField("origin", StringType(), True),
> ** Some information about 

[jira] [Assigned] (SPARK-21866) SPIP: Image support in Spark

2017-10-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21866:


Assignee: (was: Apache Spark)

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
> ** The exact channel ordering and meaning of each channel is dictated by 
> convention. By default, the order is RGB (3 channels) and BGRA (4 channels).
> If the image failed to load, the value is the empty string "".
> * StructField("origin", StringType(), True),
> ** Some information about the origin of the image. 

[jira] [Resolved] (SPARK-22179) percentile_approx should choose the first element if it already reaches the percentage

2017-10-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22179.
---
Resolution: Duplicate

OK, I think this is effectively taken over by the new issue?
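
For reference, the behaviour described in the quoted report is easy to check 
from a shell (values 1 to 10; which value comes back for the 10th percentile 
depends on whether the follow-up fix is in place):

{code}
val df = spark.range(1, 11).toDF("v")                      // values 1..10
df.selectExpr("percentile_approx(v, 0.1) AS p10").show()
{code}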

> percentile_approx should choose the first element if it already reaches the 
> percentage
> --
>
> Key: SPARK-22179
> URL: https://issues.apache.org/jira/browse/SPARK-22179
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>
> percentile_approx should choose the first element if it already reaches the 
> percentage.
> For example, given input data 1 to 10, if a user queries 10% (or even less) 
> percentile, it should return 1 (instead of 2), because the first value 1 
> already reaches 10%. Currently it returns a wrong answer: 2.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-10-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193212#comment-16193212
 ] 

Apache Spark commented on SPARK-21866:
--

User 'imatiach-msft' has created a pull request for this issue:
https://github.com/apache/spark/pull/19439

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
> ** The exact channel ordering and meaning of each channel is dictated by 
> convention. By default, the order is RGB (3 channels) and BGRA (4 channels).
> If the image failed to load, the value is the empty string 

[jira] [Commented] (SPARK-19984) ERROR codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'

2017-10-05 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193201#comment-16193201
 ] 

Kazuaki Ishizaki commented on SPARK-19984:
--

[~JohnSteidley] Thank you for your comment. I have now updated my program and 
get the same physical plan as the one you provided. However, my program works 
on the Spark 2.1.2 RC and the master branch.
While the physical plan tries to {{SortMergeJoin}} two strings, the generated 
code seems to {{SortMergeJoin}} a string and a long value for {{count(1)}}. That 
strange {{SortMergeJoin}} is what I cannot understand.

{code}
== Physical Plan ==
*HashAggregate(keys=[], functions=[count(1)], output=[count#35L])
+- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#40L])
   +- *Project
  +- *SortMergeJoin [A#25], [A#20], Inner
 :- *Sort [A#25 ASC NULLS FIRST], false, 0
 :  +- Exchange hashpartitioning(A#25, 1)
 : +- *Project [id#16 AS A#25]
 :+- *Filter isnotnull(id#16)
 :   +- *FileScan parquet [id#16] Batched: true, Format: 
Parquet, Location: InMemoryFileIndex[file:..., PartitionFilters: [], 
PushedFilters: [IsNotNull(id)], ReadSchema: struct
 +- *Sort [A#20 ASC NULLS FIRST], false, 0
+- *Project [id#16 AS A#20]
   +- *Filter isnotnull(id#16)
  +- *GlobalLimit 3
 +- Exchange SinglePartition
+- *LocalLimit 3
   +- *FileScan parquet [id#16] Batched: true, Format: 
Parquet, Location: InMemoryFileIndex[file:..., PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct
{code}

> ERROR codegen.CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java'
> -
>
> Key: SPARK-19984
> URL: https://issues.apache.org/jira/browse/SPARK-19984
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Andrey Yakovenko
> Attachments: after_adding_count.txt, before_adding_count.txt
>
>
> I had this error a few times on my local Hadoop 2.7.3 + Spark 2.1.0 
> environment. It is not a permanent error; the next time I run the job it may 
> disappear. Unfortunately I don't know how to reproduce the issue. As you can 
> see from the log, my logic is pretty complicated.
> Here is part of the log I've got (container_1489514660953_0015_01_01)
> {code}
> 17/03/16 11:07:04 ERROR codegen.CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 151, Column 29: A method named "compare" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ final class GeneratedIterator extends 
> org.apache.spark.sql.execution.BufferedRowIterator {
> /* 006 */   private Object[] references;
> /* 007 */   private scala.collection.Iterator[] inputs;
> /* 008 */   private boolean agg_initAgg;
> /* 009 */   private boolean agg_bufIsNull;
> /* 010 */   private long agg_bufValue;
> /* 011 */   private boolean agg_initAgg1;
> /* 012 */   private boolean agg_bufIsNull1;
> /* 013 */   private long agg_bufValue1;
> /* 014 */   private scala.collection.Iterator smj_leftInput;
> /* 015 */   private scala.collection.Iterator smj_rightInput;
> /* 016 */   private InternalRow smj_leftRow;
> /* 017 */   private InternalRow smj_rightRow;
> /* 018 */   private UTF8String smj_value2;
> /* 019 */   private java.util.ArrayList smj_matches;
> /* 020 */   private UTF8String smj_value3;
> /* 021 */   private UTF8String smj_value4;
> /* 022 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> smj_numOutputRows;
> /* 023 */   private UnsafeRow smj_result;
> /* 024 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder smj_holder;
> /* 025 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> smj_rowWriter;
> /* 026 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_numOutputRows;
> /* 027 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_aggTime;
> /* 028 */   private UnsafeRow agg_result;
> /* 029 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder;
> /* 030 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> agg_rowWriter;
> /* 031 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_numOutputRows1;
> /* 032 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_aggTime1;
> /* 033 */   private UnsafeRow agg_result1;
> /* 034 */   private 
> 

[jira] [Commented] (SPARK-19984) ERROR codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'

2017-10-05 Thread John Steidley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193120#comment-16193120
 ] 

John Steidley commented on SPARK-19984:
---

[~kiszk] I noticed one other difference in our plans: yours has count(A#52), 
while mine has count(1). Could that be significant?
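
As background, count(1) counts rows while count(A#52) counts only non-null 
values of A, so the two plans carry different aggregate expressions and 
therefore different generated code. A generic spark-shell illustration, not 
taken from the reporters' jobs:

{code}
import org.apache.spark.sql.functions.{col, count, lit}

val df = spark.range(3).selectExpr("id", "IF(id = 0, NULL, id) AS A")
df.agg(count(lit(1)).as("count_1"), count(col("A")).as("count_A")).show()
// count_1 = 3 (all rows), count_A = 2 (the NULL value of A is not counted)
{code}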

> ERROR codegen.CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java'
> -
>
> Key: SPARK-19984
> URL: https://issues.apache.org/jira/browse/SPARK-19984
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Andrey Yakovenko
> Attachments: after_adding_count.txt, before_adding_count.txt
>
>
> I had this error a few times on my local Hadoop 2.7.3 + Spark 2.1.0 
> environment. It is not a permanent error; the next time I run the job it may 
> disappear. Unfortunately I don't know how to reproduce the issue. As you can 
> see from the log, my logic is pretty complicated.
> Here is part of the log I've got (container_1489514660953_0015_01_01)
> {code}
> 17/03/16 11:07:04 ERROR codegen.CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 151, Column 29: A method named "compare" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ final class GeneratedIterator extends 
> org.apache.spark.sql.execution.BufferedRowIterator {
> /* 006 */   private Object[] references;
> /* 007 */   private scala.collection.Iterator[] inputs;
> /* 008 */   private boolean agg_initAgg;
> /* 009 */   private boolean agg_bufIsNull;
> /* 010 */   private long agg_bufValue;
> /* 011 */   private boolean agg_initAgg1;
> /* 012 */   private boolean agg_bufIsNull1;
> /* 013 */   private long agg_bufValue1;
> /* 014 */   private scala.collection.Iterator smj_leftInput;
> /* 015 */   private scala.collection.Iterator smj_rightInput;
> /* 016 */   private InternalRow smj_leftRow;
> /* 017 */   private InternalRow smj_rightRow;
> /* 018 */   private UTF8String smj_value2;
> /* 019 */   private java.util.ArrayList smj_matches;
> /* 020 */   private UTF8String smj_value3;
> /* 021 */   private UTF8String smj_value4;
> /* 022 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> smj_numOutputRows;
> /* 023 */   private UnsafeRow smj_result;
> /* 024 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder smj_holder;
> /* 025 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> smj_rowWriter;
> /* 026 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_numOutputRows;
> /* 027 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_aggTime;
> /* 028 */   private UnsafeRow agg_result;
> /* 029 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder;
> /* 030 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> agg_rowWriter;
> /* 031 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_numOutputRows1;
> /* 032 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_aggTime1;
> /* 033 */   private UnsafeRow agg_result1;
> /* 034 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder1;
> /* 035 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> agg_rowWriter1;
> /* 036 */
> /* 037 */   public GeneratedIterator(Object[] references) {
> /* 038 */ this.references = references;
> /* 039 */   }
> /* 040 */
> /* 041 */   public void init(int index, scala.collection.Iterator[] inputs) {
> /* 042 */ partitionIndex = index;
> /* 043 */ this.inputs = inputs;
> /* 044 */ wholestagecodegen_init_0();
> /* 045 */ wholestagecodegen_init_1();
> /* 046 */
> /* 047 */   }
> /* 048 */
> /* 049 */   private void wholestagecodegen_init_0() {
> /* 050 */ agg_initAgg = false;
> /* 051 */
> /* 052 */ agg_initAgg1 = false;
> /* 053 */
> /* 054 */ smj_leftInput = inputs[0];
> /* 055 */ smj_rightInput = inputs[1];
> /* 056 */
> /* 057 */ smj_rightRow = null;
> /* 058 */
> /* 059 */ smj_matches = new java.util.ArrayList();
> /* 060 */
> /* 061 */ this.smj_numOutputRows = 
> (org.apache.spark.sql.execution.metric.SQLMetric) references[0];
> /* 062 */ smj_result = new UnsafeRow(2);
> /* 063 */ this.smj_holder = new 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(smj_result, 
> 64);
> /* 064 */ this.smj_rowWriter = new 
> 

[jira] [Commented] (SPARK-22202) Release tgz content differences for python and R

2017-10-05 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193111#comment-16193111
 ] 

Shivaram Venkataraman commented on SPARK-22202:
---

I think the differences happen because we build the CRAN package from one of 
the Hadoop versions?

> Release tgz content differences for python and R
> 
>
> Key: SPARK-22202
> URL: https://issues.apache.org/jira/browse/SPARK-22202
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SparkR
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>
> As a follow-up to SPARK-22167, we are currently running different 
> profiles/steps in make-release.sh for hadoop2.7 vs hadoop2.6 (and others); we 
> should consider whether these differences are significant and whether they should 
> be addressed.
> A couple of things:
> - R.../doc directory is not in any release jar except hadoop 2.6
> - python/dist, python.egg-info are not in any release jar except hadoop 2.7
> - R DESCRIPTION has a few additions
> I've checked to confirm these are the same in 2.1.1 release so this isn't a 
> regression.
> {code}
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/doc:
> sparkr-vignettes.Rmd
> sparkr-vignettes.R
> sparkr-vignettes.html
> index.html
> Only in spark-2.1.2-bin-hadoop2.7/python: dist
> Only in spark-2.1.2-bin-hadoop2.7/python/pyspark: python
> Only in spark-2.1.2-bin-hadoop2.7/python: pyspark.egg-info
> diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/DESCRIPTION 
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/DESCRIPTION
> 25a26,27
> > NeedsCompilation: no
> > Packaged: 2017-10-03 00:42:30 UTC; holden
> 31c33
> < Built: R 3.4.1; ; 2017-10-02 23:18:21 UTC; unix
> ---
> > Built: R 3.4.1; ; 2017-10-03 00:45:27 UTC; unix
> Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR: doc
> diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/html/00Index.html 
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/html/00Index.html
> 16a17
> > User guides, package vignettes and other 
> > documentation.
> Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/Meta: vignette.rds
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22208) Improve percentile_approx by not rounding up targetError and starting from index 0

2017-10-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193078#comment-16193078
 ] 

Apache Spark commented on SPARK-22208:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/19438

> Improve percentile_approx by not rounding up targetError and starting from 
> index 0
> --
>
> Key: SPARK-22208
> URL: https://issues.apache.org/jira/browse/SPARK-22208
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>
> percentile_approx never returns the first element when percentile is in 
> (relativeError, 1/N], where the default relativeError is 1/10000, and N is the 
> total number of elements. But ideally, percentiles in [0, 1/N] should all 
> return the first element as the answer.
> For example, given input data 1 to 10, if a user queries 10% (or even less) 
> percentile, it should return 1, because the first value 1 already reaches 
> 10%. Currently it returns 2.
> Based on the paper, targetError is not rounded up, and searching index should 
> start from 0 instead of 1. By following the paper, we should be able to fix 
> the cases mentioned above.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22208) Improve percentile_approx by not rounding up targetError and starting from index 0

2017-10-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22208:


Assignee: Apache Spark

> Improve percentile_approx by not rounding up targetError and starting from 
> index 0
> --
>
> Key: SPARK-22208
> URL: https://issues.apache.org/jira/browse/SPARK-22208
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Apache Spark
>
> percentile_approx never returns the first element when percentile is in 
> (relativeError, 1/N], where the default relativeError is 1/10000, and N is the 
> total number of elements. But ideally, percentiles in [0, 1/N] should all 
> return the first element as the answer.
> For example, given input data 1 to 10, if a user queries 10% (or even less) 
> percentile, it should return 1, because the first value 1 already reaches 
> 10%. Currently it returns 2.
> Based on the paper, targetError is not rounded up, and searching index should 
> start from 0 instead of 1. By following the paper, we should be able to fix 
> the cases mentioned above.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22208) Improve percentile_approx by not rounding up targetError and starting from index 0

2017-10-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22208:


Assignee: (was: Apache Spark)

> Improve percentile_approx by not rounding up targetError and starting from 
> index 0
> --
>
> Key: SPARK-22208
> URL: https://issues.apache.org/jira/browse/SPARK-22208
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>
> percentile_approx never returns the first element when percentile is in 
> (relativeError, 1/N], where the default relativeError is 1/10000, and N is the 
> total number of elements. But ideally, percentiles in [0, 1/N] should all 
> return the first element as the answer.
> For example, given input data 1 to 10, if a user queries 10% (or even less) 
> percentile, it should return 1, because the first value 1 already reaches 
> 10%. Currently it returns 2.
> Based on the paper, targetError is not rounded up, and searching index should 
> start from 0 instead of 1. By following the paper, we should be able to fix 
> the cases mentioned above.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22208) Improve percentile_approx by not rounding up targetError and starting from index 0

2017-10-05 Thread Zhenhua Wang (JIRA)
Zhenhua Wang created SPARK-22208:


 Summary: Improve percentile_approx by not rounding up targetError 
and starting from index 0
 Key: SPARK-22208
 URL: https://issues.apache.org/jira/browse/SPARK-22208
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Zhenhua Wang


percentile_approx never returns the first element when percentile is in 
(relativeError, 1/N], where the default relativeError is 1/10000, and N is the 
total number of elements. But ideally, percentiles in [0, 1/N] should all 
return the first element as the answer.

For example, given input data 1 to 10, if a user queries 10% (or even less) 
percentile, it should return 1, because the first value 1 already reaches 10%. 
Currently it returns 2.

Based on the paper, targetError is not rounded up, and searching index should 
start from 0 instead of 1. By following the paper, we should be able to fix the 
cases mentioned above.
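
A quick way to see the boundary case (a sketch; the setup below is illustrative 
and not taken from the report):

{code}
// Values 1..10: any percentile in (0, 1/N] = (0, 0.1] should ideally return 1.
val df = spark.range(1, 11).toDF("value")
df.createOrReplaceTempView("t")

// Default accuracy is 10000, i.e. relativeError = 1/10000.
spark.sql("SELECT percentile_approx(value, 0.1) FROM t").show()
// Currently this returns 2; with the proposed fix it should return 1.
{code}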



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22206) gapply in R can't work on empty grouping columns

2017-10-05 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-22206:


Assignee: Liang-Chi Hsieh

> gapply in R can't work on empty grouping columns
> 
>
> Key: SPARK-22206
> URL: https://issues.apache.org/jira/browse/SPARK-22206
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.2.1, 2.3.0, 2.1.3
>
>
> {{gapply}} in R invokes {{FlatMapGroupsInRExec}} in runtime, but 
> {{FlatMapGroupsInRExec.requiredChildDistribution}} didn't consider empty 
> grouping attributes. So {{gapply}} can't work on empty grouping columns.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22206) gapply in R can't work on empty grouping columns

2017-10-05 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22206.
--
   Resolution: Fixed
Fix Version/s: 2.1.3
   2.3.0
   2.2.1

Issue resolved by pull request 19436
[https://github.com/apache/spark/pull/19436]

> gapply in R can't work on empty grouping columns
> 
>
> Key: SPARK-22206
> URL: https://issues.apache.org/jira/browse/SPARK-22206
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
> Fix For: 2.2.1, 2.3.0, 2.1.3
>
>
> {{gapply}} in R invokes {{FlatMapGroupsInRExec}} in runtime, but 
> {{FlatMapGroupsInRExec.requiredChildDistribution}} didn't consider empty 
> grouping attributes. So {{gapply}} can't work on empty grouping columns.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22201) Dataframe describe includes string columns

2017-10-05 Thread cold gin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192938#comment-16192938
 ] 

cold gin commented on SPARK-22201:
--

Ok, and thank you, I appreciate your time and feedback. Having the numeric 
columns automatically pre-selected by default would make for a more robust API, 
IMO (i.e. no column list to supply by default). What you said about pandas 
having a parameter to include strings also seems to support the default 
behavior of numeric columns only.
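
In the meantime, a caller can restrict describe() to the numeric columns 
explicitly. A minimal sketch, assuming {{df}} is the DataFrame in question:

{code}
import org.apache.spark.sql.types.NumericType

// Keep only the numeric columns before calling describe(), which is what the
// no-arg API is documented to do.
val numericCols = df.schema.fields
  .filter(_.dataType.isInstanceOf[NumericType])
  .map(_.name)

df.describe(numericCols: _*).show()
{code}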

> Dataframe describe includes string columns
> --
>
> Key: SPARK-22201
> URL: https://issues.apache.org/jira/browse/SPARK-22201
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: cold gin
>Priority: Minor
>
> As per the api documentation, the default no-arg Dataframe describe() 
> function should only include numerical column types, but it is including 
> String types as well. This creates unusable statistical results (for example, 
> max returns "V8903" for one of the string columns in my dataset), and this 
> also leads to stacktraces when you run show() on the resulting dataframe 
> returned from describe().
> There also appears to be several related issues to this:
> https://issues.apache.org/jira/browse/SPARK-16468
> https://issues.apache.org/jira/browse/SPARK-16429
> But SPARK-16429 does not make sense with what the default api says, and only 
> Int, Double, etc (numeric) columns should be included when generating the 
> statistics. 
> Perhaps this reveals the need for a new function to produce stats that make 
> sense only for string columns, or else an additional parameter to describe() 
> to filter in/out certain column types? 
> In summary, the *default* describe api behavior (no arg behavior) should not 
> include string columns. Note that boolean columns are correctly excluded by 
> describe()



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22201) Dataframe describe includes string columns

2017-10-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22201:
--
 Flags:   (was: Patch)
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

I'm neutral on it because it's just the default, and anyone who cares can always 
select the exact columns that are desired. At best it's an enhancement request, 
but I'm not sure it will be taken up.

> Dataframe describe includes string columns
> --
>
> Key: SPARK-22201
> URL: https://issues.apache.org/jira/browse/SPARK-22201
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: cold gin
>Priority: Minor
>
> As per the api documentation, the default no-arg Dataframe describe() 
> function should only include numerical column types, but it is including 
> String types as well. This creates unusable statistical results (for example, 
> max returns "V8903" for one of the string columns in my dataset), and this 
> also leads to stacktraces when you run show() on the resulting dataframe 
> returned from describe().
> There also appears to be several related issues to this:
> https://issues.apache.org/jira/browse/SPARK-16468
> https://issues.apache.org/jira/browse/SPARK-16429
> But SPARK-16429 does not make sense with what the default api says, and only 
> Int, Double, etc (numeric) columns should be included when generating the 
> statistics. 
> Perhaps this reveals the need for a new function to produce stats that make 
> sense only for string columns, or else an additional parameter to describe() 
> to filter in/out certain column types? 
> In summary, the *default* describe api behavior (no arg behavior) should not 
> include string columns. Note that boolean columns are correctly excluded by 
> describe()



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-10-05 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192732#comment-16192732
 ] 

Steve Loughran edited comment on SPARK-21999 at 10/5/17 1:39 PM:
-

Apache projects are all open source, with an open community. The ASF welcomes 
contributions, and does request that people follow an etiquette of constructive 
discourse: https://community.apache.org/contributors/etiquette

Personally, I've always found Sean to be helpful and polite, even when I've 
turned out to be utterly wrong. I'd advise listening to his feedback, and not 
view it as a personal criticism. Whatever problem you have, faulting him isn't 
going to fix it.

Like I said, projects welcome contributors and their contributions. 

# One key area where people can get their hands dirty in a project is 
documentation. Do you think some content in the spark streaming guide could be 
improved? Why not add that section & submit a patch for review?
# Because you'll need the latest source tree to hand to do it, before you even 
worry about documenting a problem, why not see if it "has gone away", or at 
least "moved somewhere else"? Working with the latest versions of the code may 
seem leading edge, but it's the best way to get your problems fixed, and the 
only way to pick up an immediate patch.
# For any issue in a project, it doesn't get fixed ASAP, at least not in a form 
you can pick up and use unless you have that build environment yourself. If you 
are using a software product built on Spark, well, you'll need to wait for that 
supplier to update their release, which happens on their release schedule. For 
the ASF releases, there's a release process and timetable.
# Which means: for now, you have to come up with a solution to your problem. 
Given it's related to structure serialization, well, you can implement your own 
{{readObject}} and {{writeObject}} methods, and so perhaps wrap the structures 
being serialized with something which implements a broader locking structure 
across your data. It's what I'd try first, as it would be an interesting little 
problem: A datastructure which can have mutating operations invoked while 
serialization is in progress. Testing this would be fun...you'd need some kind 
of mock OutputStream which could block on a write() so you could guarantee the 
writer thread was blocking at the correct place for the mutation call to 
trigger the failure.

As to whether it's architectural? That's something which could be debated. 
Checkpointing the state of streaming applications is one of those challenging 
problems, especially if the notion of "where you are" across multiple input 
streams is hard, and, if the cost of preserving state is high, any attempt to 
block for the operation will hurt throughput. And changing checkpoint 
architectures is so fundamental that it's not some one-line patch. 
If you can fix your structures up to be robustly serializable during 
asynchronous writes, you'll get performance as well as robustness.




was (Author: ste...@apache.org):
Apache projects are all open source, with an open community. The ASF welcomes 
contributions, and does request that people follow an etiquette of constructive 
discourse: https://community.apache.org/contributors/etiquette

Personally, I've always found Sean to be helpful and polite, even when I've 
turned out to be utterly wrong. I'd advise listening to his feedback, and not 
view it as a personal criticism. Whatever problem you have, faulting him isn't 
going to fix it.

Like I said, projects welcomes contributors and their contributions. 

# One key area where people can get their hands dirty in a project is 
documentation. Do you think some content in the spark streaming guide could be 
improved? Why not add that section & submit a patch for review?
# Because you'll need the latest source tree to hand to do it, before you even 
worry about documenting a problem, why not see if it "has gone away", or at 
least "moved somewhere else"? Working with the latest versions of the code may 
seem leading edge, but its the best way to get your problems fixed, and the 
only way to pick up an immediate patch.
# For any issue in a project, it doesn't get fixed ASAP, at least not in a form 
you can pick up and use unless you have that build environment yourself. If you 
are using a software product built on spark, well, you'll need to wait for that 
supplier to update their release, which happens on their release schedule. For 
the ASF releases, there's a release process and timetable.
# Which means: for now, you have to come up with a solution to your problem. 
Given it's related to structure serialization, well, you can implement your own 
{{readObject}} and {{writeObject}} methods, and so perhaps wrap the structures 
being serialized with something which implements a broader locking structure 
across your data. It's what I'd try 

[jira] [Comment Edited] (SPARK-22201) Dataframe describe includes string columns

2017-10-05 Thread cold gin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192834#comment-16192834
 ] 

cold gin edited comment on SPARK-22201 at 10/5/17 1:24 PM:
---

Yes, it is only the default behavior that I think should be reversed; I don't 
have a problem at all with supporting the stats for strings. If you look the 
output of the describe() no-args api, it produces several fields (count, mean, 
stddev, etc) - by default. For all of those output attributes to be populated 
by default you must include only numeric columns. This simple evidence of what 
the default output produces is my argument for what should be included as the 
default input. Supporting string columns is fine, but should be controlled with 
an includeColTypes parameter, and not included by default imo.


was (Author: cold-gin):
Yes, it is only the default behavior that I think should be reversed; I don't 
have a problem at all with supporting the stats for strings. If you look the 
output of the describe() no-args api, it produces several fields (count, mean, 
stddev, etc) - by default. For all of those output attributes to be populated 
by default you must include only numeric columns. This simple evidence of what 
the default output produces is my argument for what should be included as the 
default input. Supporting string columns imo is fine, but should be controlled 
with an includeColTypes parameter, and not included by default.

> Dataframe describe includes string columns
> --
>
> Key: SPARK-22201
> URL: https://issues.apache.org/jira/browse/SPARK-22201
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: cold gin
>
> As per the api documentation, the default no-arg Dataframe describe() 
> function should only include numerical column types, but it is including 
> String types as well. This creates unusable statistical results (for example, 
> max returns "V8903" for one of the string columns in my dataset), and this 
> also leads to stacktraces when you run show() on the resulting dataframe 
> returned from describe().
> There also appears to be several related issues to this:
> https://issues.apache.org/jira/browse/SPARK-16468
> https://issues.apache.org/jira/browse/SPARK-16429
> But SPARK-16429 does not make sense with what the default api says, and only 
> Int, Double, etc (numeric) columns should be included when generating the 
> statistics. 
> Perhaps this reveals the need for a new function to produce stats that make 
> sense only for string columns, or else an additional parameter to describe() 
> to filter in/out certain column types? 
> In summary, the *default* describe api behavior (no arg behavior) should not 
> include string columns. Note that boolean columns are correctly excluded by 
> describe()



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22201) Dataframe describe includes string columns

2017-10-05 Thread cold gin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192834#comment-16192834
 ] 

cold gin edited comment on SPARK-22201 at 10/5/17 1:23 PM:
---

Yes, it is only the default behavior that I think should be reversed; I don't 
have a problem at all with supporting the stats for strings. If you look the 
output of the describe() no-args api, it produces several fields (count, mean, 
stddev, etc) - by default. For all of those output attributes to be populated 
by default you must include only numeric columns. This simple evidence of what 
the default output produces is my argument for what should be included as the 
default input. Supporting string columns imo is fine, but should be controlled 
with an includeColTypes parameter, and not included by default.


was (Author: cold-gin):
Yes, it is only the default behavior that I think should be reversed; I don't 
have a problem at all with supporting the stats for strings. If you look the 
default output of the describe() api, it produces several fields (count, mean, 
stddev, etc) - by default. For all of those output attributes to be populated 
by default you must include only numeric columns. This simple evidence of what 
the default output produces is my argument for what should be included as the 
default input. Supporting string columns imo is fine, but should be controlled 
with an includeColTypes parameter, and not included by default.

> Dataframe describe includes string columns
> --
>
> Key: SPARK-22201
> URL: https://issues.apache.org/jira/browse/SPARK-22201
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: cold gin
>
> As per the api documentation, the default no-arg Dataframe describe() 
> function should only include numerical column types, but it is including 
> String types as well. This creates unusable statistical results (for example, 
> max returns "V8903" for one of the string columns in my dataset), and this 
> also leads to stacktraces when you run show() on the resulting dataframe 
> returned from describe().
> There also appears to be several related issues to this:
> https://issues.apache.org/jira/browse/SPARK-16468
> https://issues.apache.org/jira/browse/SPARK-16429
> But SPARK-16429 does not make sense with what the default api says, and only 
> Int, Double, etc (numeric) columns should be included when generating the 
> statistics. 
> Perhaps this reveals the need for a new function to produce stats that make 
> sense only for string columns, or else an additional parameter to describe() 
> to filter in/out certain column types? 
> In summary, the *default* describe api behavior (no arg behavior) should not 
> include string columns. Note that boolean columns are correctly excluded by 
> describe()



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22201) Dataframe describe includes string columns

2017-10-05 Thread cold gin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192834#comment-16192834
 ] 

cold gin edited comment on SPARK-22201 at 10/5/17 1:22 PM:
---

Yes, it is only the default behavior that I think should be reversed; I don't 
have a problem at all with supporting the stats for strings. If you look the 
default output of the describe() api, it produces several fields (count, mean, 
stddev, etc) - by default. For all of those output attributes to be populated 
by default you must include only numeric columns. This simple evidence of what 
the default output produces is my argument for what should be included as the 
default input. Supporting string columns imo is fine, but should be controlled 
with an includeColTypes parameter, and not included by default.


was (Author: cold-gin):
Yes, it is only the default behavior that I think should be reversed; I don't 
have a problem at all with supporting the stats for strings. If you look the 
default output of the describe() api, it produces several fields (count, mean, 
stddev, etc) - by default. For all of those output attributes to be populated 
*by default* you must include only numeric columns. This simple evidence of 
what the default output produces is my argument for what should be included as 
the default input. Supporting string columns imo is fine, but should be 
controlled with an includeColTypes parameter, and not included by default.

> Dataframe describe includes string columns
> --
>
> Key: SPARK-22201
> URL: https://issues.apache.org/jira/browse/SPARK-22201
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: cold gin
>
> As per the api documentation, the default no-arg Dataframe describe() 
> function should only include numerical column types, but it is including 
> String types as well. This creates unusable statistical results (for example, 
> max returns "V8903" for one of the string columns in my dataset), and this 
> also leads to stacktraces when you run show() on the resulting dataframe 
> returned from describe().
> There also appears to be several related issues to this:
> https://issues.apache.org/jira/browse/SPARK-16468
> https://issues.apache.org/jira/browse/SPARK-16429
> But SPARK-16429 does not make sense with what the default api says, and only 
> Int, Double, etc (numeric) columns should be included when generating the 
> statistics. 
> Perhaps this reveals the need for a new function to produce stats that make 
> sense only for string columns, or else an additional parameter to describe() 
> to filter in/out certain column types? 
> In summary, the *default* describe api behavior (no arg behavior) should not 
> include string columns. Note that boolean columns are correctly excluded by 
> describe()



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22201) Dataframe describe includes string columns

2017-10-05 Thread cold gin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192834#comment-16192834
 ] 

cold gin edited comment on SPARK-22201 at 10/5/17 1:19 PM:
---

Yes, it is only the default behavior that I think should be reversed; I don't 
have a problem at all with supporting the stats for strings. If you look the 
default output of the describe() api, it produces several fields (count, mean, 
stddev, etc) - by default. For all of those output attributes to be populated 
*by default* you must include only numeric columns. This simple evidence of 
what the default output produces is my argument for what should be included as 
the default input. Supporting string columns imo is fine, but should be 
controlled with an includeColTypes parameter, and not included by default.


was (Author: cold-gin):
Yes, it is only the default behavior that I think should be reversed; I don't 
have a problem at all with supporting the stats for strings. If you look the 
*default* output of the describe() api, it produces several fields (count, 
mean, stddev, etc) - BY DEFAULT. For all of those output attributes to be 
populated *by default* you must include only numeric columns. This simple 
evidence of what the default output produces is my argument for what should be 
included as default input. Supporting string columns imo is fine, but should be 
controlled with an includeColTypes parameter, and not included by default.

> Dataframe describe includes string columns
> --
>
> Key: SPARK-22201
> URL: https://issues.apache.org/jira/browse/SPARK-22201
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: cold gin
>
> As per the api documentation, the default no-arg Dataframe describe() 
> function should only include numerical column types, but it is including 
> String types as well. This creates unusable statistical results (for example, 
> max returns "V8903" for one of the string columns in my dataset), and this 
> also leads to stacktraces when you run show() on the resulting dataframe 
> returned from describe().
> There also appears to be several related issues to this:
> https://issues.apache.org/jira/browse/SPARK-16468
> https://issues.apache.org/jira/browse/SPARK-16429
> But SPARK-16429 does not make sense with what the default api says, and only 
> Int, Double, etc (numeric) columns should be included when generating the 
> statistics. 
> Perhaps this reveals the need for a new function to produce stats that make 
> sense only for string columns, or else an additional parameter to describe() 
> to filter in/out certain column types? 
> In summary, the *default* describe api behavior (no arg behavior) should not 
> include string columns. Note that boolean columns are correctly excluded by 
> describe()



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22201) Dataframe describe includes string columns

2017-10-05 Thread cold gin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192834#comment-16192834
 ] 

cold gin edited comment on SPARK-22201 at 10/5/17 12:58 PM:


Yes, it is only the default behavior that I think should be reversed; I don't 
have a problem at all with supporting the stats for strings. If you look the 
*default* output of the describe() api, it produces several fields (count, 
mean, stddev, etc) - BY DEFAULT. For all of those output attributes to be 
populated *by default* you must include only numeric columns. This simple 
evidence of what the default output produces is my argument for what should be 
included as default input. Supporting string columns imo is fine, but should be 
controlled with an includeColTypes parameter, and not included by default.


was (Author: cold-gin):
Yes, it is only the default behavior that I think should be reversed; I don't 
have a problem at all with supporting the stats for strings. If you look the 
*default* output of the describe() api, it produces several fields (count, 
mean, stddev, etc) - BY DEFAULT. For all of those output attributes to be 
populated *by default* you must include only numeric columns. This simple 
evidence of what the default output produces is my argument for what should be 
included as default input. It is also exactly the reason that the other person 
in SPARK-16468 sighted the behavior as confusing. Supporting string columns imo 
is fine, but should be controlled with an includeColTypes parameter, and not 
included by default.

> Dataframe describe includes string columns
> --
>
> Key: SPARK-22201
> URL: https://issues.apache.org/jira/browse/SPARK-22201
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: cold gin
>
> As per the api documentation, the default no-arg Dataframe describe() 
> function should only include numerical column types, but it is including 
> String types as well. This creates unusable statistical results (for example, 
> max returns "V8903" for one of the string columns in my dataset), and this 
> also leads to stacktraces when you run show() on the resulting dataframe 
> returned from describe().
> There also appears to be several related issues to this:
> https://issues.apache.org/jira/browse/SPARK-16468
> https://issues.apache.org/jira/browse/SPARK-16429
> But SPARK-16429 does not make sense with what the default api says, and only 
> Int, Double, etc (numeric) columns should be included when generating the 
> statistics. 
> Perhaps this reveals the need for a new function to produce stats that make 
> sense only for string columns, or else an additional parameter to describe() 
> to filter in/out certain column types? 
> In summary, the *default* describe api behavior (no arg behavior) should not 
> include string columns. Note that boolean columns are correctly excluded by 
> describe()



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22201) Dataframe describe includes string columns

2017-10-05 Thread cold gin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192834#comment-16192834
 ] 

cold gin edited comment on SPARK-22201 at 10/5/17 12:58 PM:


Yes, it is only the default behavior that I think should be reversed; I don't 
have a problem at all with supporting the stats for strings. If you look the 
*default* output of the describe() api, it produces several fields (count, 
mean, stddev, etc) - BY DEFAULT. For all of those output attributes to be 
populated *by default* you must include only numeric columns. This simple 
evidence of what the default output produces is my argument for what should be 
included as default input. It is also exactly the reason that the other person 
in SPARK-16468 sighted the behavior as confusing. Supporting string columns imo 
is fine, but should be controlled with an includeColTypes parameter, and not 
included by default.


was (Author: cold-gin):
Yes, it is only the default behavior that I think should be reversed; I don't 
have a problem at all with supporting the stats for strings. If you look the 
*default* output of the describe() api, it produces several fields (count, 
mean, stddev, etc) - BY DEFAULT. For all of those output attributes to be 
populated *by default* it include only numeric columns. This simple evidence of 
what the default output produces is my argument for what should be included as 
default input. It is also exactly the reason that the other person in 
SPARK-16468 sighted the behavior as confusing. Supporting string columns imo is 
fine, but should be controlled with an includeColTypes parameter, and not 
included by default.

> Dataframe describe includes string columns
> --
>
> Key: SPARK-22201
> URL: https://issues.apache.org/jira/browse/SPARK-22201
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: cold gin
>
> As per the api documentation, the default no-arg Dataframe describe() 
> function should only include numerical column types, but it is including 
> String types as well. This creates unusable statistical results (for example, 
> max returns "V8903" for one of the string columns in my dataset), and this 
> also leads to stacktraces when you run show() on the resulting dataframe 
> returned from describe().
> There also appears to be several related issues to this:
> https://issues.apache.org/jira/browse/SPARK-16468
> https://issues.apache.org/jira/browse/SPARK-16429
> But SPARK-16429 does not make sense with what the default api says, and only 
> Int, Double, etc (numeric) columns should be included when generating the 
> statistics. 
> Perhaps this reveals the need for a new function to produce stats that make 
> sense only for string columns, or else an additional parameter to describe() 
> to filter in/out certain column types? 
> In summary, the *default* describe api behavior (no arg behavior) should not 
> include string columns. Note that boolean columns are correctly excluded by 
> describe()



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22201) Dataframe describe includes string columns

2017-10-05 Thread cold gin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192834#comment-16192834
 ] 

cold gin commented on SPARK-22201:
--

Yes, it is only the default behavior that I think should be reversed; I don't 
have a problem at all with supporting the stats for strings. If you look the 
*default* output of the describe() api, it produces several fields (count, 
mean, stddev, etc) - BY DEFAULT. For all of those output attributes to be 
populated *by default* it include only numeric columns. This simple evidence of 
what the default output produces is my argument for what should be included as 
default input. It is also exactly the reason that the other person in 
SPARK-16468 sighted the behavior as confusing. Supporting string columns imo is 
fine, but should be controlled with an includeColTypes parameter, and not 
included by default.

> Dataframe describe includes string columns
> --
>
> Key: SPARK-22201
> URL: https://issues.apache.org/jira/browse/SPARK-22201
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: cold gin
>
> As per the api documentation, the default no-arg Dataframe describe() 
> function should only include numerical column types, but it is including 
> String types as well. This creates unusable statistical results (for example, 
> max returns "V8903" for one of the string columns in my dataset), and this 
> also leads to stacktraces when you run show() on the resulting dataframe 
> returned from describe().
> There also appears to be several related issues to this:
> https://issues.apache.org/jira/browse/SPARK-16468
> https://issues.apache.org/jira/browse/SPARK-16429
> But SPARK-16429 does not make sense with what the default api says, and only 
> Int, Double, etc (numeric) columns should be included when generating the 
> statistics. 
> Perhaps this reveals the need for a new function to produce stats that make 
> sense only for string columns, or else an additional parameter to describe() 
> to filter in/out certain column types? 
> In summary, the *default* describe api behavior (no arg behavior) should not 
> include string columns. Note that boolean columns are correctly excluded by 
> describe()



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22131) Add Mesos Secrets Support to the Mesos Driver

2017-10-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192799#comment-16192799
 ] 

Apache Spark commented on SPARK-22131:
--

User 'susanxhuynh' has created a pull request for this issue:
https://github.com/apache/spark/pull/19437

> Add Mesos Secrets Support to the Mesos Driver
> -
>
> Key: SPARK-22131
> URL: https://issues.apache.org/jira/browse/SPARK-22131
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Arthur Rand
>
> We recently added Secrets support to the Dispatcher (SPARK-20812). In order 
> to have Driver-to-Executor TLS we need the same support in the Mesos Driver 
> so a secret can be disseminated to the executors. This JIRA is to move the 
> current secrets implementation to be used by both frameworks. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-10-05 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192732#comment-16192732
 ] 

Steve Loughran commented on SPARK-21999:


Apache projects are all open source, with an open community. The ASF welcomes 
contributions, and does request that people follow an etiquette of constructive 
discourse: https://community.apache.org/contributors/etiquette

Personally, I've always found Sean to be helpful and polite, even when I've 
turned out to be utterly wrong. I'd advise listening to his feedback, and not 
view it as a personal criticism. Whatever problem you have, faulting him isn't 
going to fix it.

Like I said, projects welcome contributors and their contributions. 

# One key area where people can get their hands dirty in a project is 
documentation. Do you think some content in the spark streaming guide could be 
improved? Why not add that section & submit a patch for review?
# Because you'll need the latest source tree to hand to do it, before you even 
worry about documenting a problem, why not see if it "has gone away", or at 
least "moved somewhere else"? Working with the latest versions of the code may 
seem leading edge, but it's the best way to get your problems fixed, and the 
only way to pick up an immediate patch.
# For any issue in a project, it doesn't get fixed ASAP, at least not in a form 
you can pick up and use unless you have that build environment yourself. If you 
are using a software product built on Spark, well, you'll need to wait for that 
supplier to update their release, which happens on their release schedule. For 
the ASF releases, there's a release process and timetable.
# Which means: for now, you have to come up with a solution to your problem. 
Given it's related to structure serialization, well, you can implement your own 
{{readObject}} and {{writeObject}} methods, and so perhaps wrap the structures 
being serialized with something which implements a broader locking structure 
across your data. It's what I'd try first, as it would be an interesting little 
problem: A datastructure which can have mutating operations invoked while 
serialization is in progress. Testing this would be fun...you'd need some kind 
of mock OutputStream which could block on a write() so you could guarantee the 
writer thread was blocking at the correct place for the mutation call to 
trigger the failure.

As to whether it's architectural, well, that's something which could be 
debated. Checkpointing the state of streaming applications is one of those 
challenging problems, especially if the notion of "where you are" across 
multiple input streams is hard, and, if the cost of preserving state is high, 
any attempt to block for the operation will hurt throughput. And changing 
checkpoint architectures is so fundamental that it's not some one-line patch. 
If you can fix your structures up to be robustly serializable during 
asynchronous writes, you'll get performance as well as robustness.
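
To make point 4 concrete, here is a rough sketch (the wrapper and its names are 
made up for illustration, not taken from the Spark codebase) of guarding a 
mutable list so that mutation and Java serialization synchronize on the same 
monitor:

{code}
import java.io.{ObjectInputStream, ObjectOutputStream}
import java.util.{ArrayList => JArrayList}

// Hypothetical wrapper: every mutation and writeObject() take the same lock,
// so serialization can never observe the list mid-update.
class StateHolder[T] extends Serializable {
  @transient private var lock = new Object
  private val items = new JArrayList[T]()

  def add(item: T): Unit = lock.synchronized { items.add(item) }

  private def writeObject(out: ObjectOutputStream): Unit =
    lock.synchronized { out.defaultWriteObject() }

  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    lock = new Object  // transient fields come back null after deserialization
  }
}
{code}

A mock OutputStream that blocks inside write(), as described above, would still 
be the way to test that the writer thread really waits for in-flight mutations.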



> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch  and fetch the next batch, so it results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at 

[jira] [Updated] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-05 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-22163:
---
Priority: Major  (was: Critical)

> Design Issue of Spark Streaming that Causes Random Run-time Exception
> -
>
> Key: SPARK-22163
> URL: https://issues.apache.org/jira/browse/SPARK-22163
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Structured Streaming
>Affects Versions: 2.2.0
> Environment: Spark Streaming
> Kafka
> Linux
>Reporter: Michael N
>
> The application's objects can contain Lists and can be modified dynamically as 
> well. However, the Spark Streaming framework asynchronously serializes the 
> application's objects as the application runs. Therefore, it causes a random 
> run-time exception on a List when the Spark Streaming framework happens to 
> serialize the application's objects while the application is modifying a List in 
> its own object.
> In fact, there are multiple bugs reported about
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject
> that are permutations of the same root cause. So the design issue of the Spark 
> Streaming framework is that it does this serialization asynchronously. Instead, 
> it should either
> 1. do this serialization synchronously. This is preferred to eliminate the 
> issue completely.  Or
> 2. Allow it to be configured per application whether to do this serialization 
> synchronously or asynchronously, depending on the nature of each application.
> Also, Spark documentation should describe the conditions that trigger Spark 
> to do this type of serialization asynchronously, so the applications can work 
> around them until the fix is provided. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22207) High memory usage when converting relational data to Hierarchical data

2017-10-05 Thread kanika dhuria (JIRA)
kanika dhuria created SPARK-22207:
-

 Summary: High memory usage when converting relational data to 
Hierarchical data
 Key: SPARK-22207
 URL: https://issues.apache.org/jira/browse/SPARK-22207
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: kanika dhuria


Have 4 tables 
lineitems ~1.4Gb,
orders ~ 330MB
customer ~47MB
nations ~ 2.2K

These tables are related as follows
There are multiple lineitems per order (pk, fk:orderkey)
There are multiple orders per customer(pk,fk: cust_key)
There are multiple customers per nation(pk, fk:nation key)

Data is almost evenly distributed.

Building the hierarchy up to 3 levels, i.e. joining lineitems, orders, and 
customers, works fine with executor memory of 4 GB / 2 cores.
Adding nations requires 8 GB / 2 cores or 4 GB / 1 core of memory.

==

{noformat}
val sqlContext = SparkSession.builder() .enableHiveSupport() 
.config("spark.sql.retainGroupColumns", false) 
.config("spark.sql.crossJoin.enabled", true) .getOrCreate()
 
  val orders = sqlContext.sql("select * from orders")
  val lineItem = sqlContext.sql("select * from lineitems")
  
  val customer = sqlContext.sql("select * from customers")
  
  val nation = sqlContext.sql("select * from nations")
  
  val lineitemOrders = 
lineItem.groupBy(col("l_orderkey")).agg(col("l_orderkey"), 
collect_list(struct(col("l_partkey"), 
col("l_suppkey"),col("l_linenumber"),col("l_quantity"),col("l_extendedprice"),col("l_discount"),col("l_tax"),col("l_returnflag"),col("l_linestatus"),col("l_shipdate"),col("l_commitdate"),col("l_receiptdate"),col("l_shipinstruct"),col("l_shipmode"))).as("lineitem")).join(orders,
 orders("O_ORDERKEY")=== lineItem("l_orderkey")).select(col("O_ORDERKEY"), 
col("O_CUSTKEY"),  col("O_ORDERSTATUS"), col("O_TOTALPRICE"), 
col("O_ORDERDATE"), col("O_ORDERPRIORITY"), col("O_CLERK"), 
col("O_SHIPPRIORITY"), col("O_COMMENT"),  col("lineitem"))  
  
  val customerList = 
lineitemOrders.groupBy(col("o_custkey")).agg(col("o_custkey"),collect_list(struct(col("O_ORDERKEY"),
 col("O_CUSTKEY"),  col("O_ORDERSTATUS"), col("O_TOTALPRICE"), 
col("O_ORDERDATE"), col("O_ORDERPRIORITY"), col("O_CLERK"), 
col("O_SHIPPRIORITY"), 
col("O_COMMENT"),col("lineitem"))).as("items")).join(customer,customer("c_custkey")===
 
lineitemOrders("o_custkey")).select(col("c_custkey"),col("c_name"),col("c_nationkey"),col("items"))


 val nationList = 
customerList.groupBy(col("c_nationkey")).agg(col("c_nationkey"),collect_list(struct(col("c_custkey"),col("c_name"),col("c_nationkey"),col("items"))).as("custList")).join(nation,nation("n_nationkey")===customerList("c_nationkey")).select(col("n_nationkey"),col("n_name"),col("custList"))
 
  nationList.write.mode("overwrite").json("filePath")

{noformat}


If the customerList is saved to a file and then the last agg/join is run 
separately, it does run fine with 4 GB / 2 cores.
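
A sketch of that workaround within a single job (the path below is only an 
example): write the intermediate result out and read it back, so the final 
agg/join starts from a fresh plan rather than the full lineage:

{code}
// Illustrative only: materialize customerList, then build nationList from the
// reloaded copy instead of from the full chain of joins and aggregations.
customerList.write.mode("overwrite").parquet("/tmp/customerList")
val customerListReloaded = sqlContext.read.parquet("/tmp/customerList")
{code}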

I can provide the data if needed.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org