[jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results
[ https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194170#comment-16194170 ]

Takeshi Yamamuro commented on SPARK-22211:
------------------------------------------

Probably, the suggested solution does not work when the tables on both sides are small and the limit value is high. IMHO, one option is simply to stop the pushdown for full-outer joins (because I couldn't find a smarter way to solve this...).

> LimitPushDown optimization for FullOuterJoin generates wrong results
> --------------------------------------------------------------------
>
>                 Key: SPARK-22211
>                 URL: https://issues.apache.org/jira/browse/SPARK-22211
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>        Environment: on community.cloude.databrick.com
>                     Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11)
>            Reporter: Benyi Wang
>
> LimitPushDown pushes LocalLimit to one side for a FullOuterJoin, but this may
> generate a wrong result.
> Assume we use limit(1) and the LocalLimit is pushed to the left side, and id=999
> is selected; if on the right side we have 100K rows including 999, the result
> will be:
> - one row (999, 999)
> - the remaining rows (null, xxx)
> Once you call show(), the row (999, 999) has only a 1/10th chance of being
> selected by CollectLimit.
> The actual optimization might be:
> - push down the limit,
> - but convert the join to a broadcast LeftOuterJoin or RightOuterJoin.
> Here is my notebook:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html
> {code:java}
> import scala.util.Random._
> val dl = shuffle(1 to 10).toDF("id")
> val dr = shuffle(1 to 10).toDF("id")
> println("data frame dl:")
> dl.explain
> println("data frame dr:")
> dr.explain
> val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1)
> j.explain
> j.show(false)
> {code}
> {code}
> data frame dl:
> == Physical Plan ==
> LocalTableScan [id#10]
> data frame dr:
> == Physical Plan ==
> LocalTableScan [id#16]
> == Physical Plan ==
> CollectLimit 1
> +- SortMergeJoin [id#10], [id#16], FullOuter
>    :- *Sort [id#10 ASC NULLS FIRST], false, 0
>    :  +- Exchange hashpartitioning(id#10, 200)
>    :     +- *LocalLimit 1
>    :        +- LocalTableScan [id#10]
>    +- *Sort [id#16 ASC NULLS FIRST], false, 0
>       +- Exchange hashpartitioning(id#16, 200)
>          +- LocalTableScan [id#16]
> import scala.util.Random._
> dl: org.apache.spark.sql.DataFrame = [id: int]
> dr: org.apache.spark.sql.DataFrame = [id: int]
> j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int]
> ++---+
> |id |id |
> ++---+
> |null|148|
> ++---+
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
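To make the reported failure mode concrete, here is a plain-Scala sketch (no Spark required) that simulates what the optimizer does. The object and method names (`LimitPushdownDemo`, `fullOuterJoin`) are illustrative, not Spark APIs; the toy join is only meant to show why pushing the limit below one side of a full outer join changes the candidate row set that CollectLimit later picks from.

```scala
// Simulation of the bug: a full outer join where a LocalLimit has been
// pushed below the left side. All names here are illustrative.
object LimitPushdownDemo {
  // A toy full outer join on equal values: matched pairs first,
  // then left-only rows, then right-only rows.
  def fullOuterJoin(left: Seq[Int], right: Seq[Int]): Seq[(Option[Int], Option[Int])] = {
    val r = right.toSet
    val l = left.toSet
    left.filter(x => r.contains(x)).map(x => (Some(x), Some(x))) ++
      left.filterNot(x => r.contains(x)).map(x => (Some(x), None)) ++
      right.filterNot(x => l.contains(x)).map(x => (None, Some(x)))
  }

  val left  = (1 to 10).toVector
  val right = (1 to 10).toVector

  // Correct plan: join first, then apply the limit. Every candidate row is a
  // matched (x, x) pair, so limit(1) can never return a row with a null side.
  val correct = fullOuterJoin(left, right)

  // Buggy plan: LocalLimit 1 pushed to the left side before the join.
  val pushed = fullOuterJoin(left.take(1), right)

  // After pushdown, only 1 of the 10 candidate rows has a non-null left side;
  // CollectLimit keeps an arbitrary row, so (1, 1) survives only by luck.
  val nullLeftRows = pushed.count(_._1.isEmpty)
}
```

This matches the 1/10th-chance observation in the report: the pushed-down plan still produces ten rows, but nine of them pair a null left side with a right-side value.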
[jira] [Commented] (SPARK-8515) Improve ML attribute API
[ https://issues.apache.org/jira/browse/SPARK-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194142#comment-16194142 ] Apache Spark commented on SPARK-8515: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/19442 > Improve ML attribute API > > > Key: SPARK-8515 > URL: https://issues.apache.org/jira/browse/SPARK-8515 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng > Labels: advanced > Attachments: SPARK-8515.pdf > > > In 1.4.0, we introduced ML attribute API to embed feature/label attribute > info inside DataFrame's schema. However, the API is not very friendly to use. > We should re-visit this API and see how we can improve it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8515) Improve ML attribute API
[ https://issues.apache.org/jira/browse/SPARK-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8515: --- Assignee: (was: Apache Spark) > Improve ML attribute API > > > Key: SPARK-8515 > URL: https://issues.apache.org/jira/browse/SPARK-8515 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng > Labels: advanced > Attachments: SPARK-8515.pdf > > > In 1.4.0, we introduced ML attribute API to embed feature/label attribute > info inside DataFrame's schema. However, the API is not very friendly to use. > We should re-visit this API and see how we can improve it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8515) Improve ML attribute API
[ https://issues.apache.org/jira/browse/SPARK-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8515: --- Assignee: Apache Spark > Improve ML attribute API > > > Key: SPARK-8515 > URL: https://issues.apache.org/jira/browse/SPARK-8515 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark > Labels: advanced > Attachments: SPARK-8515.pdf > > > In 1.4.0, we introduced ML attribute API to embed feature/label attribute > info inside DataFrame's schema. However, the API is not very friendly to use. > We should re-visit this API and see how we can improve it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22202) Release tgz content differences for python and R
[ https://issues.apache.org/jira/browse/SPARK-22202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194141#comment-16194141 ] Felix Cheung commented on SPARK-22202: -- [~holden.ka...@gmail.com] actually, I think for R we would go the other way - we would want to include what's in hadoop2.6 only in all other release profiles (ie. run *this* then create tgz) so I think the approaches are potentially opposite for R and python. > Release tgz content differences for python and R > > > Key: SPARK-22202 > URL: https://issues.apache.org/jira/browse/SPARK-22202 > Project: Spark > Issue Type: Bug > Components: PySpark, SparkR >Affects Versions: 2.1.2, 2.2.1, 2.3.0 >Reporter: Felix Cheung >Priority: Minor > > As a follow up to SPARK-22167, currently we are running different > profiles/steps in make-release.sh for hadoop2.7 vs hadoop2.6 (and others), we > should consider if these differences are significant and whether they should > be addressed. > A couple of things: > - R.../doc directory is not in any release jar except hadoop 2.6 > - python/dist, python.egg-info are not in any release jar except hadoop 2.7 > - R DESCRIPTION has a few additions > I've checked to confirm these are the same in 2.1.1 release so this isn't a > regression. 
> {code} > spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/doc: > sparkr-vignettes.Rmd > sparkr-vignettes.R > sparkr-vignettes.html > index.html > Only in spark-2.1.2-bin-hadoop2.7/python: dist > Only in spark-2.1.2-bin-hadoop2.7/python/pyspark: python > Only in spark-2.1.2-bin-hadoop2.7/python: pyspark.egg-info > diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/DESCRIPTION > spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/DESCRIPTION > 25a26,27 > > NeedsCompilation: no > > Packaged: 2017-10-03 00:42:30 UTC; holden > 31c33 > < Built: R 3.4.1; ; 2017-10-02 23:18:21 UTC; unix > --- > > Built: R 3.4.1; ; 2017-10-03 00:45:27 UTC; unix > Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR: doc > diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/html/00Index.html > spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/html/00Index.html > 16a17 > > User guides, package vignettes and other > > documentation. > Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/Meta: vignette.rds > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22202) Release tgz content differences for python and R
[ https://issues.apache.org/jira/browse/SPARK-22202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-22202: - Priority: Minor (was: Major) > Release tgz content differences for python and R > > > Key: SPARK-22202 > URL: https://issues.apache.org/jira/browse/SPARK-22202 > Project: Spark > Issue Type: Bug > Components: PySpark, SparkR >Affects Versions: 2.1.2, 2.2.1, 2.3.0 >Reporter: Felix Cheung >Priority: Minor > > As a follow up to SPARK-22167, currently we are running different > profiles/steps in make-release.sh for hadoop2.7 vs hadoop2.6 (and others), we > should consider if these differences are significant and whether they should > be addressed. > A couple of things: > - R.../doc directory is not in any release jar except hadoop 2.6 > - python/dist, python.egg-info are not in any release jar except hadoop 2.7 > - R DESCRIPTION has a few additions > I've checked to confirm these are the same in 2.1.1 release so this isn't a > regression. > {code} > spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/doc: > sparkr-vignettes.Rmd > sparkr-vignettes.R > sparkr-vignettes.html > index.html > Only in spark-2.1.2-bin-hadoop2.7/python: dist > Only in spark-2.1.2-bin-hadoop2.7/python/pyspark: python > Only in spark-2.1.2-bin-hadoop2.7/python: pyspark.egg-info > diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/DESCRIPTION > spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/DESCRIPTION > 25a26,27 > > NeedsCompilation: no > > Packaged: 2017-10-03 00:42:30 UTC; holden > 31c33 > < Built: R 3.4.1; ; 2017-10-02 23:18:21 UTC; unix > --- > > Built: R 3.4.1; ; 2017-10-03 00:45:27 UTC; unix > Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR: doc > diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/html/00Index.html > spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/html/00Index.html > 16a17 > > User guides, package vignettes and other > > documentation. 
> Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/Meta: vignette.rds
> {code}
[jira] [Comment Edited] (SPARK-8515) Improve ML attribute API
[ https://issues.apache.org/jira/browse/SPARK-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194074#comment-16194074 ] Liang-Chi Hsieh edited comment on SPARK-8515 at 10/6/17 4:52 AM: - I'm working on a new ML attribute API which is supposed to solve the issues like SPARK-19141 and SPARK-21926 and maybe SPARK-12886. The basic attribute design is similar to previous API. But new API supports sparse Vector attributes. So we don't need to keep every attributes in a Vector column. The new API is also trimming down some unnecessary parts in previous API. The design doc is attached in this JIRA. cc [~mlnick] was (Author: viirya): I'm working on a new ML attribute API which is supposed to solve the issues like SPARK-19141 and SPARK-21926 and maybe SPARK-12886. The basic attribute design is similar to previous API. But new API supports sparse Vector attributes. So we don't need to keep every attributes in a Vector column. The new API is also trimming down some unnecessary parts in previous API. cc [~mlnick] > Improve ML attribute API > > > Key: SPARK-8515 > URL: https://issues.apache.org/jira/browse/SPARK-8515 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng > Labels: advanced > Attachments: SPARK-8515.pdf > > > In 1.4.0, we introduced ML attribute API to embed feature/label attribute > info inside DataFrame's schema. However, the API is not very friendly to use. > We should re-visit this API and see how we can improve it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8515) Improve ML attribute API
[ https://issues.apache.org/jira/browse/SPARK-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-8515: --- Attachment: SPARK-8515.pdf > Improve ML attribute API > > > Key: SPARK-8515 > URL: https://issues.apache.org/jira/browse/SPARK-8515 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng > Labels: advanced > Attachments: SPARK-8515.pdf > > > In 1.4.0, we introduced ML attribute API to embed feature/label attribute > info inside DataFrame's schema. However, the API is not very friendly to use. > We should re-visit this API and see how we can improve it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20055) Documentation for CSV datasets in SQL programming guide
[ https://issues.apache.org/jira/browse/SPARK-20055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194129#comment-16194129 ]

Jorge Machado commented on SPARK-20055:
---------------------------------------

[~aash] Should I copy-paste those options? There are some docs already:

{code:java}
def csv(paths: String*): DataFrame

Loads CSV files and returns the result as a DataFrame.

This function will go through the input once to determine the input schema if
inferSchema is enabled. To avoid going through the entire data once, disable
the inferSchema option or specify the schema explicitly using schema.

You can set the following CSV-specific options to deal with CSV files:

sep (default ,): sets a single character as a separator for each field and value.
encoding (default UTF-8): decodes the CSV files by the given encoding type.
quote (default "): sets a single character used for escaping quoted values where the separator can be part of the value. If you would like to turn off quotations, you need to set not null but an empty string. This behaviour is different from com.databricks.spark.csv.
escape (default \): sets a single character used for escaping quotes inside an already quoted value.
comment (default empty string): sets a single character used for skipping lines beginning with this character. By default, it is disabled.
header (default false): uses the first line as names of columns.
inferSchema (default false): infers the input schema automatically from data. It requires one extra pass over the data.
ignoreLeadingWhiteSpace (default false): a flag indicating whether or not leading whitespaces from values being read should be skipped.
ignoreTrailingWhiteSpace (default false): a flag indicating whether or not trailing whitespaces from values being read should be skipped.
nullValue (default empty string): sets the string representation of a null value. Since 2.0.1, this applies to all supported types including the string type.
nanValue (default NaN): sets the string representation of a non-number value.
positiveInf (default Inf): sets the string representation of a positive infinity value.
negativeInf (default -Inf): sets the string representation of a negative infinity value.
dateFormat (default yyyy-MM-dd): sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type.
timestampFormat (default yyyy-MM-dd'T'HH:mm:ss.SSSXXX): sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type.
maxColumns (default 20480): defines a hard limit of how many columns a record can have.
maxCharsPerColumn (default -1): defines the maximum number of characters allowed for any given value being read. By default, it is -1, meaning unlimited length.
mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing. It supports the following case-insensitive modes.
  PERMISSIVE: sets other fields to null when it meets a corrupted record, and puts the malformed string into a field configured by columnNameOfCorruptRecord. To keep corrupt records, a user can set a string type field named columnNameOfCorruptRecord in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When the length of parsed CSV tokens is shorter than the expected length of the schema, it sets null for the extra fields.
  DROPMALFORMED: ignores the whole corrupted records.
  FAILFAST: throws an exception when it meets corrupted records.
columnNameOfCorruptRecord (default is the value specified in spark.sql.columnNameOfCorruptRecord): allows renaming the new field having a malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord.
multiLine (default false): parse one record, which may span multiple lines.
{code}

> Documentation for CSV datasets in SQL programming guide
> -------------------------------------------------------
>
>                 Key: SPARK-20055
>                 URL: https://issues.apache.org/jira/browse/SPARK-20055
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 2.2.0
>            Reporter: Hyukjin Kwon
>
> I guess things commonly used and important are documented there rather than
> documenting everything and every option in the programming guide -
> http://spark.apache.org/docs/latest/sql-programming-guide.html.
> It seems JSON datasets
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
> are documented whereas CSV datasets are not.
> Nowadays, they are pretty similar in APIs and options. Some options are
> notable for both, in particular ones such as {{wholeFile}}. Moreover,
> several options such as {{inferSchema}} and {{header}} are important in CSV
> that affect the
[jira] [Resolved] (SPARK-22159) spark.sql.execution.arrow.enable and spark.sql.codegen.aggregate.map.twolevel.enable -> enabled
[ https://issues.apache.org/jira/browse/SPARK-22159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-22159.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.0

> spark.sql.execution.arrow.enable and
> spark.sql.codegen.aggregate.map.twolevel.enable -> enabled
> ---------------------------------------------------------------
>
>                 Key: SPARK-22159
>                 URL: https://issues.apache.org/jira/browse/SPARK-22159
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>             Fix For: 2.3.0
>
> We should make the config names consistent. They are supposed to end with
> ".enabled", rather than "enable".
[jira] [Resolved] (SPARK-22153) Rename ShuffleExchange -> ShuffleExchangeExec
[ https://issues.apache.org/jira/browse/SPARK-22153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-22153.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.0

> Rename ShuffleExchange -> ShuffleExchangeExec
> ---------------------------------------------
>
>                 Key: SPARK-22153
>                 URL: https://issues.apache.org/jira/browse/SPARK-22153
>             Project: Spark
>          Issue Type: Task
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>             Fix For: 2.3.0
>
> For some reason, when we added the Exec suffix to all physical operators, we
> missed this one. I was looking for this physical operator today and couldn't
> find it, because I was looking for ExchangeExec.
[jira] [Commented] (SPARK-19141) VectorAssembler metadata causing memory issues
[ https://issues.apache.org/jira/browse/SPARK-19141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194075#comment-16194075 ]

Liang-Chi Hsieh commented on SPARK-19141:
-----------------------------------------

I'm working on a new ML attribute API (SPARK-8515) which is supposed to solve issues like SPARK-19141 and SPARK-21926, and maybe SPARK-12886. The basic attribute design is similar to the previous API, but the new API supports sparse vector attributes, so we don't need to keep an attribute for every slot of a Vector column. The new API also trims down some unnecessary parts of the previous API.

> VectorAssembler metadata causing memory issues
> ----------------------------------------------
>
>                 Key: SPARK-19141
>                 URL: https://issues.apache.org/jira/browse/SPARK-19141
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 1.6.0, 2.0.0, 2.1.0
>        Environment: Windows 10, Ubuntu 16.04.1, Scala 2.11.8, Spark 1.6.0, 2.0.0, 2.1.0
>            Reporter: Antonia Oprescu
>
> VectorAssembler produces unnecessary metadata that overflows the Java heap in
> the case of sparse vectors. In the example below, the logical length of the
> vector is 10^6, but the number of non-zero values is only 2.
> The problem arises when the vector assembler creates metadata (ML attributes)
> for each of the 10^6 slots, even if this metadata didn't exist upstream (i.e.
> HashingTF doesn't produce metadata per slot). Here is a chunk of metadata it
> produces:
> {noformat}
> {"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"HashedFeat_0"},{"idx":1,"name":"HashedFeat_1"},{"idx":2,"name":"HashedFeat_2"},{"idx":3,"name":"HashedFeat_3"},{"idx":4,"name":"HashedFeat_4"},{"idx":5,"name":"HashedFeat_5"},{"idx":6,"name":"HashedFeat_6"},{"idx":7,"name":"HashedFeat_7"},{"idx":8,"name":"HashedFeat_8"},{"idx":9,"name":"HashedFeat_9"},...,{"idx":100,"name":"Feat01"}]},"num_attrs":101}}
> {noformat}
> In this lightweight example, the feature size limit seems to be 1,000,000
> when run locally, but this scales poorly with more complicated routines. With
> a larger dataset and a learner (say LogisticRegression), it maxes out
> anywhere between 10k and 100k hash size even on a decent-sized cluster.
> I did some digging, and it seems that the only metadata necessary for
> downstream learners is the one indicating categorical columns. Thus, I
> thought of the following possible solutions:
> 1. Compact representation of ML attributes metadata (but this seems to be a
> bigger change)
> 2. Removal of non-categorical tags from the metadata created by the
> VectorAssembler
> 3. An option on the existing VectorAssembler to skip unnecessary ML
> attributes, or create another transformer altogether
> I would be happy to take a stab at any of these solutions, but I need some
> direction from the Spark community.
> {code:title=VABug.scala |borderStyle=solid}
> import org.apache.spark.SparkConf
> import org.apache.spark.ml.feature.{HashingTF, VectorAssembler}
> import org.apache.spark.sql.SparkSession
>
> object VARepro {
>   case class Record(Label: Double, Feat01: Double, Feat02: Array[String])
>
>   def main(args: Array[String]) {
>     val conf = new SparkConf()
>       .setAppName("Vector assembler bug")
>       .setMaster("local[*]")
>     val spark = SparkSession.builder.config(conf).getOrCreate()
>     import spark.implicits._
>
>     val df = Seq(Record(1.0, 2.0, Array("4daf")), Record(0.0, 3.0, Array("a9ee"))).toDS()
>     val numFeatures = 1000
>     val hashingScheme = new HashingTF().setInputCol("Feat02").setOutputCol("HashedFeat").setNumFeatures(numFeatures)
>     val hashedData = hashingScheme.transform(df)
>     val vectorAssembler = new VectorAssembler().setInputCols(Array("HashedFeat", "Feat01")).setOutputCol("Features")
>     val processedData = vectorAssembler.transform(hashedData).select("Label", "Features")
>     processedData.show()
>   }
> }
> {code}
> *Stacktrace from the example above:*
> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
> at org.apache.spark.ml.attribute.NumericAttribute.copy(attributes.scala:272)
> at org.apache.spark.ml.attribute.NumericAttribute.withIndex(attributes.scala:215)
> at org.apache.spark.ml.attribute.NumericAttribute.withIndex(attributes.scala:195)
> at org.apache.spark.ml.attribute.AttributeGroup$$anonfun$3$$anonfun$apply$1.apply(AttributeGroup.scala:71)
> at org.apache.spark.ml.attribute.AttributeGroup$$anonfun$3$$anonfun$apply$1.apply(AttributeGroup.scala:70)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.IterableLike$class.copyToArray(IterableLike.scala:254)
> at scala.collection.SeqViewLike$AbstractTransformed.copyToArray(SeqViewLike.scala:37)
> at
[jira] [Commented] (SPARK-8515) Improve ML attribute API
[ https://issues.apache.org/jira/browse/SPARK-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194074#comment-16194074 ]

Liang-Chi Hsieh commented on SPARK-8515:
----------------------------------------

I'm working on a new ML attribute API which is supposed to solve issues like SPARK-19141 and SPARK-21926, and maybe SPARK-12886. The basic attribute design is similar to the previous API, but the new API supports sparse vector attributes, so we don't need to keep an attribute for every slot of a Vector column. The new API also trims down some unnecessary parts of the previous API. cc [~mlnick]

> Improve ML attribute API
> ------------------------
>
>                 Key: SPARK-8515
>                 URL: https://issues.apache.org/jira/browse/SPARK-8515
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.4.0
>            Reporter: Xiangrui Meng
>              Labels: advanced
>
> In 1.4.0, we introduced the ML attribute API to embed feature/label attribute
> info inside DataFrame's schema. However, the API is not very friendly to use.
> We should revisit this API and see how we can improve it.
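The dense-versus-sparse distinction in the comment above can be sketched in a few lines of plain Scala. This is an illustrative sketch only; `Attr`, `denseMeta`, and `sparseMeta` are made-up names, not the API proposed in SPARK-8515.pdf. It contrasts per-slot ("dense") attribute metadata, whose size tracks the logical vector length, with "sparse" metadata that stores entries only for slots that actually carry information.

```scala
// Illustrative sketch (not the SPARK-8515 API): dense vs. sparse ML
// attribute metadata for a mostly-empty feature vector.
object AttrSketch {
  final case class Attr(name: String, categorical: Boolean = false)

  // Dense: one metadata entry per slot, even for slots nobody annotated.
  // This is what makes VectorAssembler's metadata scale with vector length.
  def denseMeta(vectorSize: Int): IndexedSeq[Attr] =
    (0 until vectorSize).map(i => Attr(s"HashedFeat_$i"))

  // Sparse: keep only the annotated slots; unmentioned slots are implicitly
  // plain numeric, so metadata size no longer depends on the vector length.
  def sparseMeta(annotated: Map[Int, Attr]): Map[Int, Attr] = annotated

  val vectorSize = 100000  // scaled down from the 10^6 in SPARK-19141
  val dense  = denseMeta(vectorSize)
  val sparse = sparseMeta(Map(100 -> Attr("Feat01", categorical = true)))
}
```

Here the dense form carries one entry per hashed slot while the sparse form keeps a single entry for the one categorical column downstream learners actually need.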
[jira] [Created] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results
Benyi Wang created SPARK-22211: -- Summary: LimitPushDown optimization for FullOuterJoin generates wrong results Key: SPARK-22211 URL: https://issues.apache.org/jira/browse/SPARK-22211 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Environment: on community.cloude.databrick.com Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11) Reporter: Benyi Wang LimitPushDown pushes LocalLimit to one side for FullOuterJoin, but this may generate a wrong result: Assume we use limit(1) and LocalLimit will be pushed to left side, and id=999 is selected, but at right side we have 100K rows including 999, the result will be - one row is (999, 999) - the rest rows are (null, xxx) Once you call show(), the row (999,999) has only 1/10th chance to be selected by CollectLimit. The actual optimization might be, - push down limit - but convert the join to Broadcast LeftOuterJoin or RightOuterJoin. Here is my notebook: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html {code:java} import scala.util.Random._ val dl = shuffle(1 to 10).toDF("id") val dr = shuffle(1 to 10).toDF("id") println("data frame dl:") dl.explain println("data frame dr:") dr.explain val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1) j.explain j.show(false) {code} {code} data frame dl: == Physical Plan == LocalTableScan [id#10] data frame dr: == Physical Plan == LocalTableScan [id#16] == Physical Plan == CollectLimit 1 +- SortMergeJoin [id#10], [id#16], FullOuter :- *Sort [id#10 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(id#10, 200) : +- *LocalLimit 1 :+- LocalTableScan [id#10] +- *Sort [id#16 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(id#16, 200) +- LocalTableScan [id#16] import scala.util.Random._ dl: org.apache.spark.sql.DataFrame = [id: int] dr: org.apache.spark.sql.DataFrame = [id: int] j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = 
[id: int, id: int]
++---+
|id |id |
++---+
|null|148|
++---+
{code}
[jira] [Updated] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception
[ https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael N updated SPARK-22163:
------------------------------
    Description:

The application objects can contain Lists and can be modified dynamically as well. However, the Spark Streaming framework asynchronously serializes the application's objects as the application runs. Therefore, it causes a random run-time exception on a List when the framework happens to serialize the application's objects while the application is modifying a List in its own object. In fact, there are multiple bugs reported about

Caused by: java.util.ConcurrentModificationException
at java.util.ArrayList.writeObject

that are permutations of the same root cause. So the design issue of the Spark Streaming framework is that it does this serialization asynchronously. Instead, it should either:

1. do this serialization synchronously (preferred, as it eliminates the issue completely), or
2. allow it to be configured per application whether to do this serialization synchronously or asynchronously, depending on the nature of each application.

Also, the Spark documentation should describe the conditions that trigger Spark to do this type of serialization asynchronously, so applications can work around them until the fix is provided.

===

Vadim Semenov and Steve Loughran, per your inquiries in ticket https://issues.apache.org/jira/browse/SPARK-21999, I am posting the reply here because this issue involves Spark's design and not necessarily its code implementation.

---

My application does not spin up its own threads. All the threads are controlled by Spark.

Batch interval = 5 seconds

Batch #3
1. Driver - Spark Thread #1 - starts batch #3 and blocks until all slave threads are done with this batch.
2. Slave A - Spark Thread #2. Say it takes 10 seconds to complete.
3. Slave B - Spark Thread #3. Say it takes 1 minute to complete.
4. Both thread #1 for the driver and thread #2 for Slave A do not jump ahead and process batch #4. Instead, they wait for thread #3 until it is done. => So there is already synchronization among the threads within the same batch. Also, batch to batch is synchronous.
5. After Spark Thread #3 is done, the driver does other processing to finish the current batch. In my case, it updates a list of objects.

The above steps repeat for the next batch #4 and subsequent batches.

Based on the exception stack trace, it looks like in step 5, Spark has another thread #4 that serializes application objects asynchronously. So it causes random occurrences of ConcurrentModificationException, because the list of objects is being changed by Spark's own thread #1 for the driver.

So the issue is not that my application "is modifying a collection asynchronously w.r.t. Spark" as Sean kept claiming. Instead, it is Spark's asynchronous operations among its own different threads within the same batch that cause this issue. I understand Spark needs to serialize objects for checkpoint purposes. However, since Spark controls all the threads and their synchronization, it is a design issue on Spark's side, namely the lack of synchronization between threads #1 and #4, that triggers the ConcurrentModificationException. That is the root cause of this issue.

Further, even if the application does not modify its list of objects, in step 5 the driver could be modifying multiple native objects, say two integers. Thread #1 of the driver could have updated integer X, but not yet integer Y, when Spark's thread #4 asynchronously serializes the application objects. So the persisted serialized data does not match the actual data. This results in a permutation of this issue with a false-positive condition where the serialized checkpoint data is only partially correct.
One solution for both issues is to modify Spark's design so that the serialization of application objects by Spark's thread #4 is configurable per application to be either asynchronous or synchronous with Spark's thread #1. That way, it is up to individual applications to decide based on the nature of their business requirements and needed throughput.
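Either option in the proposal amounts to making the checkpoint writer and the mutating driver thread agree on a lock. A sketch of what "synchronous with thread #1" could look like — the lock, class, and method names here are hypothetical, not Spark APIs: the writer snapshots the list under the same lock that guards mutations, so the serialized bytes always reflect a consistent state (both X and Y updated, or neither).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;

public class SyncCheckpointDemo {
    private final Object lock = new Object();
    private final List<Integer> state = new ArrayList<>();

    // Mutation and checkpointing synchronize on the same monitor, so a
    // checkpoint can never observe a half-applied update.
    void update(int x, int y) {
        synchronized (lock) {
            state.add(x);
            state.add(y);
        }
    }

    byte[] checkpoint() throws IOException {
        List<Integer> snapshot;
        synchronized (lock) {
            // Copy under the lock; serialization below cannot race with update().
            snapshot = new ArrayList<>(state);
        }
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(snapshot);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        SyncCheckpointDemo demo = new SyncCheckpointDemo();
        demo.update(1, 2);
        System.out.println("ok=" + (demo.checkpoint().length > 0));
    }
}
```

Holding the lock only long enough to copy the list (rather than for the whole serialization) limits the throughput cost that the asynchronous design presumably tries to avoid.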
[jira] [Commented] (SPARK-21999) ConcurrentModificationException - Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193923#comment-16193923 ] Michael N commented on SPARK-21999: --- Vadim, it relates to serialization. I confirmed that it is due to a list of objects being modified and also serialized by Spark's own different threads in the same batch. The details are posted in ticket https://issues.apache.org/jira/browse/SPARK-22163, because it involves Spark's design and not its code. Steve, I understand your points about checkpointing the application. This is why in ticket https://issues.apache.org/jira/browse/SPARK-22163 I propose making Spark's serialization of application objects configurable to be either asynchronous or synchronous. That way, individual applications can decide based on the nature of their business requirements and needed throughput. > ConcurrentModificationException - Spark Streaming > - > > Key: SPARK-21999 > URL: https://issues.apache.org/jira/browse/SPARK-21999 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Michael N >Priority: Critical > > Hi, > I am using Spark Streaming v2.1.0 with Kafka 0.8. I am getting > ConcurrentModificationException intermittently. When it occurs, Spark does > not honor the specified value of spark.task.maxFailures. Spark aborts the > current batch and fetches the next batch, which results in lost data. Its > exception stack is listed below. > This instance of ConcurrentModificationException is similar to the issue at > https://issues.apache.org/jira/browse/SPARK-17463, which was about > serialization of accumulators in heartbeats. However, my Spark streaming app > does not use accumulators. > The stack trace listed below occurred on the Spark master in the Spark streaming > driver at the time of data loss. > From the line of code in the first stack trace, can you tell which object > Spark was trying to serialize? What is the root cause of this issue?
> Because this issue results in lost data as described above, could you have > this issue fixed ASAP ? > Thanks. > Michael N., > > Stack trace of Spark Streaming driver > ERROR JobScheduler:91: Error generating jobs for time 150522493 ms > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2094) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792) > at > org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37) > at > org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42) > at >
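The failure mode debated in this thread — a collection structurally modified while something else iterates over it, as Java serialization of an ArrayList does internally — can be reproduced deterministically outside Spark. This is an illustrative sketch, not Spark code; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

// Minimal reproduction: ArrayList's iterator is fail-fast, so any structural
// modification while a party is iterating (as ArrayList.writeObject does
// during serialization) throws ConcurrentModificationException.
public class CmeDemo {
    // Returns true if modifying the list mid-iteration raises
    // ConcurrentModificationException.
    static boolean modifyWhileIterating() {
        List<Integer> state = new ArrayList<>(List.of(1, 2, 3));
        try {
            for (Integer ignored : state) {
                state.add(99); // structural modification mid-iteration
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("CME raised: " + modifyWhileIterating());
    }
}
```

The same exception fires when the iterator and the mutation live on different threads, which is the intermittent case described in the ticket; the single-threaded version above merely makes it reproducible on demand.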
[jira] [Commented] (SPARK-18131) Support returning Vector/Dense Vector from backend
[ https://issues.apache.org/jira/browse/SPARK-18131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193916#comment-16193916 ] Xiao Li commented on SPARK-18131: - cc [~WeichenXu123] > Support returning Vector/Dense Vector from backend > -- > > Key: SPARK-18131 > URL: https://issues.apache.org/jira/browse/SPARK-18131 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Miao Wang > > For `spark.logit`, there is a `probabilityCol`, which is a vector in the > backend (scala side). When we do collect(select(df, "probabilityCol")), > backend returns the java object handle (memory address). We need to implement > a method to convert a Vector/Dense Vector column as R vector, which can be > read in SparkR. It is a followup JIRA of adding `spark.logit`. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception
[ https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193907#comment-16193907 ] Michael N commented on SPARK-22163: --- Vadim Semenov and Steve Loughran, per your inquiries in ticket https://issues.apache.org/jira/browse/SPARK-21999, I am posting the reply here because this issue involves Spark's design and not necessarily its code implementation. --- My application does not spin up its own threads. All the threads are controlled by Spark. Batch interval = 5 seconds Batch #3 1. Driver - Spark Thread #1 - starts batch #3 and blocks until all slave threads are done with this batch 2. Slave A - Spark Thread #2 takes 10 seconds to complete 3. Slave B - Spark Thread #3 takes 1 minute to complete 4. Both thread #1 for the driver and thread #2 for Slave A do not jump ahead and process batch #4. Instead, they synchronize with thread B until it is done. => So there is synchronization among the threads within the same batch, and batch-to-batch processing is also synchronous. 5. After Spark Thread #3 is done, the driver does other processing to finish the current batch. In my case, it updates a list of objects. The above steps repeat for the next batch #4 and subsequent batches. Based on the exception stack trace, it looks like in step 5, Spark has another thread #4 that serializes application objects asynchronously, so it causes random occurrences of ConcurrentModificationException, because the list of objects is being changed by Spark Thread #1, the driver. So the issue is not that my application "is modifying a collection asynchronously w.r.t. Spark" as Sean kept claiming. It is Spark's asynchronous operations among its own different threads. I understand Spark needs to serialize objects for checkpointing purposes. However, since Spark controls all the threads and their synchronization, the lack of synchronization between threads #1 and #4, which triggers ConcurrentModificationException, is a Spark design issue. 
Further, even if the application does not modify its list, in step 5 the driver could be modifying multiple native objects, say two integers. The driver could have updated integer X and, before it could update integer Y, Spark's thread #4 asynchronously serializes the application objects. So the persisted serialized data does not match the actual data. So a permutation of this issue is a false-positive condition with partially correct data. One solution is to modify Spark's design so that the serialization of application objects by Spark's thread #4 is configurable per application to be either asynchronous or synchronous. That way, it is up to individual applications to decide based on the nature of their business requirements. > Design Issue of Spark Streaming that Causes Random Run-time Exception > - > > Key: SPARK-22163 > URL: https://issues.apache.org/jira/browse/SPARK-22163 > Project: Spark > Issue Type: Bug > Components: DStreams, Structured Streaming >Affects Versions: 2.2.0 > Environment: Spark Streaming > Kafka > Linux >Reporter: Michael N > > The application objects can contain Lists and can be modified dynamically as > well. However, the Spark Streaming framework asynchronously serializes the > application's objects as the application runs. Therefore, it causes random > run-time exceptions on a List when the Spark Streaming framework happens to > serialize the application's objects while the application modifies a List in > its own object. > In fact, there are multiple bugs reported about > Caused by: java.util.ConcurrentModificationException > at java.util.ArrayList.writeObject > that are permutations of the same root cause. So the design issue of the Spark > streaming framework is that it does this serialization asynchronously. > Instead, it should either > 1. do this serialization synchronously. This is preferred to eliminate the > issue completely. Or > 2. 
Allow it to be configured per application whether to do this serialization > synchronously or asynchronously, depending on the nature of each application. > Also, Spark documentation should describe the conditions that trigger Spark > to do this type of serialization asynchronously, so the applications can work > around them until the fix is provided. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
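The "synchronous" option proposed in this ticket can be sketched generically: instead of a background thread serializing live application state, the owner takes a snapshot under the same lock that guards mutation and hands the stable copy to the serializer. The names (`AppState`, `snapshotBytes`) are illustrative, not Spark APIs:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of snapshot-then-serialize. Because snapshot() holds the
// same monitor as add(), the serializer can never observe a half-updated list,
// which removes both the ConcurrentModificationException and the
// partially-updated-state ("integer X but not Y") hazards described above.
public class SnapshotSerialize {
    static final class AppState {
        private final List<Integer> items = new ArrayList<>();
        synchronized void add(int v) { items.add(v); }
        // Copy under the lock, serialize the copy outside it.
        synchronized List<Integer> snapshot() { return new ArrayList<>(items); }
    }

    static byte[] snapshotBytes(AppState s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(s.snapshot()); // the stable copy, not live state
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        AppState s = new AppState();
        s.add(1);
        s.add(2);
        System.out.println(snapshotBytes(s).length + " bytes serialized");
    }
}
```

The trade-off is the one the ticket acknowledges: the copy happens on the batch's critical path, which costs throughput, hence the proposal to make the mode configurable per application.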
[jira] [Commented] (SPARK-18131) Support returning Vector/Dense Vector from backend
[ https://issues.apache.org/jira/browse/SPARK-18131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193891#comment-16193891 ] Miao Wang commented on SPARK-18131: --- [~felixcheung] We got stuck on the data type definitions. There is a dependency cycle for the ML data types. A long time ago, I talked with [~smilegator] about this issue and he suggested moving the data types from the ML module to the parent module. But we didn't discuss details and there was no design document either. We need some updates on this issue to add this support in R. > Support returning Vector/Dense Vector from backend > -- > > Key: SPARK-18131 > URL: https://issues.apache.org/jira/browse/SPARK-18131 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Miao Wang > > For `spark.logit`, there is a `probabilityCol`, which is a vector in the > backend (Scala side). When we do collect(select(df, "probabilityCol")), the > backend returns the Java object handle (memory address). We need to implement > a method to convert a Vector/Dense Vector column to an R vector, which can be > read in SparkR. It is a follow-up JIRA of adding `spark.logit`. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21999) ConcurrentModificationException - Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193840#comment-16193840 ] Michael N edited comment on SPARK-21999 at 10/5/17 10:33 PM: - Steve, I'd keep personal opinions of another person separate from assessment of his professional work, even for open source projects. I understand that contributors to Apache open source projects are not paid. However, they should still follow best practices and do due diligence on tickets as they would for their paid jobs. The issue is that Sean used his position with Spark to keep closing tickets that he either does not understand fully or does not have the answers for. Note that there are many other closed tickets with different permutations of ConcurrentModificationException that other people submitted. So this ticket is not an isolated instance. Imagine he also does the same thing for tickets with his paid job at an employer. Would they let him do that ? For instance, for your paid project with your employer, you open a ticket for an issue you encounter, and he keeps closing it without understanding the issue or providing the answers to the questions in the ticket. Then he tries to block you by asking JIRA to stop you from posting the ticket. Would you be OK with that ? About your other points, I already modified my code to get around this issue. However, for the longer term, I think the Spark design that asynchronously serializes application objects for streaming applications that run continuously from batch to batch should be changed. That was why I created ticket https://issues.apache.org/jira/browse/SPARK-22163 so it could be discussed there. I am open to hearing explanations as to why the current design was done the way it is, which was why I posted questions about 1. In the first place, why does Spark serialize the application objects asynchronously while the streaming application is running continuously from batch to batch ? 2. 
If Spark needs to do this type of serialization at all, why does it not do it at the end of the batch synchronously ? But Sean did not provide the answers and instead just kept closing that ticket. If he does not know the answers or information for tickets, he should let someone else who has such information answer them. was (Author: michaeln_apache): Steve, I'd keep personal opinions of another person separate from assessment of his professional work even for open source projects. I understand that contributors to Apache open source projects are not paid. However, they should still follow best practices and do due diligence on tickets as if they do for their paid jobs. The issue is that Sean used his position with Spark to keep closing tickets that he either does not understand fully or does not have the answers for. Note that there are many other closed tickets with different permutation of ConcurrentModificationException that other people submitted. So this ticket is not an isolated instance. Image he also does the same thing for tickets with his paid job at an employer. Would they let him do that ? For instance, for your paid project with your employer, you open a ticket for an issue you encounter, and he keeps closing it without understanding the issue or providing the answers to the questions in the ticket. Then he tries to block you by asking JIRA to stop you from posting the ticket. Would you be Ok with that ? About your other points, I already modified my code to get around this issue. However, for the longer term, I think for the Spark design that asynchronously serializes application objects for stream applications that runs continuously from batch to batch, that design should be changed. That was why I created ticket https://issues.apache.org/jira/browse/SPARK-22163 so it could be discussed there. I am open to hearing explanations as to why the current design was done the way it is, which was why I posted questions about 1. 
In the first place, why does Spark serialize the application objects asynchronously while the streaming application is running continuously from batch to batch ? 2. If Spark needs to do this type of serialization at all, why does it not do at the end of the batch synchronously ? But Sean did not provide the answers and instead just closed that ticket. > ConcurrentModificationException - Spark Streaming > - > > Key: SPARK-21999 > URL: https://issues.apache.org/jira/browse/SPARK-21999 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Michael N >Priority: Critical > > Hi, > I am using Spark Streaming v2.1.0 with Kafka 0.8. I am getting > ConcurrentModificationException intermittently. When it
[jira] [Commented] (SPARK-22195) Add cosine similarity to org.apache.spark.ml.linalg.Vectors
[ https://issues.apache.org/jira/browse/SPARK-22195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193844#comment-16193844 ] yuhao yang commented on SPARK-22195: Thanks for the feedback. I don't think the existing implementations (RowMatrix or Word2Vec) can fulfill the two scenarios: 1. Compute cosine similarity between two arbitrary vectors. 2. Compute cosine similarity between one vector and a group of other vectors (usually candidates). And again, not everyone using Spark ML knows how to implement cosine similarity. > Add cosine similarity to org.apache.spark.ml.linalg.Vectors > --- > > Key: SPARK-22195 > URL: https://issues.apache.org/jira/browse/SPARK-22195 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Priority: Minor > > https://en.wikipedia.org/wiki/Cosine_similarity: > As one of the most important measures of similarity, I have found it quite useful in some > image and NLP applications in my personal experience. > Suggest adding a function for cosine similarity to > org.apache.spark.ml.linalg.Vectors. > Interface: > def cosineSimilarity(v1: Vector, v2: Vector): Double = ... > def cosineSimilarity(v1: Vector, v2: Vector, norm1: Double, norm2: Double): > Double = ... > Appreciate suggestions and need a green light from committers. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
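The computation behind the proposed Scala interface is small; here is a sketch transliterated to plain Java arrays for illustration (not the Spark API): cos(v1, v2) = dot(v1, v2) / (||v1|| · ||v2||), with an overload taking precomputed norms for candidate-scoring loops (scenario 2 above):

```java
// Illustrative cosine-similarity helper; method names mirror the proposed
// Scala signatures but operate on raw double[] instead of ml.linalg.Vector.
public class Cosine {
    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // cos(v1, v2) = dot(v1, v2) / (||v1|| * ||v2||)
    static double cosineSimilarity(double[] v1, double[] v2) {
        return dot(v1, v2) / (Math.sqrt(dot(v1, v1)) * Math.sqrt(dot(v2, v2)));
    }

    // Overload for callers with precomputed norms, so scoring many candidates
    // against one query vector avoids recomputing the query's norm each time.
    static double cosineSimilarity(double[] v1, double[] v2, double norm1, double norm2) {
        return dot(v1, v2) / (norm1 * norm2);
    }

    public static void main(String[] args) {
        double[] a = {1, 0};
        double[] b = {2, 0};
        System.out.println(cosineSimilarity(a, b)); // parallel vectors
    }
}
```

A production version would also handle sparse vectors and zero-norm inputs (where the quotient is undefined), which is presumably part of what the committers would weigh in on.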
[jira] [Commented] (SPARK-21999) ConcurrentModificationException - Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193840#comment-16193840 ] Michael N commented on SPARK-21999: --- Steve, I'd keep personal opinions of another person separate from assessment of his professional work, even for open source projects. I understand that contributors to Apache open source projects are not paid. However, they should still follow best practices and do due diligence on tickets as they would for their paid jobs. The issue is that Sean used his position with Spark to keep closing tickets that he either does not understand fully or does not have the answers for. Note that there are many other closed tickets with different permutations of ConcurrentModificationException that other people submitted. So this ticket is not an isolated instance. Imagine he also does the same thing for tickets with his paid job at an employer. Would they let him do that ? For instance, for your paid project with your employer, you open a ticket for an issue you encounter, and he keeps closing it without understanding the issue or providing the answers to the questions in the ticket. Then he tries to block you by asking JIRA to stop you from posting the ticket. Would you be OK with that ? About your other points, I already modified my code to get around this issue. However, for the longer term, I think the Spark design that asynchronously serializes application objects for streaming applications that run continuously from batch to batch should be changed. That was why I created ticket https://issues.apache.org/jira/browse/SPARK-22163 so it could be discussed there. I am open to hearing explanations as to why the current design was done the way it is, which was why I posted questions about 1. In the first place, why does Spark serialize the application objects asynchronously while the streaming application is running continuously from batch to batch ? 2. 
If Spark needs to do this type of serialization at all, why does it not do it at the end of the batch synchronously ? But Sean did not provide the answers and instead just closed that ticket. > ConcurrentModificationException - Spark Streaming > - > > Key: SPARK-21999 > URL: https://issues.apache.org/jira/browse/SPARK-21999 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Michael N >Priority: Critical > > Hi, > I am using Spark Streaming v2.1.0 with Kafka 0.8. I am getting > ConcurrentModificationException intermittently. When it occurs, Spark does > not honor the specified value of spark.task.maxFailures. So Spark aborts the > current batch and fetches the next batch, resulting in lost data. Its > exception stack is listed below. > This instance of ConcurrentModificationException is similar to the issue at > https://issues.apache.org/jira/browse/SPARK-17463, which was about > serialization of accumulators in heartbeats. However, my Spark stream app > does not use accumulators. > The stack trace listed below occurred on the Spark master in the Spark streaming > driver at the time of data loss. > From the line of code in the first stack trace, can you tell which object > Spark was trying to serialize ? What is the root cause of this issue ? > Because this issue results in lost data as described above, could you have > this issue fixed ASAP ? > Thanks. 
> Michael N., > > Stack trace of Spark Streaming driver > ERROR JobScheduler:91: Error generating jobs for time 150522493 ms > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2094) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792) > at > org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37) > at > org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37) > at scala.Option.map(Option.scala:146) > at >
[jira] [Comment Edited] (SPARK-21742) BisectingKMeans generate different models with/without caching
[ https://issues.apache.org/jira/browse/SPARK-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193691#comment-16193691 ] Ilya Matiach edited comment on SPARK-21742 at 10/5/17 9:12 PM: --- [~podongfeng] The test was just validating that the edge case was hit, even if it fails the algorithm may be fine. For bisecting k-means generating 2 or 3 clusters is fine, please see documentation here: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/BisectingKMeans.html Specifically: param: k the desired number of leaf clusters (default: 4). The actual number could be smaller if there are no divisible leaf clusters. The fact that the test is failing means that caching the dataset is slightly changing the data representation, either the ordering of the rows or the exact values, in which case k-means may not be hitting the edge case in the test where there are no divisible leaf clusters. This is totally fine, it just means that you shouldn't be writing such a test, or you should find a slightly different cached dataset that does hit the issue to validate that the bug is indeed fixed and bisecting k-means returns fewer than k clusters but does not error out (which it was incorrectly doing previously - failing with a cryptic error message). was (Author: imatiach): [~podongfeng] The test was just validating that the edge case was hit, even if it fails the algorithm may be fine. For bisecting k-means generating 2 or 3 clusters is fine, please see documentation here: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/BisectingKMeans.html Specifically: param: k the desired number of leaf clusters (default: 4). The actual number could be smaller if there are no divisible leaf clusters. 
The fact that the test is failing means that caching the dataset is slightly changing the data representation, either the ordering of the rows or the exact values, in which case k-means may not be hitting the edge case in the test where there are no divisible leaf clusters. This is totally fine; it just means that you shouldn't be writing such a test, or you should find a slightly different cached dataset that does hit the issue, to validate that the bug is indeed fixed and bisecting k-means returns fewer than k clusters but does not error out (which it was incorrectly doing previously - failing with a cryptic error message). > BisectingKMeans generate different models with/without caching > -- > > Key: SPARK-21742 > URL: https://issues.apache.org/jira/browse/SPARK-21742 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: zhengruifeng > > I found that {{BisectingKMeans}} will generate different models depending on > whether the input is cached. > Using the same dataset in {{BisectingKMeansSuite}}, we can find that if we > cache the input, then the number of centers will change from 2 to 3. > So it looks like a potential bug. 
> {code} > import org.apache.spark.ml.param.ParamMap > import org.apache.spark.sql.Dataset > import org.apache.spark.ml.clustering._ > import org.apache.spark.ml.linalg._ > import scala.util.Random > case class TestRow(features: org.apache.spark.ml.linalg.Vector) > val rows = 10 > val dim = 1000 > val seed = 42 > val nnz = 130 > val bkm = new > BisectingKMeans().setK(5).setMinDivisibleClusterSize(4).setMaxIter(4).setSeed(123) > val random = new Random(seed) > val rdd = sc.parallelize(1 to rows).map(i => Vectors.sparse(dim, > random.shuffle(0 to dim - 1).slice(0, nnz).sorted.toArray, > Array.fill(nnz)(random.nextDouble()))).map(v => new TestRow(v)) > val sparseDataset = spark.createDataFrame(rdd) > scala> bkm.fit(sparseDataset).clusterCenters > 17/08/16 17:12:28 WARN BisectingKMeans: The input RDD 579 is not directly > cached, which may hurt performance if its parent RDDs are also not cached. > res22: Array[org.apache.spark.ml.linalg.Vector] = > Array([0.0,0.0,0.0,0.0,0.0,0.0,0.3081569145071915,0.0,0.0,0.0,0.0,0.1875176493190393,0.0,0.0,0.0,0.33856517726920116,0.0,0.15290274761955236,0.0,0.10820818064086901,0.0,0.0,0.5987249128746422,0.0,0.0,0.3563390364518392,0.0,0.5019914247361699,0.0,0.08711412551574785,0.09199053071837167,0.05749771404790841,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5209441786832834,0.0,0.2350595158678447,0.0,0.0,0.0,0.0,0.0,0.0,0.3041334669892575,0.0,0.0,0.32422664760898434,0.0,0.2454271812974,0.0,0.0,0.06846136418797384,0.0,0.0,0.19556839035017104,0.0,0.0,0.08436120694800427,0.0,0.0,0.0,0.30542501045554465,0.0,0.0,0.0,0.16185204843664616,0.2800921624973247,0.0,0.45459861318444555,0.0,0.0,0.0,0.26222502250076374,0.5235099131919367,0.0,0.0,0 > scala> bkm.fit(sparseDataset).clusterCenters.length > 17/08/16 17:12:36 WARN
[jira] [Commented] (SPARK-21742) BisectingKMeans generate different models with/without caching
[ https://issues.apache.org/jira/browse/SPARK-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193691#comment-16193691 ] Ilya Matiach commented on SPARK-21742: -- [~podongfeng] The test was just validating that the edge case was hit; even if it fails, the algorithm may be fine. For bisecting k-means, generating 2 or 3 clusters is fine, please see the documentation here: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/BisectingKMeans.html Specifically: param: k the desired number of leaf clusters (default: 4). The actual number could be smaller if there are no divisible leaf clusters. The fact that the test is failing means that caching the dataset is slightly changing the data representation, either the ordering of the rows or the exact values, in which case k-means may not be hitting the edge case in the test where there are no divisible leaf clusters. This is totally fine; it just means that you shouldn't be writing such a test, or you should find a slightly different cached dataset that does hit the issue, to validate that the bug is indeed fixed and bisecting k-means returns fewer than k clusters but does not error out (which it was incorrectly doing previously - failing with a cryptic error message). > BisectingKMeans generate different models with/without caching > -- > > Key: SPARK-21742 > URL: https://issues.apache.org/jira/browse/SPARK-21742 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: zhengruifeng > > I found that {{BisectingKMeans}} will generate different models depending on > whether the input is cached. > Using the same dataset in {{BisectingKMeansSuite}}, we can find that if we > cache the input, then the number of centers will change from 2 to 3. > So it looks like a potential bug. 
> {code} > import org.apache.spark.ml.param.ParamMap > import org.apache.spark.sql.Dataset > import org.apache.spark.ml.clustering._ > import org.apache.spark.ml.linalg._ > import scala.util.Random > case class TestRow(features: org.apache.spark.ml.linalg.Vector) > val rows = 10 > val dim = 1000 > val seed = 42 > val nnz = 130 > val bkm = new > BisectingKMeans().setK(5).setMinDivisibleClusterSize(4).setMaxIter(4).setSeed(123) > val random = new Random(seed) > val rdd = sc.parallelize(1 to rows).map(i => Vectors.sparse(dim, > random.shuffle(0 to dim - 1).slice(0, nnz).sorted.toArray, > Array.fill(nnz)(random.nextDouble()))).map(v => new TestRow(v)) > val sparseDataset = spark.createDataFrame(rdd) > scala> bkm.fit(sparseDataset).clusterCenters > 17/08/16 17:12:28 WARN BisectingKMeans: The input RDD 579 is not directly > cached, which may hurt performance if its parent RDDs are also not cached. > res22: Array[org.apache.spark.ml.linalg.Vector] = > Array([0.0,0.0,0.0,0.0,0.0,0.0,0.3081569145071915,0.0,0.0,0.0,0.0,0.1875176493190393,0.0,0.0,0.0,0.33856517726920116,0.0,0.15290274761955236,0.0,0.10820818064086901,0.0,0.0,0.5987249128746422,0.0,0.0,0.3563390364518392,0.0,0.5019914247361699,0.0,0.08711412551574785,0.09199053071837167,0.05749771404790841,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5209441786832834,0.0,0.2350595158678447,0.0,0.0,0.0,0.0,0.0,0.0,0.3041334669892575,0.0,0.0,0.32422664760898434,0.0,0.2454271812974,0.0,0.0,0.06846136418797384,0.0,0.0,0.19556839035017104,0.0,0.0,0.08436120694800427,0.0,0.0,0.0,0.30542501045554465,0.0,0.0,0.0,0.16185204843664616,0.2800921624973247,0.0,0.45459861318444555,0.0,0.0,0.0,0.26222502250076374,0.5235099131919367,0.0,0.0,0 > scala> bkm.fit(sparseDataset).clusterCenters.length > 17/08/16 17:12:36 WARN BisectingKMeans: The input RDD 667 is not directly > cached, which may hurt performance if its parent RDDs are also not cached. 
> res23: Int = 2 > scala> sparseDataset.persist() > res24: sparseDataset.type = [features: vector] > scala> bkm.fit(sparseDataset).clusterCenters > 17/08/16 17:14:35 WARN BisectingKMeans: The input RDD 806 is not directly > cached, which may hurt performance if its parent RDDs are also not cached. > res26: Array[org.apache.spark.ml.linalg.Vector] = >
[jira] [Commented] (SPARK-16473) BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key not found
[ https://issues.apache.org/jira/browse/SPARK-16473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193675#comment-16193675 ] Ilya Matiach commented on SPARK-16473: -- [~podongfeng] interesting - it looks like the dataset representation is somehow changing when it is cached? My guess is that the row order may be changing or the numeric values may be changing? The test failure itself is ok if the number of clusters is equal to k (which is actually perfectly fine for the algorithm); it just means that the dataset was not generated correctly to hit the very special edge case I was looking for, where one cluster is empty after a split in bisecting k-means. I can't seem to see the test failure error message in your PR; could you run another build and post it here? We may need to add some debugging/print statements everywhere to determine how the data is changing when you cache it - this doesn't mean there is any bug in the algorithm, it just means the test needs to be changed so that the test data, even after caching, is the same as the original one. > BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key > not found > -- > > Key: SPARK-16473 > URL: https://issues.apache.org/jira/browse/SPARK-16473 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.6.1, 2.0.0 > Environment: AWS EC2 linux instance. >Reporter: Alok Bhandari >Assignee: Ilya Matiach > Fix For: 2.1.1, 2.2.0 > > > Hello, > I am using Apache Spark 1.6.1. > I am executing the bisecting k-means algorithm on a specific dataset. > Dataset details :- > K=100, > input vector =100K*100k > Memory assigned 16GB per node , > number of nodes =2. > Up to K=75 it is working fine, but when I set K=100, it fails with > java.util.NoSuchElementException: key not found. 
> *I suspect it is failing because of lack of some resources , but somehow > exception does not convey anything as why this spark job failed.* > Please can someone point me to root cause of this exception , why it is > failing. > This is the exception stack-trace:- > {code} > java.util.NoSuchElementException: key not found: 166 > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:58) > at scala.collection.MapLike$class.apply(MapLike.scala:141) > at scala.collection.AbstractMap.apply(Map.scala:58) > at > org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply$mcDJ$sp(BisectingKMeans.scala:338) > at > org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337) > at > org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337) > at > scala.collection.TraversableOnce$$anonfun$minBy$1.apply(TraversableOnce.scala:231) > > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) > > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:125) > > at scala.collection.immutable.List.reduceLeft(List.scala:84) > at > scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:231) > at scala.collection.AbstractTraversable.minBy(Traversable.scala:105) > at > org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:337) > > at > 
org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:334) > > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389) > {code} > Issue is that , it is failing but not giving any explicit message as to why > it failed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
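The `key not found` failure in the stack trace above is Scala's `Map.apply`, which throws on a missing key. A minimal analogy in Python terms (toy data and hypothetical function names, not Spark's code): a strict lookup raises exactly like the trace, while a lookup with a fallback keeps a point in its current cluster when the divided child cluster is missing.

```python
# Toy mapping of cluster index -> child cluster index (hypothetical data).
divisible = {2: 4, 3: 6}

def reassign_strict(index):
    # Like Scala's Map.apply: raises if the key was never assigned.
    return divisible[index]

def reassign_defensive(index):
    # Fallback lookup: a point whose cluster was not divided keeps its
    # current assignment instead of crashing the job.
    return divisible.get(index, index)

print(reassign_defensive(2))    # 4: divided cluster, move to the child
print(reassign_defensive(166))  # 166: missing key keeps its assignment
```

This only illustrates the failure mode; the actual fix in the algorithm must also decide what a correct assignment is for the missing key.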
[jira] [Created] (SPARK-22210) Online LDA variationalTopicInference should use random seed to have stable behavior
yuhao yang created SPARK-22210: -- Summary: Online LDA variationalTopicInference should use random seed to have stable behavior Key: SPARK-22210 URL: https://issues.apache.org/jira/browse/SPARK-22210 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.2.0 Reporter: yuhao yang Priority: Minor https://github.com/apache/spark/blob/16fab6b0ef3dcb33f92df30e17680922ad5fb672/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L582 Gamma distribution should use random seed to have consistent behavior. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
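The proposed change amounts to threading an explicit seed into the Gamma draw. A minimal stdlib sketch (not Spark's breeze-based sampler; the function name is hypothetical, and the Gamma(100, 1/100) shape/scale mirrors the initialization used by online LDA) of why seeding gives stable behavior across runs:

```python
import random

def variational_topic_init(k, vocab_size, seed):
    # A seeded generator makes every draw reproducible: two runs with the
    # same seed produce the same k x vocab_size topic matrix.
    rng = random.Random(seed)
    return [[rng.gammavariate(100.0, 1.0 / 100.0) for _ in range(vocab_size)]
            for _ in range(k)]

a = variational_topic_init(3, 5, seed=42)
b = variational_topic_init(3, 5, seed=42)
print(a == b)  # True: same seed, identical initialization
```

Without the seed parameter, repeated runs draw from global generator state and the inference result is not reproducible, which is what makes tests flaky.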
[jira] [Updated] (SPARK-22188) Add defense against Cross-Site Scripting, MIME-sniffing and MitM attack
[ https://issues.apache.org/jira/browse/SPARK-22188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22188: -- Affects Version/s: 2.0.2 2.1.1 > Add defense against Cross-Site Scripting, MIME-sniffing and MitM attack > --- > > Key: SPARK-22188 > URL: https://issues.apache.org/jira/browse/SPARK-22188 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.2, 2.1.1, 2.2.0 >Reporter: Krishna Pandey >Priority: Minor > Labels: security > > Below HTTP Response headers can be added to improve security. > The HTTP *Strict-Transport-Security* response header (often abbreviated as > HSTS) is a security feature that lets a web site tell browsers that it should > only be communicated with using HTTPS, instead of using HTTP. > *Note:* The Strict-Transport-Security header is ignored by the browser when > your site is accessed using HTTP; this is because an attacker may intercept > HTTP connections and inject the header or remove it. When your site is > accessed over HTTPS with no certificate errors, the browser knows your site > is HTTPS capable and will honor the Strict-Transport-Security header. > *An example scenario* > You log into a free WiFi access point at an airport and start surfing the > web, visiting your online banking service to check your balance and pay a > couple of bills. Unfortunately, the access point you're using is actually a > hacker's laptop, and they're intercepting your original HTTP request and > redirecting you to a clone of your bank's site instead of the real thing. Now > your private data is exposed to the hacker. > Strict Transport Security resolves this problem; as long as you've accessed > your bank's web site once using HTTPS, and the bank's web site uses Strict > Transport Security, your browser will know to automatically use only HTTPS, > which prevents hackers from performing this sort of man-in-the-middle attack. 
> *Syntax:* > Strict-Transport-Security: max-age=<expire-time> > Strict-Transport-Security: max-age=<expire-time>; includeSubDomains > Strict-Transport-Security: max-age=<expire-time>; preload > Read more at > https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Strict-Transport-Security > The HTTP *X-XSS-Protection* response header is a feature of Internet > Explorer, Chrome and Safari that stops pages from loading when they detect > reflected cross-site scripting (XSS) attacks. > *Syntax:* > X-XSS-Protection: 0 > X-XSS-Protection: 1 > X-XSS-Protection: 1; mode=block > X-XSS-Protection: 1; report=<reporting-uri> > Read more at > https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-XSS-Protection > The HTTP *X-Content-Type-Options* response header is used to protect against > MIME sniffing vulnerabilities. These vulnerabilities can occur when a website > allows users to upload content to a website however the user disguises a > particular file type as something else. This can give them the opportunity to > perform cross-site scripting and compromise the website. Read more at > https://www.keycdn.com/support/x-content-type-options/ and > https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Content-Type-Options -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
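A minimal sketch of how such headers can be attached to every response, written as a generic WSGI middleware for illustration only (Spark's UI is Jetty-based and would set these in a servlet filter; the header values below are example defaults, not proposed Spark settings):

```python
# Illustrative security headers; max-age and mode values are examples.
SECURITY_HEADERS = [
    ("Strict-Transport-Security", "max-age=31536000; includeSubDomains"),
    ("X-XSS-Protection", "1; mode=block"),
    ("X-Content-Type-Options", "nosniff"),
]

def add_security_headers(app):
    """Wrap a WSGI app so every response carries the headers above."""
    def wrapped(environ, start_response):
        def start(status, headers, exc_info=None):
            return start_response(status, headers + SECURITY_HEADERS, exc_info)
        return app(environ, start)
    return wrapped

# Tiny demo app to show the middleware in action.
def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]

captured = []
def collect(status, headers, exc_info=None):
    captured.extend(headers)

body = add_security_headers(demo_app)({}, collect)
print(("X-Content-Type-Options", "nosniff") in captured)  # True
```

The same pattern (wrap the response path once, append a fixed header list) carries over to a Jetty `Filter` adding headers via `HttpServletResponse.setHeader`.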
[jira] [Created] (SPARK-22209) PySpark does not recognize imports from submodules
Joel Croteau created SPARK-22209: Summary: PySpark does not recognize imports from submodules Key: SPARK-22209 URL: https://issues.apache.org/jira/browse/SPARK-22209 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.2.0 Environment: Anaconda 4.4.0, Python 3.6, Hadoop 2.7, CDH 5.3.3, JDK 1.8, Centos 6 Reporter: Joel Croteau Priority: Minor Using submodule syntax inside a PySpark job seems to create issues. For example, the following: {code:python} import scipy.sparse from pyspark import SparkContext, SparkConf def do_stuff(x): y = scipy.sparse.dok_matrix((1, 1)) y[0, 0] = x return y[0, 0] def init_context(): conf = SparkConf().setAppName("Spark Test") sc = SparkContext(conf=conf) return sc def main(): sc = init_context() data = sc.parallelize([1, 2, 3, 4]) output_data = data.map(do_stuff) print(output_data.collect()) __name__ == '__main__' and main() {code} produces this error: {noformat} Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714) at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at org.apache.spark.rdd.RDD.collect(RDD.scala:935) at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:458) at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/home/matt/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 177, in main process() File 
"/home/matt/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 172, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/home/matt/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream vs = list(itertools.islice(iterator, batch)) File "/home/jcroteau/is/pel_selection/test_sparse.py", line 6, in dostuff y = scipy.sparse.dok_matrix((1, 1)) AttributeError: module 'scipy' has no attribute 'sparse' at
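The mechanism is plain Python rather than anything Spark-specific: when the serialized closure is resolved on a worker, only the top-level package gets imported, so the `sparse` attribute is missing. A small stdlib demonstration using `xml` as a stand-in for `scipy` (each probe runs in a fresh interpreter); the usual workaround is `from scipy import sparse`, or importing the submodule inside the mapped function:

```python
import subprocess
import sys

# In a fresh interpreter, importing only the top-level package does NOT
# make its submodules available as attributes -- the situation the worker
# is in when the closure references submodules through the bare package.
probe = "import xml; print(hasattr(xml, 'etree'))"
out = subprocess.run([sys.executable, "-c", probe],
                     capture_output=True, text=True)
print(out.stdout.strip())  # False

# Binding the submodule to its own name imports it explicitly, so the
# reference is self-contained (the `from scipy import sparse` fix).
fix = "from xml import etree; print(etree.__name__)"
out2 = subprocess.run([sys.executable, "-c", fix],
                      capture_output=True, text=True)
print(out2.stdout.strip())  # xml.etree
```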
[jira] [Updated] (SPARK-22200) Kinesis Receivers stops if Kinesis stream was re-sharded
[ https://issues.apache.org/jira/browse/SPARK-22200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-22200: - Component/s: (was: Spark Core) DStreams > Kinesis Receivers stops if Kinesis stream was re-sharded > > > Key: SPARK-22200 > URL: https://issues.apache.org/jira/browse/SPARK-22200 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Alex Mikhailau >Priority: Critical > > Seeing > Cannot find the shard given the shardId shardId-4454 > Cannot get the shard for this ProcessTask, so duplicate KPL user records in > the event of resharding will not be dropped during deaggregation of Amazon > Kinesis records. > after Kinesis stream re-sharding and receivers stop working altogether. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21926) Some transformers in spark.ml.feature fail when trying to transform streaming dataframes
[ https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193392#comment-16193392 ] Bago Amirbekian commented on SPARK-21926: - [~mslipper] The trickiest thing about 1 (b) is knowing how to test that it won't change behaviour. I'd like to run this past some folks with more MLlib experience to see if there are any obvious issues with this approach that we haven't considered. > Some transformers in spark.ml.feature fail when trying to transform streaming > dataframes > > > Key: SPARK-21926 > URL: https://issues.apache.org/jira/browse/SPARK-21926 > Project: Spark > Issue Type: Bug > Components: ML, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian > > We've run into a few cases where ML components don't play nice with streaming > dataframes (for prediction). This ticket is meant to help aggregate these > known cases in one place and provide a place to discuss possible fixes. > Failing cases: > 1) VectorAssembler where one of the inputs is a VectorUDT column with no > metadata. > Possible fixes: > a) Re-design vectorUDT metadata to support missing metadata for some > elements. (This might be a good thing to do anyways SPARK-19141) > b) drop metadata in streaming context. > 2) OneHotEncoder where the input is a column with no metadata. > Possible fixes: > a) Make OneHotEncoder an estimator (SPARK-13030). > b) Allow user to set the cardinality of OneHotEncoder. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22077) RpcEndpointAddress fails to parse spark URL if it is an ipv6 address.
[ https://issues.apache.org/jira/browse/SPARK-22077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193391#comment-16193391 ] Sayat Satybaldiyev commented on SPARK-22077: yes, when I enclose IPv6 with square brackets then URL parses without any errors. > RpcEndpointAddress fails to parse spark URL if it is an ipv6 address. > - > > Key: SPARK-22077 > URL: https://issues.apache.org/jira/browse/SPARK-22077 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.0.0 >Reporter: Eric Vandenberg >Priority: Minor > > RpcEndpointAddress fails to parse spark URL if it is an ipv6 address. > For example, > sparkUrl = "spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243" > is parsed as: > host = null > port = -1 > name = null > While sparkUrl = spark://HeartbeatReceiver@localhost:55691 is parsed properly. > This is happening on our production machines and causing spark to not start > up. > org.apache.spark.SparkException: Invalid Spark URL: > spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243 > at > org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:65) > at > org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:133) > at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88) > at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96) > at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32) > at org.apache.spark.executor.Executor.(Executor.scala:121) > at > org.apache.spark.scheduler.local.LocalEndpoint.(LocalSchedulerBackend.scala:59) > at > org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173) > at org.apache.spark.SparkContext.(SparkContext.scala:507) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2283) > at > 
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:833) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:825) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:825) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
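The bracket requirement is the standard URI rule for IPv6 literals (RFC 2732/3986), not something Spark-specific. An illustrative check with Python's `urllib` (not Spark's `RpcEndpointAddress` parser):

```python
from urllib.parse import urlparse

# Unbracketed: the colons inside the IPv6 literal make the host/port split
# ambiguous, so the port cannot be extracted.
bad = urlparse("spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243")
try:
    bad.port
except ValueError:
    print("unbracketed IPv6 literal: host/port split is ambiguous")

# Bracketed ([...] around the address): host and port separate cleanly.
good = urlparse("spark://HeartbeatReceiver@[2401:db00:2111:40a1:face:0:21:0]:35243")
print(good.hostname)  # 2401:db00:2111:40a1:face:0:21:0
print(good.port)      # 35243
```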
[jira] [Commented] (SPARK-21871) Check actual bytecode size when compiling generated code
[ https://issues.apache.org/jira/browse/SPARK-21871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193390#comment-16193390 ] Apache Spark commented on SPARK-21871: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/19440 > Check actual bytecode size when compiling generated code > > > Key: SPARK-21871 > URL: https://issues.apache.org/jira/browse/SPARK-21871 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Critical > Fix For: 2.3.0 > > > In SPARK-21603, we added code to give up code compilation and use interpreter > execution in SparkPlan if the line number of generated functions goes over > maxLinesPerFunction. But, we already have code to collect metrics for > compiled bytecode size in `CodeGenerator` object. So, I think we could easily > reuse the code for this purpose. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator
[ https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193389#comment-16193389 ] Bago Amirbekian commented on SPARK-13030: - Just so I'm clear, does multi-column in this context mean apply one-hot-encoder to each column and then join the resulting vectors? How do you all feel about giving the new OneHotEncoder the same `handleInvalid` semantics as StringIndexer? > Change OneHotEncoder to Estimator > - > > Key: SPARK-13030 > URL: https://issues.apache.org/jira/browse/SPARK-13030 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.6.0 >Reporter: Wojciech Jurczyk > > OneHotEncoder should be an Estimator, just like in scikit-learn > (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). > In its current form, it is impossible to use when number of categories is > different between training dataset and test dataset. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
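A toy sketch of the estimator argument (pure Python with hypothetical `fit`/`transform` names, not the MLlib API): the encoding width has to come from a fit on the training data, not from whichever batch happens to be transformed, and an unseen-category check mirrors the `handleInvalid='error'` semantics of StringIndexer asked about above.

```python
def fit(values):
    # "Estimator" step: learn the category list from the training data.
    return sorted(set(values))

def transform(categories, value):
    # handleInvalid='error'-style behavior for categories unseen at fit time.
    if value not in categories:
        raise ValueError(f"unseen category: {value}")
    return [1 if value == c else 0 for c in categories]

model = fit(["a", "b", "c"])   # training data has 3 categories
print(transform(model, "b"))   # [0, 1, 0]
print(transform(model, "a"))   # [1, 0, 0] -- width stays 3 even if a test
                               # batch never contains 'c'
```

A transformer-only OneHotEncoder has no `fit` step, so its output width depends on the data it sees, which is exactly why train and test encodings can disagree.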
[jira] [Commented] (SPARK-20055) Documentation for CSV datasets in SQL programming guide
[ https://issues.apache.org/jira/browse/SPARK-20055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193355#comment-16193355 ] Andrew Ash commented on SPARK-20055: What I would find most useful is a list of available options and their behaviors, like https://github.com/databricks/spark-csv#features Even though that project has been in-lined into Apache Spark, that github page is still the best reference for csv options > Documentation for CSV datasets in SQL programming guide > --- > > Key: SPARK-20055 > URL: https://issues.apache.org/jira/browse/SPARK-20055 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon > > I guess things commonly used and important are documented there rather than > documenting everything and every option in the programming guide - > http://spark.apache.org/docs/latest/sql-programming-guide.html. > It seems JSON datasets > http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets > are documented whereas CSV datasets are not. > Nowadays, they are pretty similar in APIs and options. Some options are > notable for both, In particular, ones such as {{wholeFile}}. Moreover, > several options such as {{inferSchema}} and {{header}} are important in CSV > that affect the type/column name of data. > In that sense, I think we might better document CSV datasets with some > examples too because I believe reading CSV is pretty much common use cases. > Also, I think we could also leave some pointers for options of API > documentations for both (rather than duplicating the documentation). > So, my suggestion is, > - Add CSV Datasets section. > - Add links for options for both JSON and CSV that point each API > documentation > - Fix trivial minor fixes together in both sections. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15682) Hive partition write looks for root hdfs folder for existence
[ https://issues.apache.org/jira/browse/SPARK-15682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-15682: -- Summary: Hive partition write looks for root hdfs folder for existence (was: Hive ORC partition write looks for root hdfs folder for existence) > Hive partition write looks for root hdfs folder for existence > - > > Key: SPARK-15682 > URL: https://issues.apache.org/jira/browse/SPARK-15682 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Dipankar > > Scenario: > I am using the below program to create new partition based on the current > date which signifies the run date. > However, it fails citing hdfs folder already exists. It checks the root > folder and not new partition value. > Is partitionBy clause actually not checking the hive metastore or folder till > proc_date= some value. ? and it's just a way to create folders based on > partition key. Not any way related to hive partition ?? > Alternatively, should i use > result.write.format("orc").save("test.sms_outbound_view_orc/proc_date=2016-05-30") > to achieve the result. > But this will not update hive metastore with new partition details. > Is spark orc support not equivalent to HCatStorer API? > My hive table is built with proc_date as partition column. 
> Source code : > result.registerTempTable("result_tab") > val result_partition = sqlContext.sql("FROM result_tab select > *,'"+curr_date+"' as proc_date") > result_partition.write.format("orc").partitionBy("proc_date").save("test.sms_outbound_view_orc") > Exception > 16/05/31 15:57:34 INFO ParseDriver: Parsing command: FROM result_tab select > *,'2016-05-31' as proc_date > 16/05/31 15:57:34 INFO ParseDriver: Parse Completed > Exception in thread "main" org.apache.spark.sql.AnalysisException: path > hdfs://hdpprod/user/dipankar.ghosal/test.sms_outbound_view_orc already > exists.; > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137) > at SampleApp$.main(SampleApp.scala:31) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14286) Empty ORC table join throws exception
[ https://issues.apache.org/jira/browse/SPARK-14286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-14286: -- Component/s: SQL > Empty ORC table join throws exception > - > > Key: SPARK-14286 > URL: https://issues.apache.org/jira/browse/SPARK-14286 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Rajesh Balamohan >Priority: Minor > > When joining with an empty ORC table, sparks throws following exception. > {noformat} > java.sql.SQLException: java.lang.IllegalArgumentException: orcFileOperator: > path /apps/hive/warehouse/test.db/table does not have valid orc files > matching the pattern > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14286) Empty ORC table join throws exception
[ https://issues.apache.org/jira/browse/SPARK-14286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193344#comment-16193344 ] Dongjoon Hyun commented on SPARK-14286: --- Hi, [~rajesh.balamohan]. How can I reproduce this? When I tested the following in Spark 1.6.3, 2.0.2, 2.1.1, 2.2.0, all passes. {code} scala> sql("create table x(a int) stored as orc") res4: org.apache.spark.sql.DataFrame = [result: string] scala> sql("create table y(a int) stored as orc") res5: org.apache.spark.sql.DataFrame = [result: string] scala> sql("select * from x, y where x.a=y.a").show +---+---+ | a| a| +---+---+ +---+---+ {code} > Empty ORC table join throws exception > - > > Key: SPARK-14286 > URL: https://issues.apache.org/jira/browse/SPARK-14286 > Project: Spark > Issue Type: Bug >Reporter: Rajesh Balamohan >Priority: Minor > > When joining with an empty ORC table, sparks throws following exception. > {noformat} > java.sql.SQLException: java.lang.IllegalArgumentException: orcFileOperator: > path /apps/hive/warehouse/test.db/table does not have valid orc files > matching the pattern > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17047) Spark 2 cannot create table when CLUSTERED.
[ https://issues.apache.org/jira/browse/SPARK-17047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-17047. --- Resolution: Fixed Fix Version/s: 2.3.0 This is resolved by SPARK-17729. {code} scala> sql(""" | CREATE TABLE t2( | ID INT | , CLUSTERED INT | , SCATTERED INT | , RANDOMISED INT | , RANDOM_STRING VARCHAR(50) | , SMALL_VC VARCHAR(10) | , PADDING VARCHAR(10) | ) | CLUSTERED BY (ID) INTO 256 BUCKETS | STORED AS ORC | TBLPROPERTIES ( "orc.compress"="SNAPPY", | "orc.create.index"="true", | "orc.bloom.filter.columns"="ID", | "orc.bloom.filter.fpp"="0.05", | "orc.stripe.size"="268435456", | "orc.row.index.stride"="1" ) | """) scala> spark.version res3: String = 2.3.0-SNAPSHOT {code} > Spark 2 cannot create table when CLUSTERED. > --- > > Key: SPARK-17047 > URL: https://issues.apache.org/jira/browse/SPARK-17047 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.1, 2.2.0 >Reporter: Dr Mich Talebzadeh > Fix For: 2.3.0 > > > This does not work with CLUSTERED BY clause in Spark 2 now! > CREATE TABLE test.dummy2 > ( > ID INT >, CLUSTERED INT >, SCATTERED INT >, RANDOMISED INT >, RANDOM_STRING VARCHAR(50) >, SMALL_VC VARCHAR(10) >, PADDING VARCHAR(10) > ) > CLUSTERED BY (ID) INTO 256 BUCKETS > STORED AS ORC > TBLPROPERTIES ( "orc.compress"="SNAPPY", > "orc.create.index"="true", > "orc.bloom.filter.columns"="ID", > "orc.bloom.filter.fpp"="0.05", > "orc.stripe.size"="268435456", > "orc.row.index.stride"="1" ) > scala> HiveContext.sql(sqltext) > org.apache.spark.sql.catalyst.parser.ParseException: > Operation not allowed: CREATE TABLE ... CLUSTERED BY(line 2, pos 0) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17047) Spark 2 cannot create table when CLUSTERED.
[ https://issues.apache.org/jira/browse/SPARK-17047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-17047: -- Component/s: SQL > Spark 2 cannot create table when CLUSTERED. > --- > > Key: SPARK-17047 > URL: https://issues.apache.org/jira/browse/SPARK-17047 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.1, 2.2.0 >Reporter: Dr Mich Talebzadeh > Fix For: 2.3.0 > > > This does not work with CLUSTERED BY clause in Spark 2 now! > CREATE TABLE test.dummy2 > ( > ID INT >, CLUSTERED INT >, SCATTERED INT >, RANDOMISED INT >, RANDOM_STRING VARCHAR(50) >, SMALL_VC VARCHAR(10) >, PADDING VARCHAR(10) > ) > CLUSTERED BY (ID) INTO 256 BUCKETS > STORED AS ORC > TBLPROPERTIES ( "orc.compress"="SNAPPY", > "orc.create.index"="true", > "orc.bloom.filter.columns"="ID", > "orc.bloom.filter.fpp"="0.05", > "orc.stripe.size"="268435456", > "orc.row.index.stride"="1" ) > scala> HiveContext.sql(sqltext) > org.apache.spark.sql.catalyst.parser.ParseException: > Operation not allowed: CREATE TABLE ... CLUSTERED BY(line 2, pos 0) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17047) Spark 2 cannot create table when CLUSTERED.
[ https://issues.apache.org/jira/browse/SPARK-17047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-17047: -- Affects Version/s: 2.1.1 2.2.0 > Spark 2 cannot create table when CLUSTERED. > --- > > Key: SPARK-17047 > URL: https://issues.apache.org/jira/browse/SPARK-17047 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.1, 2.2.0 >Reporter: Dr Mich Talebzadeh > Fix For: 2.3.0 > > > This does not work with CLUSTERED BY clause in Spark 2 now! > CREATE TABLE test.dummy2 > ( > ID INT >, CLUSTERED INT >, SCATTERED INT >, RANDOMISED INT >, RANDOM_STRING VARCHAR(50) >, SMALL_VC VARCHAR(10) >, PADDING VARCHAR(10) > ) > CLUSTERED BY (ID) INTO 256 BUCKETS > STORED AS ORC > TBLPROPERTIES ( "orc.compress"="SNAPPY", > "orc.create.index"="true", > "orc.bloom.filter.columns"="ID", > "orc.bloom.filter.fpp"="0.05", > "orc.stripe.size"="268435456", > "orc.row.index.stride"="1" ) > scala> HiveContext.sql(sqltext) > org.apache.spark.sql.catalyst.parser.ParseException: > Operation not allowed: CREATE TABLE ... CLUSTERED BY(line 2, pos 0) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17047) Spark 2 cannot create table when CLUSTERED.
[ https://issues.apache.org/jira/browse/SPARK-17047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-17047: -- Summary: Spark 2 cannot create table when CLUSTERED. (was: Spark 2 cannot create ORC table when CLUSTERED.) > Spark 2 cannot create table when CLUSTERED. > --- > > Key: SPARK-17047 > URL: https://issues.apache.org/jira/browse/SPARK-17047 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Dr Mich Talebzadeh > > This does not work with CLUSTERED BY clause in Spark 2 now! > CREATE TABLE test.dummy2 > ( > ID INT >, CLUSTERED INT >, SCATTERED INT >, RANDOMISED INT >, RANDOM_STRING VARCHAR(50) >, SMALL_VC VARCHAR(10) >, PADDING VARCHAR(10) > ) > CLUSTERED BY (ID) INTO 256 BUCKETS > STORED AS ORC > TBLPROPERTIES ( "orc.compress"="SNAPPY", > "orc.create.index"="true", > "orc.bloom.filter.columns"="ID", > "orc.bloom.filter.fpp"="0.05", > "orc.stripe.size"="268435456", > "orc.row.index.stride"="1" ) > scala> HiveContext.sql(sqltext) > org.apache.spark.sql.catalyst.parser.ParseException: > Operation not allowed: CREATE TABLE ... CLUSTERED BY(line 2, pos 0) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22202) Release tgz content differences for python and R
[ https://issues.apache.org/jira/browse/SPARK-22202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193247#comment-16193247 ] holdenk commented on SPARK-22202: - [~felixcheung] for Python I think it would not be bad to be consistent, but I'd probably put it at a trivial rather than major level personally. The fix could be the same for both (e.g. create tgz's _then_ run python/r packaging) so I think keeping it together is fine. > Release tgz content differences for python and R > > > Key: SPARK-22202 > URL: https://issues.apache.org/jira/browse/SPARK-22202 > Project: Spark > Issue Type: Bug > Components: PySpark, SparkR >Affects Versions: 2.1.2, 2.2.1, 2.3.0 >Reporter: Felix Cheung > > As a follow up to SPARK-22167, currently we are running different > profiles/steps in make-release.sh for hadoop2.7 vs hadoop2.6 (and others), we > should consider if these differences are significant and whether they should > be addressed. > A couple of things: > - R.../doc directory is not in any release jar except hadoop 2.6 > - python/dist, python.egg-info are not in any release jar except hadoop 2.7 > - R DESCRIPTION has a few additions > I've checked to confirm these are the same in 2.1.1 release so this isn't a > regression. 
> {code} > spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/doc: > sparkr-vignettes.Rmd > sparkr-vignettes.R > sparkr-vignettes.html > index.html > Only in spark-2.1.2-bin-hadoop2.7/python: dist > Only in spark-2.1.2-bin-hadoop2.7/python/pyspark: python > Only in spark-2.1.2-bin-hadoop2.7/python: pyspark.egg-info > diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/DESCRIPTION > spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/DESCRIPTION > 25a26,27 > > NeedsCompilation: no > > Packaged: 2017-10-03 00:42:30 UTC; holden > 31c33 > < Built: R 3.4.1; ; 2017-10-02 23:18:21 UTC; unix > --- > > Built: R 3.4.1; ; 2017-10-03 00:45:27 UTC; unix > Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR: doc > diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/html/00Index.html > spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/html/00Index.html > 16a17 > > User guides, package vignettes and other > > documentation. > Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/Meta: vignette.rds > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21785) Support create table from a file schema
[ https://issues.apache.org/jira/browse/SPARK-21785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193209#comment-16193209 ] Jacky Shen edited comment on SPARK-21785 at 10/5/17 5:28 PM: - Yes, we can create a table from an existing parquet file, and that is a good solution. But for the request itself, the user wants to create a table with the schema taken from the given parquet file while specifying a different location for the table, and possibly other table properties. That is what "create table ... using parquet" cannot achieve. {code:java} spark-sql> create table table2 using parquet options(path 'test-data/dec-in-fixed-len.parquet') location '/tmp/table1'; Error in query: LOCATION and 'path' in OPTIONS are both used to indicate the custom table path, you can only specify one of them.(line 1, pos 0) {code} was (Author: jacshen): Yes, we can create a table from an existing parquet file, and that is a good solution. But for the request itself, the user wants to create a table with the schema taken from the given parquet file while specifying a different location for the table, and possibly other table properties. That is what "create table ... using parquet" cannot achieve. 
{code:java} spark-sql> create table table2 using parquet options(path '/Users/jacshen/project/git-rep/spark/sql/core/src/test/resources/test-data/dec-in-fixed-len.parquet') location '/tmp/table1'; Error in query: LOCATION and 'path' in OPTIONS are both used to indicate the custom table path, you can only specify one of them.(line 1, pos 0) {code} > Support create table from a file schema > --- > > Key: SPARK-21785 > URL: https://issues.apache.org/jira/browse/SPARK-21785 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: liupengcheng > Labels: sql > > Current Spark does not support creating a table from a file schema, > for example: > {code:java} > CREATE EXTERNAL TABLE IF NOT EXISTS test LIKE PARQUET > '/user/test/abc/a.snappy.parquet' STORED AS PARQUET LOCATION > '/user/test/def/'; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22202) Release tgz content differences for python and R
[ https://issues.apache.org/jira/browse/SPARK-22202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193239#comment-16193239 ] Felix Cheung commented on SPARK-22202: -- [~holden.ka...@gmail.com] would you be concerned with the python differences? if not, I'll turn this into just for R. > Release tgz content differences for python and R > > > Key: SPARK-22202 > URL: https://issues.apache.org/jira/browse/SPARK-22202 > Project: Spark > Issue Type: Bug > Components: PySpark, SparkR >Affects Versions: 2.1.2, 2.2.1, 2.3.0 >Reporter: Felix Cheung > > As a follow up to SPARK-22167, currently we are running different > profiles/steps in make-release.sh for hadoop2.7 vs hadoop2.6 (and others), we > should consider if these differences are significant and whether they should > be addressed. > A couple of things: > - R.../doc directory is not in any release jar except hadoop 2.6 > - python/dist, python.egg-info are not in any release jar except hadoop 2.7 > - R DESCRIPTION has a few additions > I've checked to confirm these are the same in 2.1.1 release so this isn't a > regression. > {code} > spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/doc: > sparkr-vignettes.Rmd > sparkr-vignettes.R > sparkr-vignettes.html > index.html > Only in spark-2.1.2-bin-hadoop2.7/python: dist > Only in spark-2.1.2-bin-hadoop2.7/python/pyspark: python > Only in spark-2.1.2-bin-hadoop2.7/python: pyspark.egg-info > diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/DESCRIPTION > spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/DESCRIPTION > 25a26,27 > > NeedsCompilation: no > > Packaged: 2017-10-03 00:42:30 UTC; holden > 31c33 > < Built: R 3.4.1; ; 2017-10-02 23:18:21 UTC; unix > --- > > Built: R 3.4.1; ; 2017-10-03 00:45:27 UTC; unix > Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR: doc > diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/html/00Index.html > spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/html/00Index.html > 16a17 > > User guides, package vignettes and other > > documentation. 
> Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/Meta: vignette.rds > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22202) Release tgz content differences for python and R
[ https://issues.apache.org/jira/browse/SPARK-22202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193238#comment-16193238 ] Felix Cheung commented on SPARK-22202: -- Yes, exactly. > Release tgz content differences for python and R > > > Key: SPARK-22202 > URL: https://issues.apache.org/jira/browse/SPARK-22202 > Project: Spark > Issue Type: Bug > Components: PySpark, SparkR >Affects Versions: 2.1.2, 2.2.1, 2.3.0 >Reporter: Felix Cheung > > As a follow up to SPARK-22167, currently we are running different > profiles/steps in make-release.sh for hadoop2.7 vs hadoop2.6 (and others), we > should consider if these differences are significant and whether they should > be addressed. > A couple of things: > - R.../doc directory is not in any release jar except hadoop 2.6 > - python/dist, python.egg-info are not in any release jar except hadoop 2.7 > - R DESCRIPTION has a few additions > I've checked to confirm these are the same in 2.1.1 release so this isn't a > regression. > {code} > spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/doc: > sparkr-vignettes.Rmd > sparkr-vignettes.R > sparkr-vignettes.html > index.html > Only in spark-2.1.2-bin-hadoop2.7/python: dist > Only in spark-2.1.2-bin-hadoop2.7/python/pyspark: python > Only in spark-2.1.2-bin-hadoop2.7/python: pyspark.egg-info > diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/DESCRIPTION > spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/DESCRIPTION > 25a26,27 > > NeedsCompilation: no > > Packaged: 2017-10-03 00:42:30 UTC; holden > 31c33 > < Built: R 3.4.1; ; 2017-10-02 23:18:21 UTC; unix > --- > > Built: R 3.4.1; ; 2017-10-03 00:45:27 UTC; unix > Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR: doc > diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/html/00Index.html > spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/html/00Index.html > 16a17 > > User guides, package vignettes and other > > documentation. 
> Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/Meta: vignette.rds > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21785) Support create table from a file schema
[ https://issues.apache.org/jira/browse/SPARK-21785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193209#comment-16193209 ] Jacky Shen edited comment on SPARK-21785 at 10/5/17 5:12 PM: - Yes, we can create a table from an existing parquet file, and that is a good solution. But for the request itself, the user wants to create a table with the schema taken from the given parquet file while specifying a different location for the table, and possibly other table properties. That is what "create table ... using parquet" cannot achieve. {code:java} spark-sql> create table table2 using parquet options(path '/Users/jacshen/project/git-rep/spark/sql/core/src/test/resources/test-data/dec-in-fixed-len.parquet') location '/tmp/table1'; Error in query: LOCATION and 'path' in OPTIONS are both used to indicate the custom table path, you can only specify one of them.(line 1, pos 0) {code} was (Author: jacshen): Yes, we can create a table from an existing parquet file, and that is a good solution. But for the request itself, the user wants to create a table with the schema taken from the given parquet file while specifying a different location for the table, and possibly other table properties. That is what "create table ... using parquet" cannot achieve. 
{code:sql} spark-sql> create table table2 using parquet options(path '/Users/jacshen/project/git-rep/spark/sql/core/src/test/resources/test-data/dec-in-fixed-len.parquet') location '/tmp/table1'; Error in query: LOCATION and 'path' in OPTIONS are both used to indicate the custom table path, you can only specify one of them.(line 1, pos 0) {code} > Support create table from a file schema > --- > > Key: SPARK-21785 > URL: https://issues.apache.org/jira/browse/SPARK-21785 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: liupengcheng > Labels: sql > > Current Spark does not support creating a table from a file schema, > for example: > {code:java} > CREATE EXTERNAL TABLE IF NOT EXISTS test LIKE PARQUET > '/user/test/abc/a.snappy.parquet' STORED AS PARQUET LOCATION > '/user/test/def/'; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21785) Support create table from a file schema
[ https://issues.apache.org/jira/browse/SPARK-21785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193209#comment-16193209 ] Jacky Shen commented on SPARK-21785: Yes, we can create a table from an existing parquet file, and that is a good solution. But for the request itself, the user wants to create a table with the schema taken from the given parquet file while specifying a different location for the table, and possibly other table properties. That is what "create table ... using parquet" cannot achieve. {code:sql} spark-sql> create table table2 using parquet options(path '/Users/jacshen/project/git-rep/spark/sql/core/src/test/resources/test-data/dec-in-fixed-len.parquet') location '/tmp/table1'; Error in query: LOCATION and 'path' in OPTIONS are both used to indicate the custom table path, you can only specify one of them.(line 1, pos 0) {code} > Support create table from a file schema > --- > > Key: SPARK-21785 > URL: https://issues.apache.org/jira/browse/SPARK-21785 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: liupengcheng > Labels: sql > > Current Spark does not support creating a table from a file schema, > for example: > {code:java} > CREATE EXTERNAL TABLE IF NOT EXISTS test LIKE PARQUET > '/user/test/abc/a.snappy.parquet' STORED AS PARQUET LOCATION > '/user/test/def/'; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
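Since a single CREATE TABLE statement cannot carry both OPTIONS(path ...) and LOCATION, one hedged sketch of what the requester wants (schema from one file, data at a different path) splits the two steps in Scala. This is an illustration, not the ticket's proposed fix, and it assumes `spark.catalog.createTable` accepts an explicit schema in the affected versions:

```scala
// Sketch: derive the schema from an existing Parquet file, then create a
// table whose storage path differs from that file. Paths are the ones
// quoted in the comment above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// 1. schema comes from the sample file
val schema = spark.read
  .parquet("test-data/dec-in-fixed-len.parquet")
  .schema

// 2. the table's data directory is a different location entirely
spark.catalog.createTable("table2", "parquet", schema, Map("path" -> "/tmp/table1"))
```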
[jira] [Assigned] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21866: Assignee: Apache Spark > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter >Assignee: Apache Spark > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark V1.1.pdf > > > h2. Background and motivation > As Apache Spark is being used more and more in the industry, some new use > cases are emerging for different data formats beyond the traditional SQL > types or the numerical types (vectors and matrices). Deep Learning > applications commonly deal with image processing. A number of projects add > some Deep Learning capabilities to Spark (see list below), but they struggle > to communicate with each other or with MLlib pipelines because there is no > standard way to represent an image in Spark DataFrames. We propose to > federate efforts for representing images in Spark by defining a > representation that caters to the most common needs of users and library > developers. > This SPIP proposes a specification to represent images in Spark DataFrames > and Datasets (based on existing industrial standards), and an interface for > loading sources of images. It is not meant to be a full-fledged image > processing library, but rather the core description that other libraries and > users can rely on. Several packages already offer various processing > facilities for transforming images or doing more complex operations, and each > has various design tradeoffs that make them better as standalone solutions. > This project is a joint collaboration between Microsoft and Databricks, which > have been testing this design in two open source packages: MMLSpark and Deep > Learning Pipelines. 
> The proposed image format is an in-memory, decompressed representation that > targets low-level applications. It is significantly more liberal in memory > usage than compressed image representations such as JPEG, PNG, etc., but it > allows easy communication with popular image processing libraries and has no > decoding overhead. > h2. Targets users and personas: > Data scientists, data engineers, library developers. > The following libraries define primitives for loading and representing > images, and will gain from a common interchange format (in alphabetical > order): > * BigDL > * DeepLearning4J > * Deep Learning Pipelines > * MMLSpark > * TensorFlow (Spark connector) > * TensorFlowOnSpark > * TensorFrames > * Thunder > h2. Goals: > * Simple representation of images in Spark DataFrames, based on pre-existing > industrial standards (OpenCV) > * This format should eventually allow the development of high-performance > integration points with image processing libraries such as libOpenCV, Google > TensorFlow, CNTK, and other C libraries. > * The reader should be able to read popular formats of images from > distributed sources. > h2. Non-Goals: > Images are a versatile medium and encompass a very wide range of formats and > representations. This SPIP explicitly aims at the most common use case in the > industry currently: multi-channel matrices of binary, int32, int64, float or > double data that can fit comfortably in the heap of the JVM: > * the total size of an image should be restricted to less than 2GB (roughly) > * the meaning of color channels is application-specific and is not mandated > by the standard (in line with the OpenCV standard) > * specialized formats used in meteorology, the medical field, etc. are not > supported > * this format is specialized to images and does not attempt to solve the more > general problem of representing n-dimensional tensors in Spark > h2. 
Proposed API changes > We propose to add a new package in the package structure, under the MLlib > project: > {{org.apache.spark.image}} > h3. Data format > We propose to add the following structure: > imageSchema = StructType([ > * StructField("mode", StringType(), False), > ** The exact representation of the data. > ** The values are described in the following OpenCV convention. Basically, > the type has both "depth" and "number of channels" info: in particular, type > "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 > (value 32 in the table) with the channel order specified by convention. > ** The exact channel ordering and meaning of each channel is dictated by > convention. By default, the order is RGB (3 channels) and BGRA (4 channels). > If the image failed to load, the value is the empty string "". > * StructField("origin", StringType(), True), > ** Some information about
[jira] [Assigned] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21866: Assignee: (was: Apache Spark) > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark V1.1.pdf > > > h2. Background and motivation > As Apache Spark is being used more and more in the industry, some new use > cases are emerging for different data formats beyond the traditional SQL > types or the numerical types (vectors and matrices). Deep Learning > applications commonly deal with image processing. A number of projects add > some Deep Learning capabilities to Spark (see list below), but they struggle > to communicate with each other or with MLlib pipelines because there is no > standard way to represent an image in Spark DataFrames. We propose to > federate efforts for representing images in Spark by defining a > representation that caters to the most common needs of users and library > developers. > This SPIP proposes a specification to represent images in Spark DataFrames > and Datasets (based on existing industrial standards), and an interface for > loading sources of images. It is not meant to be a full-fledged image > processing library, but rather the core description that other libraries and > users can rely on. Several packages already offer various processing > facilities for transforming images or doing more complex operations, and each > has various design tradeoffs that make them better as standalone solutions. > This project is a joint collaboration between Microsoft and Databricks, which > have been testing this design in two open source packages: MMLSpark and Deep > Learning Pipelines. 
> The proposed image format is an in-memory, decompressed representation that > targets low-level applications. It is significantly more liberal in memory > usage than compressed image representations such as JPEG, PNG, etc., but it > allows easy communication with popular image processing libraries and has no > decoding overhead. > h2. Targets users and personas: > Data scientists, data engineers, library developers. > The following libraries define primitives for loading and representing > images, and will gain from a common interchange format (in alphabetical > order): > * BigDL > * DeepLearning4J > * Deep Learning Pipelines > * MMLSpark > * TensorFlow (Spark connector) > * TensorFlowOnSpark > * TensorFrames > * Thunder > h2. Goals: > * Simple representation of images in Spark DataFrames, based on pre-existing > industrial standards (OpenCV) > * This format should eventually allow the development of high-performance > integration points with image processing libraries such as libOpenCV, Google > TensorFlow, CNTK, and other C libraries. > * The reader should be able to read popular formats of images from > distributed sources. > h2. Non-Goals: > Images are a versatile medium and encompass a very wide range of formats and > representations. This SPIP explicitly aims at the most common use case in the > industry currently: multi-channel matrices of binary, int32, int64, float or > double data that can fit comfortably in the heap of the JVM: > * the total size of an image should be restricted to less than 2GB (roughly) > * the meaning of color channels is application-specific and is not mandated > by the standard (in line with the OpenCV standard) > * specialized formats used in meteorology, the medical field, etc. are not > supported > * this format is specialized to images and does not attempt to solve the more > general problem of representing n-dimensional tensors in Spark > h2. 
Proposed API changes > We propose to add a new package in the package structure, under the MLlib > project: > {{org.apache.spark.image}} > h3. Data format > We propose to add the following structure: > imageSchema = StructType([ > * StructField("mode", StringType(), False), > ** The exact representation of the data. > ** The values are described in the following OpenCV convention. Basically, > the type has both "depth" and "number of channels" info: in particular, type > "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 > (value 32 in the table) with the channel order specified by convention. > ** The exact channel ordering and meaning of each channel is dictated by > convention. By default, the order is RGB (3 channels) and BGRA (4 channels). > If the image failed to load, the value is the empty string "". > * StructField("origin", StringType(), True), > ** Some information about the origin of the image.
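The two fields that survive the truncated excerpt above can be sketched as a Scala StructType; the remaining fields of the proposed schema are cut off in this message, so they are deliberately not reconstructed here:

```scala
// Partial sketch of the SPIP's proposed image schema: only the "mode" and
// "origin" fields are fully described in the excerpt above.
import org.apache.spark.sql.types._

val imageSchemaSketch = StructType(Seq(
  // OpenCV-style type string, e.g. "CV_8UC3" = 3-channel unsigned bytes;
  // the empty string "" if the image failed to load
  StructField("mode", StringType, nullable = false),
  // information about where the image came from
  StructField("origin", StringType, nullable = true)
  // ... further fields are elided in the excerpt and not guessed at here
))
```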
[jira] [Resolved] (SPARK-22179) percentile_approx should choose the first element if it already reaches the percentage
[ https://issues.apache.org/jira/browse/SPARK-22179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22179. --- Resolution: Duplicate OK, I think this is effectively taken over by the new issue? > percentile_approx should choose the first element if it already reaches the > percentage > -- > > Key: SPARK-22179 > URL: https://issues.apache.org/jira/browse/SPARK-22179 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang > > percentile_approx should choose the first element if it already reaches the > percentage. > For example, given input data 1 to 10, if a user queries 10% (or even less) > percentile, it should return 1 (instead of 2), because the first value 1 > already reaches 10%. Currently it returns a wrong answer: 2. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
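The description's example can be written as a short repro sketch (my wording, not attached to the ticket). Per the report, the query below returned 2 on the affected build, while 1 is the expected answer because the first sorted value already covers 10% of the rows:

```scala
// Repro sketch for the percentile_approx boundary case described above.
import spark.implicits._

val df = (1 to 10).toDF("col")
df.createOrReplaceTempView("t")

// expected: 1, since the first element alone reaches the 10% mark;
// the ticket reports 2 being returned instead
spark.sql("SELECT percentile_approx(col, 0.1) FROM t").show()
```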
[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193212#comment-16193212 ] Apache Spark commented on SPARK-21866: -- User 'imatiach-msft' has created a pull request for this issue: https://github.com/apache/spark/pull/19439 > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark V1.1.pdf > > > h2. Background and motivation > As Apache Spark is being used more and more in the industry, some new use > cases are emerging for different data formats beyond the traditional SQL > types or the numerical types (vectors and matrices). Deep Learning > applications commonly deal with image processing. A number of projects add > some Deep Learning capabilities to Spark (see list below), but they struggle > to communicate with each other or with MLlib pipelines because there is no > standard way to represent an image in Spark DataFrames. We propose to > federate efforts for representing images in Spark by defining a > representation that caters to the most common needs of users and library > developers. > This SPIP proposes a specification to represent images in Spark DataFrames > and Datasets (based on existing industrial standards), and an interface for > loading sources of images. It is not meant to be a full-fledged image > processing library, but rather the core description that other libraries and > users can rely on. Several packages already offer various processing > facilities for transforming images or doing more complex operations, and each > has various design tradeoffs that make them better as standalone solutions. 
> This project is a joint collaboration between Microsoft and Databricks, which > have been testing this design in two open source packages: MMLSpark and Deep > Learning Pipelines. > The proposed image format is an in-memory, decompressed representation that > targets low-level applications. It is significantly more liberal in memory > usage than compressed image representations such as JPEG, PNG, etc., but it > allows easy communication with popular image processing libraries and has no > decoding overhead. > h2. Targets users and personas: > Data scientists, data engineers, library developers. > The following libraries define primitives for loading and representing > images, and will gain from a common interchange format (in alphabetical > order): > * BigDL > * DeepLearning4J > * Deep Learning Pipelines > * MMLSpark > * TensorFlow (Spark connector) > * TensorFlowOnSpark > * TensorFrames > * Thunder > h2. Goals: > * Simple representation of images in Spark DataFrames, based on pre-existing > industrial standards (OpenCV) > * This format should eventually allow the development of high-performance > integration points with image processing libraries such as libOpenCV, Google > TensorFlow, CNTK, and other C libraries. > * The reader should be able to read popular formats of images from > distributed sources. > h2. Non-Goals: > Images are a versatile medium and encompass a very wide range of formats and > representations. This SPIP explicitly aims at the most common use case in the > industry currently: multi-channel matrices of binary, int32, int64, float or > double data that can fit comfortably in the heap of the JVM: > * the total size of an image should be restricted to less than 2GB (roughly) > * the meaning of color channels is application-specific and is not mandated > by the standard (in line with the OpenCV standard) > * specialized formats used in meteorology, the medical field, etc. 
are not > supported > * this format is specialized to images and does not attempt to solve the more > general problem of representing n-dimensional tensors in Spark > h2. Proposed API changes > We propose to add a new package in the package structure, under the MLlib > project: > {{org.apache.spark.image}} > h3. Data format > We propose to add the following structure: > imageSchema = StructType([ > * StructField("mode", StringType(), False), > ** The exact representation of the data. > ** The values are described in the following OpenCV convention. Basically, > the type has both "depth" and "number of channels" info: in particular, type > "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 > (value 32 in the table) with the channel order specified by convention. > ** The exact channel ordering and meaning of each channel is dictated by > convention. By default, the order is RGB (3 channels) and BGRA (4 channels). > If the image failed to load, the value is the empty string
[jira] [Commented] (SPARK-19984) ERROR codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'
[ https://issues.apache.org/jira/browse/SPARK-19984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193201#comment-16193201 ] Kazuaki Ishizaki commented on SPARK-19984: -- [~JohnSteidley] Thank you for your comment. Now, I updated my program. Then, I got the same physical plan as what you provided. However, my program works on Spark 2.1.2 RC and master branch. While the physical plan tried to {{SortMergeJoin}} two strings, the generated code seems to {{SortMergeJoin}} string and long value for {{count(1)}}. That strange {{SortMergeJoin}} is what I cannot understand. {code} == Physical Plan == *HashAggregate(keys=[], functions=[count(1)], output=[count#35L]) +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#40L]) +- *Project +- *SortMergeJoin [A#25], [A#20], Inner :- *Sort [A#25 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(A#25, 1) : +- *Project [id#16 AS A#25] :+- *Filter isnotnull(id#16) : +- *FileScan parquet [id#16] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:..., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct +- *Sort [A#20 ASC NULLS FIRST], false, 0 +- *Project [id#16 AS A#20] +- *Filter isnotnull(id#16) +- *GlobalLimit 3 +- Exchange SinglePartition +- *LocalLimit 3 +- *FileScan parquet [id#16] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct {code} > ERROR codegen.CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java' > - > > Key: SPARK-19984 > URL: https://issues.apache.org/jira/browse/SPARK-19984 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Andrey Yakovenko > Attachments: after_adding_count.txt, before_adding_count.txt > > > I had this error few time on my local hadoop 2.7.3+Spark2.1.0 environment. > This is not permanent error, next time i run it could disappear. 
> Unfortunately i don't know how to reproduce the issue. As you can see from > the log my logic is pretty complicated. > Here is a part of log i've got (container_1489514660953_0015_01_01) > {code} > 17/03/16 11:07:04 ERROR codegen.CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 151, Column 29: A method named "compare" is not declared in any enclosing > class nor any supertype, nor through a static import > /* 001 */ public Object generate(Object[] references) { > /* 002 */ return new GeneratedIterator(references); > /* 003 */ } > /* 004 */ > /* 005 */ final class GeneratedIterator extends > org.apache.spark.sql.execution.BufferedRowIterator { > /* 006 */ private Object[] references; > /* 007 */ private scala.collection.Iterator[] inputs; > /* 008 */ private boolean agg_initAgg; > /* 009 */ private boolean agg_bufIsNull; > /* 010 */ private long agg_bufValue; > /* 011 */ private boolean agg_initAgg1; > /* 012 */ private boolean agg_bufIsNull1; > /* 013 */ private long agg_bufValue1; > /* 014 */ private scala.collection.Iterator smj_leftInput; > /* 015 */ private scala.collection.Iterator smj_rightInput; > /* 016 */ private InternalRow smj_leftRow; > /* 017 */ private InternalRow smj_rightRow; > /* 018 */ private UTF8String smj_value2; > /* 019 */ private java.util.ArrayList smj_matches; > /* 020 */ private UTF8String smj_value3; > /* 021 */ private UTF8String smj_value4; > /* 022 */ private org.apache.spark.sql.execution.metric.SQLMetric > smj_numOutputRows; > /* 023 */ private UnsafeRow smj_result; > /* 024 */ private > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder smj_holder; > /* 025 */ private > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter > smj_rowWriter; > /* 026 */ private org.apache.spark.sql.execution.metric.SQLMetric > agg_numOutputRows; > /* 027 */ private org.apache.spark.sql.execution.metric.SQLMetric > agg_aggTime; > /* 028 */ private UnsafeRow 
agg_result; > /* 029 */ private > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder; > /* 030 */ private > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter > agg_rowWriter; > /* 031 */ private org.apache.spark.sql.execution.metric.SQLMetric > agg_numOutputRows1; > /* 032 */ private org.apache.spark.sql.execution.metric.SQLMetric > agg_aggTime1; > /* 033 */ private UnsafeRow agg_result1; > /* 034 */ private >
[jira] [Commented] (SPARK-19984) ERROR codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'
[ https://issues.apache.org/jira/browse/SPARK-19984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193120#comment-16193120 ] John Steidley commented on SPARK-19984: --- [~kiszk] I noticed one other difference in our plans: yours has count(A#52) mine has count(1). Could that be significant? > ERROR codegen.CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java' > - > > Key: SPARK-19984 > URL: https://issues.apache.org/jira/browse/SPARK-19984 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Andrey Yakovenko > Attachments: after_adding_count.txt, before_adding_count.txt > > > I had this error few time on my local hadoop 2.7.3+Spark2.1.0 environment. > This is not permanent error, next time i run it could disappear. > Unfortunately i don't know how to reproduce the issue. As you can see from > the log my logic is pretty complicated. > Here is a part of log i've got (container_1489514660953_0015_01_01) > {code} > 17/03/16 11:07:04 ERROR codegen.CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 151, Column 29: A method named "compare" is not declared in any enclosing > class nor any supertype, nor through a static import > /* 001 */ public Object generate(Object[] references) { > /* 002 */ return new GeneratedIterator(references); > /* 003 */ } > /* 004 */ > /* 005 */ final class GeneratedIterator extends > org.apache.spark.sql.execution.BufferedRowIterator { > /* 006 */ private Object[] references; > /* 007 */ private scala.collection.Iterator[] inputs; > /* 008 */ private boolean agg_initAgg; > /* 009 */ private boolean agg_bufIsNull; > /* 010 */ private long agg_bufValue; > /* 011 */ private boolean agg_initAgg1; > /* 012 */ private boolean agg_bufIsNull1; > /* 013 */ private long agg_bufValue1; > /* 014 */ private scala.collection.Iterator smj_leftInput; > /* 015 */ private 
scala.collection.Iterator smj_rightInput; > /* 016 */ private InternalRow smj_leftRow; > /* 017 */ private InternalRow smj_rightRow; > /* 018 */ private UTF8String smj_value2; > /* 019 */ private java.util.ArrayList smj_matches; > /* 020 */ private UTF8String smj_value3; > /* 021 */ private UTF8String smj_value4; > /* 022 */ private org.apache.spark.sql.execution.metric.SQLMetric > smj_numOutputRows; > /* 023 */ private UnsafeRow smj_result; > /* 024 */ private > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder smj_holder; > /* 025 */ private > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter > smj_rowWriter; > /* 026 */ private org.apache.spark.sql.execution.metric.SQLMetric > agg_numOutputRows; > /* 027 */ private org.apache.spark.sql.execution.metric.SQLMetric > agg_aggTime; > /* 028 */ private UnsafeRow agg_result; > /* 029 */ private > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder; > /* 030 */ private > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter > agg_rowWriter; > /* 031 */ private org.apache.spark.sql.execution.metric.SQLMetric > agg_numOutputRows1; > /* 032 */ private org.apache.spark.sql.execution.metric.SQLMetric > agg_aggTime1; > /* 033 */ private UnsafeRow agg_result1; > /* 034 */ private > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder1; > /* 035 */ private > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter > agg_rowWriter1; > /* 036 */ > /* 037 */ public GeneratedIterator(Object[] references) { > /* 038 */ this.references = references; > /* 039 */ } > /* 040 */ > /* 041 */ public void init(int index, scala.collection.Iterator[] inputs) { > /* 042 */ partitionIndex = index; > /* 043 */ this.inputs = inputs; > /* 044 */ wholestagecodegen_init_0(); > /* 045 */ wholestagecodegen_init_1(); > /* 046 */ > /* 047 */ } > /* 048 */ > /* 049 */ private void wholestagecodegen_init_0() { > /* 050 */ agg_initAgg = false; > /* 051 */ > /* 052 
*/ agg_initAgg1 = false; > /* 053 */ > /* 054 */ smj_leftInput = inputs[0]; > /* 055 */ smj_rightInput = inputs[1]; > /* 056 */ > /* 057 */ smj_rightRow = null; > /* 058 */ > /* 059 */ smj_matches = new java.util.ArrayList(); > /* 060 */ > /* 061 */ this.smj_numOutputRows = > (org.apache.spark.sql.execution.metric.SQLMetric) references[0]; > /* 062 */ smj_result = new UnsafeRow(2); > /* 063 */ this.smj_holder = new > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(smj_result, > 64); > /* 064 */ this.smj_rowWriter = new >
[jira] [Commented] (SPARK-22202) Release tgz content differences for python and R
[ https://issues.apache.org/jira/browse/SPARK-22202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193111#comment-16193111 ] Shivaram Venkataraman commented on SPARK-22202: --- I think the differences happen because we build the CRAN package from one of the Hadoop versions ? > Release tgz content differences for python and R > > > Key: SPARK-22202 > URL: https://issues.apache.org/jira/browse/SPARK-22202 > Project: Spark > Issue Type: Bug > Components: PySpark, SparkR >Affects Versions: 2.1.2, 2.2.1, 2.3.0 >Reporter: Felix Cheung > > As a follow up to SPARK-22167, currently we are running different > profiles/steps in make-release.sh for hadoop2.7 vs hadoop2.6 (and others), we > should consider if these differences are significant and whether they should > be addressed. > A couple of things: > - R.../doc directory is not in any release jar except hadoop 2.6 > - python/dist, python.egg-info are not in any release jar except hadoop 2.7 > - R DESCRIPTION has a few additions > I've checked to confirm these are the same in 2.1.1 release so this isn't a > regression. > {code} > spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/doc: > sparkr-vignettes.Rmd > sparkr-vignettes.R > sparkr-vignettes.html > index.html > Only in spark-2.1.2-bin-hadoop2.7/python: dist > Only in spark-2.1.2-bin-hadoop2.7/python/pyspark: python > Only in spark-2.1.2-bin-hadoop2.7/python: pyspark.egg-info > diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/DESCRIPTION > spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/DESCRIPTION > 25a26,27 > > NeedsCompilation: no > > Packaged: 2017-10-03 00:42:30 UTC; holden > 31c33 > < Built: R 3.4.1; ; 2017-10-02 23:18:21 UTC; unix > --- > > Built: R 3.4.1; ; 2017-10-03 00:45:27 UTC; unix > Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR: doc > diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/html/00Index.html > spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/html/00Index.html > 16a17 > > User guides, package vignettes and other > > documentation. 
> Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/Meta: vignette.rds > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
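The per-build differences quoted above were found by recursively diffing two extracted release trees; the same check can be scripted. A minimal sketch using Python's {{filecmp.dircmp}} against two throwaway stand-in directories (the paths and file contents here are illustrative, not the real tarballs):

```python
import filecmp
import os
import tempfile

# Hypothetical stand-ins for two extracted release trees.
root = tempfile.mkdtemp()
h26 = os.path.join(root, "hadoop2.6", "R", "lib", "SparkR")
h27 = os.path.join(root, "hadoop2.7", "R", "lib", "SparkR")
os.makedirs(os.path.join(h26, "doc"))  # doc/ only ships in the 2.6 build
os.makedirs(h27)
for d in (h26, h27):
    with open(os.path.join(d, "DESCRIPTION"), "w") as f:
        f.write("Package: SparkR\n")

cmp_result = filecmp.dircmp(os.path.join(root, "hadoop2.6"),
                            os.path.join(root, "hadoop2.7"))

def report(dc):
    """Recursively list entries present on only one side, like `diff -rq`."""
    out = []
    out += ["only in left: %s" % n for n in dc.left_only]
    out += ["only in right: %s" % n for n in dc.right_only]
    for sub in dc.subdirs.values():
        out += report(sub)
    return out

print(report(cmp_result))  # ['only in left: doc']
```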
[jira] [Commented] (SPARK-22208) Improve percentile_approx by not rounding up targetError and starting from index 0
[ https://issues.apache.org/jira/browse/SPARK-22208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193078#comment-16193078 ] Apache Spark commented on SPARK-22208: -- User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/19438 > Improve percentile_approx by not rounding up targetError and starting from > index 0 > -- > > Key: SPARK-22208 > URL: https://issues.apache.org/jira/browse/SPARK-22208 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang > > percentile_approx never returns the first element when percentile is in > (relativeError, 1/N], where relativeError default is 1/1, and N is the > total number of elements. But ideally, percentiles in [0, 1/N] should all > return the first element as the answer. > For example, given input data 1 to 10, if a user queries 10% (or even less) > percentile, it should return 1, because the first value 1 already reaches > 10%. Currently it returns 2. > Based on the paper, targetError is not rounded up, and searching index should > start from 0 instead of 1. By following the paper, we should be able to fix > the cases mentioned above. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22208) Improve percentile_approx by not rounding up targetError and starting from index 0
[ https://issues.apache.org/jira/browse/SPARK-22208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22208: Assignee: Apache Spark > Improve percentile_approx by not rounding up targetError and starting from > index 0 > -- > > Key: SPARK-22208 > URL: https://issues.apache.org/jira/browse/SPARK-22208 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang >Assignee: Apache Spark > > percentile_approx never returns the first element when percentile is in > (relativeError, 1/N], where relativeError default is 1/1, and N is the > total number of elements. But ideally, percentiles in [0, 1/N] should all > return the first element as the answer. > For example, given input data 1 to 10, if a user queries 10% (or even less) > percentile, it should return 1, because the first value 1 already reaches > 10%. Currently it returns 2. > Based on the paper, targetError is not rounded up, and searching index should > start from 0 instead of 1. By following the paper, we should be able to fix > the cases mentioned above. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22208) Improve percentile_approx by not rounding up targetError and starting from index 0
[ https://issues.apache.org/jira/browse/SPARK-22208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22208: Assignee: (was: Apache Spark) > Improve percentile_approx by not rounding up targetError and starting from > index 0 > -- > > Key: SPARK-22208 > URL: https://issues.apache.org/jira/browse/SPARK-22208 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang > > percentile_approx never returns the first element when percentile is in > (relativeError, 1/N], where relativeError default is 1/1, and N is the > total number of elements. But ideally, percentiles in [0, 1/N] should all > return the first element as the answer. > For example, given input data 1 to 10, if a user queries 10% (or even less) > percentile, it should return 1, because the first value 1 already reaches > 10%. Currently it returns 2. > Based on the paper, targetError is not rounded up, and searching index should > start from 0 instead of 1. By following the paper, we should be able to fix > the cases mentioned above. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22208) Improve percentile_approx by not rounding up targetError and starting from index 0
Zhenhua Wang created SPARK-22208: Summary: Improve percentile_approx by not rounding up targetError and starting from index 0 Key: SPARK-22208 URL: https://issues.apache.org/jira/browse/SPARK-22208 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Zhenhua Wang percentile_approx never returns the first element when percentile is in (relativeError, 1/N], where relativeError default is 1/1, and N is the total number of elements. But ideally, percentiles in [0, 1/N] should all return the first element as the answer. For example, given input data 1 to 10, if a user queries 10% (or even less) percentile, it should return 1, because the first value 1 already reaches 10%. Currently it returns 2. Based on the paper, targetError is not rounded up, and searching index should start from 0 instead of 1. By following the paper, we should be able to fix the cases mentioned above. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
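The behaviour the ticket asks for (no rounding up of the target rank, scanning from index 0) can be illustrated outside Spark. A toy sketch over an exact sorted sample — not Spark's approximate implementation — showing that any percentile at or below 1/N then maps to the first element (function and variable names are illustrative):

```python
def percentile_exact_rank(sorted_data, p):
    """Return the first element whose cumulative rank covers p * N.

    The target rank is not rounded up, and the scan starts at index 0,
    so any p in (0, 1/N] maps to the first element.
    """
    n = len(sorted_data)
    target = p * n  # not rounded up
    for i, v in enumerate(sorted_data):
        if (i + 1) >= target:  # rank of sorted_data[i] is i + 1
            return v
    return sorted_data[-1]

data = list(range(1, 11))
# The first value already accounts for 10% of the 10 elements.
print(percentile_exact_rank(data, 0.10))  # 1
```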
[jira] [Assigned] (SPARK-22206) gapply in R can't work on empty grouping columns
[ https://issues.apache.org/jira/browse/SPARK-22206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-22206: Assignee: Liang-Chi Hsieh > gapply in R can't work on empty grouping columns > > > Key: SPARK-22206 > URL: https://issues.apache.org/jira/browse/SPARK-22206 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.2.1, 2.3.0, 2.1.3 > > > {{gapply}} in R invokes {{FlatMapGroupsInRExec}} in runtime, but > {{FlatMapGroupsInRExec.requiredChildDistribution}} didn't consider empty > grouping attributes. So {{gapply}} can't work on empty grouping columns. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22206) gapply in R can't work on empty grouping columns
[ https://issues.apache.org/jira/browse/SPARK-22206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22206. -- Resolution: Fixed Fix Version/s: 2.1.3 2.3.0 2.2.1 Issue resolved by pull request 19436 [https://github.com/apache/spark/pull/19436] > gapply in R can't work on empty grouping columns > > > Key: SPARK-22206 > URL: https://issues.apache.org/jira/browse/SPARK-22206 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh > Fix For: 2.2.1, 2.3.0, 2.1.3 > > > {{gapply}} in R invokes {{FlatMapGroupsInRExec}} in runtime, but > {{FlatMapGroupsInRExec.requiredChildDistribution}} didn't consider empty > grouping attributes. So {{gapply}} can't work on empty grouping columns. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22201) Dataframe describe includes string columns
[ https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192938#comment-16192938 ] cold gin commented on SPARK-22201: -- Ok, and thank you, I appreciate your time and feedback. Having the numeric columns automatically pre-selected by default would make for a more robust api imo, (ie - no column list to supply by default). What you said about pandas having a parameter to include strings seems to support the default behavior of numeric columns only also. > Dataframe describe includes string columns > -- > > Key: SPARK-22201 > URL: https://issues.apache.org/jira/browse/SPARK-22201 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: cold gin >Priority: Minor > > As per the api documentation, the default no-arg Dataframe describe() > function should only include numerical column types, but it is including > String types as well. This creates unusable statistical results (for example, > max returns "V8903" for one of the string columns in my dataset), and this > also leads to stacktraces when you run show() on the resulting dataframe > returned from describe(). > There also appears to be several related issues to this: > https://issues.apache.org/jira/browse/SPARK-16468 > https://issues.apache.org/jira/browse/SPARK-16429 > But SPARK-16429 does not make sense with what the default api says, and only > Int, Double, etc (numeric) columns should be included when generating the > statistics. > Perhaps this reveals the need for a new function to produce stats that make > sense only for string columns, or else an additional parameter to describe() > to filter in/out certain column types? > In summary, the *default* describe api behavior (no arg behavior) should not > include string columns. 
Note that boolean columns are correctly excluded by > describe() -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
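The pandas behaviour mentioned above — numeric columns summarized by default, strings opted in via a parameter — is easy to verify directly. A minimal sketch (column names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "code": ["V8903", "A1", "B2"]})

# Default describe() keeps only the numeric column...
print(list(df.describe().columns))  # ['id']

# ...while include='all' opts the string column in as well.
print(list(df.describe(include="all").columns))  # ['id', 'code']
```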
[jira] [Updated] (SPARK-22201) Dataframe describe includes string columns
[ https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-22201: -- Flags: (was: Patch) Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) I'm neutral on it because it's just the default and anyone who cares can always select the exact columns that are desired. At best it's an enhancement request, but not sure it will be taken up. > Dataframe describe includes string columns > -- > > Key: SPARK-22201 > URL: https://issues.apache.org/jira/browse/SPARK-22201 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: cold gin >Priority: Minor > > As per the api documentation, the default no-arg Dataframe describe() > function should only include numerical column types, but it is including > String types as well. This creates unusable statistical results (for example, > max returns "V8903" for one of the string columns in my dataset), and this > also leads to stacktraces when you run show() on the resulting dataframe > returned from describe(). > There also appears to be several related issues to this: > https://issues.apache.org/jira/browse/SPARK-16468 > https://issues.apache.org/jira/browse/SPARK-16429 > But SPARK-16429 does not make sense with what the default api says, and only > Int, Double, etc (numeric) columns should be included when generating the > statistics. > Perhaps this reveals the need for a new function to produce stats that make > sense only for string columns, or else an additional parameter to describe() > to filter in/out certain column types? > In summary, the *default* describe api behavior (no arg behavior) should not > include string columns. Note that boolean columns are correctly excluded by > describe() -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21999) ConcurrentModificationException - Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192732#comment-16192732 ] Steve Loughran edited comment on SPARK-21999 at 10/5/17 1:39 PM: - Apache projects are all open source, with an open community. The ASF welcomes contributions, and does request that people follow an etiquette of constructive discourse: https://community.apache.org/contributors/etiquette Personally, I've always found Sean to be helpful and polite, even when I've turned out to be utterly wrong. I'd advise listening to his feedback, and not view it as a personal criticism. Whatever problem you have, faulting him isn't going to fix it. Like I said, projects welcomes contributors and their contributions. # One key area where people can get their hands dirty in a project is documentation. Do you think some content in the spark streaming guide could be improved? Why not add that section & submit a patch for review? # Because you'll need the latest source tree to hand to do it, before you even worry about documenting a problem, why not see if it "has gone away", or at least "moved somewhere else"? Working with the latest versions of the code may seem leading edge, but its the best way to get your problems fixed, and the only way to pick up an immediate patch. # For any issue in a project, it doesn't get fixed ASAP, at least not in a form you can pick up and use unless you have that build environment yourself. If you are using a software product built on spark, well, you'll need to wait for that supplier to update their release, which happens on their release schedule. For the ASF releases, there's a release process and timetable. # Which means: for now, you have to come up with a solution to your problem. 
Given it's related to structure serialization, well, you can implement your own {{readObject}} and {{writeObject}} methods, and so perhaps wrap the structures being serialized with something which implements a broader locking structure across your data. It's what I'd try first, as it would be an interesting little problem: A datastructure which can have mutating operations invoked while serialization is in progress. Testing this would be fun...you'd need some kind of mock OutputStream which could block on a write() so you could guarantee the writer thread was blocking at the correct place for the mutation call to trigger the failure. As to whether it's architectural? That's something which could be debated. Checkpointing the state of streaming applications is one of those challenging problems, especially if the notion of "where you are" across multiple input streams is hard, and, if the cost of preserving state is high, any attempt to block for the operation will hurt throughput. And changing checkpoint architectures is so fundamental that its not some one-line patch. If you can fix your structures up to be robustly serializable during asynchronous writes, you'll get performance as well as robustness.
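The wrapper idea in the comment above is Java's private {{readObject}}/{{writeObject}} hook; the same shape can be sketched in Python using pickle's {{__getstate__}}/{{__setstate__}}, taking the mutation lock while state is captured so a checkpoint never observes the structure mid-update (class and field names are illustrative, not from any Spark API):

```python
import pickle
import threading

class LockedCheckpointable:
    """Serialization and mutation share one lock, so a snapshot taken by
    pickle can never observe the list mid-append."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = []

    def add(self, item):
        with self._lock:
            self._items.append(item)

    def __getstate__(self):
        # Called by pickle; holding the lock blocks concurrent add().
        with self._lock:
            return {"_items": list(self._items)}

    def __setstate__(self, state):
        # Locks are not picklable, so recreate one on deserialization.
        self._lock = threading.Lock()
        self._items = state["_items"]

c = LockedCheckpointable()
c.add("event-1")
restored = pickle.loads(pickle.dumps(c))
print(restored._items)  # ['event-1']
```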
[jira] [Comment Edited] (SPARK-22201) Dataframe describe includes string columns
[ https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192834#comment-16192834 ] cold gin edited comment on SPARK-22201 at 10/5/17 1:24 PM: --- Yes, it is only the default behavior that I think should be reversed; I don't have a problem at all with supporting the stats for strings. If you look the output of the describe() no-args api, it produces several fields (count, mean, stddev, etc) - by default. For all of those output attributes to be populated by default you must include only numeric columns. This simple evidence of what the default output produces is my argument for what should be included as the default input. Supporting string columns is fine, but should be controlled with an includeColTypes parameter, and not included by default imo. was (Author: cold-gin): Yes, it is only the default behavior that I think should be reversed; I don't have a problem at all with supporting the stats for strings. If you look the output of the describe() no-args api, it produces several fields (count, mean, stddev, etc) - by default. For all of those output attributes to be populated by default you must include only numeric columns. This simple evidence of what the default output produces is my argument for what should be included as the default input. Supporting string columns imo is fine, but should be controlled with an includeColTypes parameter, and not included by default. > Dataframe describe includes string columns > -- > > Key: SPARK-22201 > URL: https://issues.apache.org/jira/browse/SPARK-22201 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: cold gin > > As per the api documentation, the default no-arg Dataframe describe() > function should only include numerical column types, but it is including > String types as well. 
This creates unusable statistical results (for example, > max returns "V8903" for one of the string columns in my dataset), and this > also leads to stacktraces when you run show() on the resulting dataframe > returned from describe(). > There also appears to be several related issues to this: > https://issues.apache.org/jira/browse/SPARK-16468 > https://issues.apache.org/jira/browse/SPARK-16429 > But SPARK-16429 does not make sense with what the default api says, and only > Int, Double, etc (numeric) columns should be included when generating the > statistics. > Perhaps this reveals the need for a new function to produce stats that make > sense only for string columns, or else an additional parameter to describe() > to filter in/out certain column types? > In summary, the *default* describe api behavior (no arg behavior) should not > include string columns. Note that boolean columns are correctly excluded by > describe() -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22201) Dataframe describe includes string columns
[ https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192834#comment-16192834 ] cold gin edited comment on SPARK-22201 at 10/5/17 1:23 PM: --- Yes, it is only the default behavior that I think should be reversed; I don't have a problem at all with supporting the stats for strings. If you look the output of the describe() no-args api, it produces several fields (count, mean, stddev, etc) - by default. For all of those output attributes to be populated by default you must include only numeric columns. This simple evidence of what the default output produces is my argument for what should be included as the default input. Supporting string columns imo is fine, but should be controlled with an includeColTypes parameter, and not included by default. was (Author: cold-gin): Yes, it is only the default behavior that I think should be reversed; I don't have a problem at all with supporting the stats for strings. If you look the default output of the describe() api, it produces several fields (count, mean, stddev, etc) - by default. For all of those output attributes to be populated by default you must include only numeric columns. This simple evidence of what the default output produces is my argument for what should be included as the default input. Supporting string columns imo is fine, but should be controlled with an includeColTypes parameter, and not included by default. > Dataframe describe includes string columns > -- > > Key: SPARK-22201 > URL: https://issues.apache.org/jira/browse/SPARK-22201 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: cold gin > > As per the api documentation, the default no-arg Dataframe describe() > function should only include numerical column types, but it is including > String types as well. 
This creates unusable statistical results (for example, > max returns "V8903" for one of the string columns in my dataset), and this > also leads to stacktraces when you run show() on the resulting dataframe > returned from describe(). > There also appears to be several related issues to this: > https://issues.apache.org/jira/browse/SPARK-16468 > https://issues.apache.org/jira/browse/SPARK-16429 > But SPARK-16429 does not make sense with what the default api says, and only > Int, Double, etc (numeric) columns should be included when generating the > statistics. > Perhaps this reveals the need for a new function to produce stats that make > sense only for string columns, or else an additional parameter to describe() > to filter in/out certain column types? > In summary, the *default* describe api behavior (no arg behavior) should not > include string columns. Note that boolean columns are correctly excluded by > describe() -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22201) Dataframe describe includes string columns
[ https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192834#comment-16192834 ] cold gin edited comment on SPARK-22201 at 10/5/17 1:22 PM: --- Yes, it is only the default behavior that I think should be reversed; I don't have a problem at all with supporting the stats for strings. If you look the default output of the describe() api, it produces several fields (count, mean, stddev, etc) - by default. For all of those output attributes to be populated by default you must include only numeric columns. This simple evidence of what the default output produces is my argument for what should be included as the default input. Supporting string columns imo is fine, but should be controlled with an includeColTypes parameter, and not included by default. was (Author: cold-gin): Yes, it is only the default behavior that I think should be reversed; I don't have a problem at all with supporting the stats for strings. If you look the default output of the describe() api, it produces several fields (count, mean, stddev, etc) - by default. For all of those output attributes to be populated *by default* you must include only numeric columns. This simple evidence of what the default output produces is my argument for what should be included as the default input. Supporting string columns imo is fine, but should be controlled with an includeColTypes parameter, and not included by default. > Dataframe describe includes string columns > -- > > Key: SPARK-22201 > URL: https://issues.apache.org/jira/browse/SPARK-22201 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: cold gin > > As per the api documentation, the default no-arg Dataframe describe() > function should only include numerical column types, but it is including > String types as well. 
This creates unusable statistical results (for example, > max returns "V8903" for one of the string columns in my dataset), and this > also leads to stacktraces when you run show() on the resulting dataframe > returned from describe(). > There also appears to be several related issues to this: > https://issues.apache.org/jira/browse/SPARK-16468 > https://issues.apache.org/jira/browse/SPARK-16429 > But SPARK-16429 does not make sense with what the default api says, and only > Int, Double, etc (numeric) columns should be included when generating the > statistics. > Perhaps this reveals the need for a new function to produce stats that make > sense only for string columns, or else an additional parameter to describe() > to filter in/out certain column types? > In summary, the *default* describe api behavior (no arg behavior) should not > include string columns. Note that boolean columns are correctly excluded by > describe() -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22131) Add Mesos Secrets Support to the Mesos Driver
[ https://issues.apache.org/jira/browse/SPARK-22131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192799#comment-16192799 ] Apache Spark commented on SPARK-22131: -- User 'susanxhuynh' has created a pull request for this issue: https://github.com/apache/spark/pull/19437 > Add Mesos Secrets Support to the Mesos Driver > - > > Key: SPARK-22131 > URL: https://issues.apache.org/jira/browse/SPARK-22131 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Arthur Rand > > We recently added Secrets support to the Dispatcher (SPARK-20812). In order > to have Driver-to-Executor TLS we need the same support in the Mesos Driver > so a secret can be disseminated to the executors. This JIRA is to move the > current secrets implementation to be used by both frameworks. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21999) ConcurrentModificationException - Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192732#comment-16192732 ] Steve Loughran commented on SPARK-21999: Apache projects are all open source, with an open community. The ASF welcomes contributions, and does request that people follow an etiquette of constructive discourse: https://community.apache.org/contributors/etiquette Personally, I've always found Sean to be helpful and polite, even when I've turned out to be utterly wrong. I'd advise listening to his feedback, and not viewing it as personal criticism. Whatever problem you have, faulting him isn't going to fix it. As I said, projects welcome contributors and their contributions. # One key area where people can get their hands dirty in a project is documentation. Do you think some content in the Spark Streaming guide could be improved? Why not add that section & submit a patch for review? # Because you'll need the latest source tree to hand to do it, before you even worry about documenting a problem, why not see if the problem "has gone away", or at least "moved somewhere else"? Working with the latest version of the code may seem leading edge, but it's the best way to get your problems fixed, and the only way to pick up an immediate patch. # No issue in a project gets fixed ASAP, at least not in a form you can pick up and use, unless you have that build environment yourself. If you are using a software product built on Spark, you'll need to wait for that supplier to update their release, which happens on their release schedule. For the ASF releases, there's a release process and timetable. # Which means: for now, you have to come up with a solution to your problem. 
Given it's related to structure serialization, you can implement your own {{readObject}} and {{writeObject}} methods, and so perhaps wrap the structures being serialized with something which implements a broader locking structure across your data. It's what I'd try first, as it would be an interesting little problem: a data structure which can have mutating operations invoked while serialization is in progress. Testing this would be fun... you'd need some kind of mock OutputStream which could block on a write() so you could guarantee the writer thread was blocking at the correct place for the mutation call to trigger the failure. As to whether it's architectural, well, that's something which could be debated. Checkpointing the state of streaming applications is one of those challenging problems, especially if the notion of "where you are" across multiple input streams is hard, and, if the cost of preserving state is high, any attempt to block for the operation will hurt throughput. And changing checkpoint architectures is so fundamental that it's not some one-line patch. If you can fix your structures up to be robustly serializable during asynchronous writes, you'll get performance as well as robustness. > ConcurrentModificationException - Spark Streaming > - > > Key: SPARK-21999 > URL: https://issues.apache.org/jira/browse/SPARK-21999 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Michael N >Priority: Critical > > Hi, > I am using Spark Streaming v2.1.0 with Kafka 0.8. I am getting > ConcurrentModificationException intermittently. When it occurs, Spark does > not honor the specified value of spark.task.maxFailures. So Spark aborts the > current batch and fetches the next batch, resulting in lost data. Its > exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at > https://issues.apache.org/jira/browse/SPARK-17463, which was about > serialization of accumulators in heartbeats. However, my Spark stream app > does not use accumulators. > The stack trace listed below occurred on the Spark master in the Spark streaming > driver at the time of data loss. > From the line of code in the first stack trace, can you tell which object > Spark was trying to serialize? What is the root cause of this issue? > Because this issue results in lost data as described above, could you have > this issue fixed ASAP? > Thanks. > Michael N., > > Stack trace of Spark Streaming driver > ERROR JobScheduler:91: Error generating jobs for time 150522493 ms > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) > at
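The wrapping approach suggested in the comment above can be sketched in plain JVM Scala. This is a hypothetical, simplified illustration (no Spark APIs; `SyncList` is an invented name): a serializable holder whose custom writeObject takes the same monitor as its mutators, so a checkpointing thread cannot serialize the structure while an application thread is mutating it.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical sketch, not a Spark API: a serializable holder whose
// writeObject holds the same lock as its mutators, so serialization sees a
// consistent snapshot and concurrent add() calls block until it finishes.
class SyncList[T] extends Serializable {
  private var items: List[T] = Nil // immutable list, swapped under the lock

  def add(item: T): Unit = this.synchronized { items = item :: items }
  def snapshot: List[T] = this.synchronized { items }

  // Java serialization hook: hold the monitor for the duration of the write,
  // preventing the mid-serialization mutation behind the reported exception.
  private def writeObject(out: ObjectOutputStream): Unit = this.synchronized {
    out.defaultWriteObject()
  }
}
```

The trade-off is the one noted above: if serializing the state is expensive, mutators block for the whole write, which costs throughput.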
[jira] [Updated] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception
[ https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-22163: --- Priority: Major (was: Critical) > Design Issue of Spark Streaming that Causes Random Run-time Exception > - > > Key: SPARK-22163 > URL: https://issues.apache.org/jira/browse/SPARK-22163 > Project: Spark > Issue Type: Bug > Components: DStreams, Structured Streaming >Affects Versions: 2.2.0 > Environment: Spark Streaming > Kafka > Linux >Reporter: Michael N > > The application's objects can contain Lists and can be modified dynamically as > well. However, the Spark Streaming framework asynchronously serializes the > application's objects as the application runs. Therefore, it causes a random > run-time exception on a List when the Spark Streaming framework happens to > serialize the application's objects while the application modifies a List in > its own object. > In fact, there are multiple bugs reported about > Caused by: java.util.ConcurrentModificationException > at java.util.ArrayList.writeObject > that are permutations of the same root cause. So the design issue of the Spark > Streaming framework is that it does this serialization asynchronously. > Instead, it should either > 1. do this serialization synchronously. This is preferred, as it eliminates the > issue completely. Or > 2. allow it to be configured per application whether to do this serialization > synchronously or asynchronously, depending on the nature of each application. > Also, the Spark documentation should describe the conditions that trigger Spark > to do this type of serialization asynchronously, so applications can work > around them until a fix is provided. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
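The failure mode named in the description (java.util.ArrayList.writeObject throwing ConcurrentModificationException) can be reproduced deterministically without Spark. In this hypothetical sketch, an element's own serialization hook mutates the enclosing list mid-write, standing in for an application thread racing the asynchronous checkpoint; `Shared`, `Mutator`, and `raceTriggersCme` are invented names for illustration.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.util

// Shared mutable state, standing in for application objects that Spark
// Streaming serializes asynchronously while the application keeps mutating them.
object Shared {
  val list = new util.ArrayList[AnyRef]()
}

// An element whose serialization hook mutates the enclosing list, turning
// the timing-dependent race into a deterministic one.
class Mutator extends java.io.Serializable {
  private def writeObject(out: ObjectOutputStream): Unit = {
    Shared.list.add("mutated mid-write") // bumps ArrayList.modCount
    out.defaultWriteObject()
  }
}

// Returns true when serialization fails with ConcurrentModificationException:
// ArrayList.writeObject compares modCount before and after writing its
// elements and throws on a mismatch.
def raceTriggersCme(): Boolean = {
  Shared.list.add(new Mutator)
  val oos = new ObjectOutputStream(new ByteArrayOutputStream())
  try { oos.writeObject(Shared.list); false }
  catch { case _: util.ConcurrentModificationException => true }
}
```

This is why the exception is intermittent in production: it only fires when the checkpoint write and the application's mutation happen to overlap.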
[jira] [Created] (SPARK-22207) High memory usage when converting relational data to Hierarchical data
kanika dhuria created SPARK-22207: - Summary: High memory usage when converting relational data to Hierarchical data Key: SPARK-22207 URL: https://issues.apache.org/jira/browse/SPARK-22207 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: kanika dhuria I have 4 tables: lineitems ~1.4GB, orders ~330MB, customer ~47MB, nations ~2.2K. These tables are related as follows: there are multiple lineitems per order (pk, fk: orderkey), multiple orders per customer (pk, fk: cust_key), and multiple customers per nation (pk, fk: nation key). Data is almost evenly distributed. Building the hierarchy to 3 levels, i.e. joining lineitems, orders, and customers, works well with 4GB/2 cores of executor memory. Adding nations requires 8GB/2 cores or 4GB/1 core. == {noformat}
val sqlContext = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.retainGroupColumns", false)
  .config("spark.sql.crossJoin.enabled", true)
  .getOrCreate()

val orders   = sqlContext.sql("select * from orders")
val lineItem = sqlContext.sql("select * from lineitems")
val customer = sqlContext.sql("select * from customers")
val nation   = sqlContext.sql("select * from nations")

val lineitemOrders = lineItem
  .groupBy(col("l_orderkey"))
  .agg(col("l_orderkey"), collect_list(struct(
    col("l_partkey"), col("l_suppkey"), col("l_linenumber"), col("l_quantity"),
    col("l_extendedprice"), col("l_discount"), col("l_tax"), col("l_returnflag"),
    col("l_linestatus"), col("l_shipdate"), col("l_commitdate"), col("l_receiptdate"),
    col("l_shipinstruct"), col("l_shipmode"))).as("lineitem"))
  .join(orders, orders("O_ORDERKEY") === lineItem("l_orderkey"))
  .select(col("O_ORDERKEY"), col("O_CUSTKEY"), col("O_ORDERSTATUS"), col("O_TOTALPRICE"),
    col("O_ORDERDATE"), col("O_ORDERPRIORITY"), col("O_CLERK"), col("O_SHIPPRIORITY"),
    col("O_COMMENT"), col("lineitem"))

val customerList = lineitemOrders
  .groupBy(col("o_custkey"))
  .agg(col("o_custkey"), collect_list(struct(
    col("O_ORDERKEY"), col("O_CUSTKEY"), col("O_ORDERSTATUS"), col("O_TOTALPRICE"),
    col("O_ORDERDATE"), col("O_ORDERPRIORITY"), col("O_CLERK"), col("O_SHIPPRIORITY"),
    col("O_COMMENT"), col("lineitem"))).as("items"))
  .join(customer, customer("c_custkey") === lineitemOrders("o_custkey"))
  .select(col("c_custkey"), col("c_name"), col("c_nationkey"), col("items"))

val nationList = customerList
  .groupBy(col("c_nationkey"))
  .agg(col("c_nationkey"), collect_list(struct(
    col("c_custkey"), col("c_name"), col("c_nationkey"), col("items"))).as("custList"))
  .join(nation, nation("n_nationkey") === customerList("c_nationkey"))
  .select(col("n_nationkey"), col("n_name"), col("custList"))

nationList.write.mode("overwrite").json("filePath")
{noformat} If customerList is saved to a file and the last agg/join is then run separately, it runs fine with 4GB/2 cores. I can provide the data if needed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org