[jira] [Commented] (SPARK-18878) Fix/investigate the more identified test failures in Java/Scala on Windows
[ https://issues.apache.org/jira/browse/SPARK-18878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750709#comment-15750709 ] Hyukjin Kwon commented on SPARK-18878: -- cc [~srowen] Currently, not all of the test failures could be identified, due to the one-hour time limit in AppVeyor. The limit was increased for my account after asking manually - https://github.com/appveyor/ci/issues/517 - but it seems it can't be increased much further (it was raised to one hour and 30 minutes). I will specify the errors within each child task after testing them separately by hand where required. > Fix/investigate the more identified test failures in Java/Scala on Windows > -- > > Key: SPARK-18878 > URL: https://issues.apache.org/jira/browse/SPARK-18878 > Project: Spark > Issue Type: Test > Components: Tests >Reporter: Hyukjin Kwon > > It seems many tests are failing on Windows. Some are related only to the > tests themselves, whereas others are related to the functionality itself, which > causes actual failures for some APIs on Windows. > The tests were hanging due to issues in SPARK-17591 and SPARK-18785, and > now we can apparently proceed much further (it seems we might > reach the end). > The tests ran via AppVeyor - > https://ci.appveyor.com/project/spark-test/spark/build/259-spark-test-windows -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18878) Fix/investigate the more identified test failures in Java/Scala on Windows
Hyukjin Kwon created SPARK-18878: Summary: Fix/investigate the more identified test failures in Java/Scala on Windows Key: SPARK-18878 URL: https://issues.apache.org/jira/browse/SPARK-18878 Project: Spark Issue Type: Test Components: Tests Reporter: Hyukjin Kwon It seems many tests are failing on Windows. Some are related only to the tests themselves, whereas others are related to the functionality itself, which causes actual failures for some APIs on Windows. The tests were hanging due to issues in SPARK-17591 and SPARK-18785, and now we can apparently proceed much further (it seems we might reach the end). The tests ran via AppVeyor - https://ci.appveyor.com/project/spark-test/spark/build/259-spark-test-windows
[jira] [Assigned] (SPARK-18669) Update Apache docs regarding watermarking in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-18669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18669: Assignee: Tathagata Das (was: Apache Spark) > Update Apache docs regarding watermarking in Structured Streaming > -- > > Key: SPARK-18669 > URL: https://issues.apache.org/jira/browse/SPARK-18669 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >
[jira] [Assigned] (SPARK-18669) Update Apache docs regarding watermarking in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-18669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18669: Assignee: Apache Spark (was: Tathagata Das) > Update Apache docs regarding watermarking in Structured Streaming > -- > > Key: SPARK-18669 > URL: https://issues.apache.org/jira/browse/SPARK-18669 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Tathagata Das >Assignee: Apache Spark >
[jira] [Commented] (SPARK-18669) Update Apache docs regarding watermarking in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-18669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750690#comment-15750690 ] Apache Spark commented on SPARK-18669: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/16294 > Update Apache docs regarding watermarking in Structured Streaming > -- > > Key: SPARK-18669 > URL: https://issues.apache.org/jira/browse/SPARK-18669 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >
[jira] [Assigned] (SPARK-17119) Add configuration property to allow the history server to delete .inprogress files
[ https://issues.apache.org/jira/browse/SPARK-17119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17119: Assignee: Apache Spark > Add configuration property to allow the history server to delete .inprogress > files > -- > > Key: SPARK-17119 > URL: https://issues.apache.org/jira/browse/SPARK-17119 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Bjorn Jonsson >Assignee: Apache Spark >Priority: Minor > Labels: historyserver > > The History Server (HS) currently only considers completed applications when > deleting event logs from spark.history.fs.logDirectory (since SPARK-6879). > This means that over time, .inprogress files (from failed jobs, jobs where > the SparkContext is not closed, spark-shell exits etc...) can accumulate and > impact the HS. > Instead of having to manually delete these files, maybe users could have the > option of telling the HS to delete all files where (now - > attempt.lastUpdated) > spark.history.fs.cleaner.maxAge, or just delete > .inprogress files with lastUpdated older than 7d? > https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L467
[jira] [Commented] (SPARK-17119) Add configuration property to allow the history server to delete .inprogress files
[ https://issues.apache.org/jira/browse/SPARK-17119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750683#comment-15750683 ] Apache Spark commented on SPARK-17119: -- User 'cnZach' has created a pull request for this issue: https://github.com/apache/spark/pull/16293 > Add configuration property to allow the history server to delete .inprogress > files > -- > > Key: SPARK-17119 > URL: https://issues.apache.org/jira/browse/SPARK-17119 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Bjorn Jonsson >Priority: Minor > Labels: historyserver > > The History Server (HS) currently only considers completed applications when > deleting event logs from spark.history.fs.logDirectory (since SPARK-6879). > This means that over time, .inprogress files (from failed jobs, jobs where > the SparkContext is not closed, spark-shell exits etc...) can accumulate and > impact the HS. > Instead of having to manually delete these files, maybe users could have the > option of telling the HS to delete all files where (now - > attempt.lastUpdated) > spark.history.fs.cleaner.maxAge, or just delete > .inprogress files with lastUpdated older than 7d? > https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L467
[jira] [Assigned] (SPARK-17119) Add configuration property to allow the history server to delete .inprogress files
[ https://issues.apache.org/jira/browse/SPARK-17119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17119: Assignee: (was: Apache Spark) > Add configuration property to allow the history server to delete .inprogress > files > -- > > Key: SPARK-17119 > URL: https://issues.apache.org/jira/browse/SPARK-17119 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Bjorn Jonsson >Priority: Minor > Labels: historyserver > > The History Server (HS) currently only considers completed applications when > deleting event logs from spark.history.fs.logDirectory (since SPARK-6879). > This means that over time, .inprogress files (from failed jobs, jobs where > the SparkContext is not closed, spark-shell exits etc...) can accumulate and > impact the HS. > Instead of having to manually delete these files, maybe users could have the > option of telling the HS to delete all files where (now - > attempt.lastUpdated) > spark.history.fs.cleaner.maxAge, or just delete > .inprogress files with lastUpdated older than 7d? > https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L467
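The age-based rule proposed in the description above can be sketched as follows. This is only a minimal Python illustration of the (now - attempt.lastUpdated) > maxAge filter; the log-entry shape and function name are hypothetical, not the actual FsHistoryProvider logic.

```python
import time

SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000


def logs_to_delete(attempts, max_age_ms=SEVEN_DAYS_MS, now_ms=None):
    """Return paths of .inprogress event logs whose last update is older
    than max_age_ms. `attempts` is a list of dicts with a `path` and a
    `lastUpdated` epoch-millisecond timestamp (illustrative shape only)."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return [
        a["path"]
        for a in attempts
        if a["path"].endswith(".inprogress")
        and now_ms - a["lastUpdated"] > max_age_ms
    ]
```

Completed logs (no .inprogress suffix) are left to the existing cleaner; only stale in-progress files are selected.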
[jira] [Created] (SPARK-18877) Unable to read given csv data. Exception: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20
Navya Krishnappa created SPARK-18877: Summary: Unable to read given csv data. Exception: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20 Key: SPARK-18877 URL: https://issues.apache.org/jira/browse/SPARK-18877 Project: Spark Issue Type: Bug Reporter: Navya Krishnappa When reading the CSV data below, even though the maximum decimal precision is 38, the following exception is thrown: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20 Decimal column values: 2323366225312000, 2433573971400, 23233662253000, 23233662253
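The error above is about decimal precision: the number of significant decimal digits a value needs versus what the inferred type allows. A small sketch using Python's decimal module (helper names are ours, not Spark's; Spark's actual checks live in its DecimalType) shows how the required precision of a value can be computed and compared against a type's declared precision:

```python
from decimal import Decimal


def required_precision(value: str) -> int:
    """Number of significant decimal digits needed to represent the value
    exactly - the 'precision' cited in the Spark error message."""
    return len(Decimal(value).as_tuple().digits)


def fits(value: str, precision: int) -> bool:
    """True if the value can be stored in a decimal type of the given
    precision (scale is ignored in this sketch)."""
    return required_precision(value) <= precision
```

A 16-digit value such as 2323366225312000 fits a precision-20 type, but any value needing 28 digits cannot, which matches the "precision 28 exceeds max precision 20" failure.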
[jira] [Commented] (SPARK-18455) General support for correlated subquery processing
[ https://issues.apache.org/jira/browse/SPARK-18455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750638#comment-15750638 ] Reynold Xin commented on SPARK-18455: - Thanks for sharing the doc. This is a really well written survey of Spark's subquery support. Do you have documentation on how you plan to do de-correlation and the rest, i.e. for PR2 and PR4? > General support for correlated subquery processing > -- > > Key: SPARK-18455 > URL: https://issues.apache.org/jira/browse/SPARK-18455 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Nattavut Sutyanyong > Attachments: SPARK-18455-scoping-doc.pdf > > > Subquery support was introduced in Spark 2.0. The initial implementation > covers the most common subquery use cases: the ones used in TPC queries, for > instance. > Spark currently supports the following subqueries: > * Uncorrelated Scalar Subqueries. All cases are supported. > * Correlated Scalar Subqueries. We only allow subqueries that are aggregated > and use equality predicates. > * Predicate Subqueries. IN or EXISTS types of queries. We allow most > predicates, except when they are pulled from under an Aggregate or Window > operator. In that case we only support equality predicates. > However this does not cover the full range of possible subqueries. This, in > part, has to do with the fact that we currently rewrite all correlated > subqueries into a (LEFT/LEFT SEMI/LEFT ANTI) join. > We currently lack support for the following use cases: > * The use of predicate subqueries in a projection. > * The use of non-equality predicates below Aggregate and/or Window operators. > * The use of non-Aggregate subqueries for correlated scalar subqueries. > This JIRA aims to lift these current limitations in subquery processing.
[jira] [Commented] (SPARK-18862) Split SparkR mllib.R into multiple files
[ https://issues.apache.org/jira/browse/SPARK-18862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750614#comment-15750614 ] Felix Cheung commented on SPARK-18862: -- ah :) Would we end up having mllib-gmm.R, mllib-als.R and so on though? I do worry about having 20-30 mllib-* files. > Split SparkR mllib.R into multiple files > > > Key: SPARK-18862 > URL: https://issues.apache.org/jira/browse/SPARK-18862 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Yanbo Liang > > SparkR mllib.R is getting bigger as we add more ML wrappers; I'd like to > split it into multiple files to make it easier to maintain: > * mllibClassification.R > * mllibRegression.R > * mllibClustering.R > * mllibFeature.R > or: > * mllib/classification.R > * mllib/regression.R > * mllib/clustering.R > * mllib/features.R > By R convention, the first way is preferred. And I'm not sure whether R > supports the second layout (will check later). Please let me know your > preference. I think the start of a new release cycle is a good opportunity to > do this, since it will involve fewer conflicts. If this proposal is > approved, I can work on it. > cc [~felixcheung] [~josephkb] [~mengxr]
[jira] [Created] (SPARK-18876) An error occurred while trying to connect to the Java server
pranludi created SPARK-18876: Summary: An error occurred while trying to connect to the Java server Key: SPARK-18876 URL: https://issues.apache.org/jira/browse/SPARK-18876 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.2, 2.0.1 Environment: Python 2.7.12 Reporter: pranludi I am trying to create a Spark context object with the following commands in pyspark:

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:35918)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 963, in start
    self.socket.connect((self.address, self.port))
  File "/usr/local/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
Traceback (most recent call last):
  File "", line 1, in
  File "/home/gamedev/spark-2.0.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 419, in coalesce
  File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 1131, in __call__
    answer = self.gateway_client.send_command(command)
  File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 881, in send_command
    connection = self._get_connection()
  File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 829, in _get_connection
    connection = self._create_connection()
  File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 835, in _create_connection
    connection.start()
  File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 970, in start
    raise Py4JNetworkError(msg, e)
py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to the Java server (127.0.0.1:35918)

I tried Spark versions 2.0.0, 2.0.1, and 2.0.2: there is no problem on 2.0.0, but the error occurs on 2.0.1 and 2.0.2. Python code:
df = spark.read.json('hdfs://big_big_400.json')
json_log = []
for log in df.collect():
    jj = {}
    try:
        for f in log.__fields__:
            if f == 'I_LogDes':
                if log[f] is not None:
                    log_des_json = json.loads(log[f])
                    for jf in log_des_json:
                        json_key = add_2(jf)
                        if json_key in jj:
                            json_key = '%s_2' % json_key
                        jj[json_key] = typeIntStr(log_des_json[jf])
            else:
                jj[remove_i(f)] = typeIntStr(log[f])
        json_log.append(jj)
    except:
        print log  # !!! the error occurs here
df = spark.read.json(spark.sparkContext.parallelize(json_log))
[jira] [Resolved] (SPARK-18849) Vignettes final checks for Spark 2.1
[ https://issues.apache.org/jira/browse/SPARK-18849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-18849. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 16286 [https://github.com/apache/spark/pull/16286] > Vignettes final checks for Spark 2.1 > > > Key: SPARK-18849 > URL: https://issues.apache.org/jira/browse/SPARK-18849 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Xiangrui Meng >Assignee: Felix Cheung > Fix For: 2.1.0 > > > Make a final pass over the vignettes and ensure the content is consistent. > * remove "since version" because it is not that useful for vignettes > * re-order/group the list of ML algorithms so there exists a logical ordering > * check for warnings or errors in output messages > * anything else that seems out of place
[jira] [Updated] (SPARK-18875) Fix R API doc generation by adding `DESCRIPTION` file
[ https://issues.apache.org/jira/browse/SPARK-18875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-18875: -- Assignee: Dongjoon Hyun > Fix R API doc generation by adding `DESCRIPTION` file > - > > Key: SPARK-18875 > URL: https://issues.apache.org/jira/browse/SPARK-18875 > Project: Spark > Issue Type: Bug > Components: Documentation, SparkR >Affects Versions: 1.6.3, 2.0.2 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.0.3, 2.1.0 > > > Since 1.4.0, R API document index page has a broken link on `DESCRIPTION > file`. This issue aims to fix that. > * Official Latest Website: > http://spark.apache.org/docs/latest/api/R/index.html > * Apache Spark 2.1.0-rc2: > http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html
[jira] [Resolved] (SPARK-18875) Fix R API doc generation by adding `DESCRIPTION` file
[ https://issues.apache.org/jira/browse/SPARK-18875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-18875. --- Resolution: Fixed Fix Version/s: 2.1.0 2.0.3 Issue resolved by pull request 16292 [https://github.com/apache/spark/pull/16292] > Fix R API doc generation by adding `DESCRIPTION` file > - > > Key: SPARK-18875 > URL: https://issues.apache.org/jira/browse/SPARK-18875 > Project: Spark > Issue Type: Bug > Components: Documentation, SparkR >Affects Versions: 1.6.3, 2.0.2 >Reporter: Dongjoon Hyun >Priority: Minor > Fix For: 2.0.3, 2.1.0 > > > Since 1.4.0, R API document index page has a broken link on `DESCRIPTION > file`. This issue aims to fix that. > * Official Latest Website: > http://spark.apache.org/docs/latest/api/R/index.html > * Apache Spark 2.1.0-rc2: > http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html
[jira] [Commented] (SPARK-18862) Split SparkR mllib.R into multiple files
[ https://issues.apache.org/jira/browse/SPARK-18862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750435#comment-15750435 ] Yanbo Liang commented on SPARK-18862: - Great! I found that other R packages organize source files in a flat structure, so I was a bit worried that R cannot support subdirectories. Thanks for your reference, it's very helpful. As for the naming, I think {{ml}} is not an official name; we still use {{mllib}} for the public API, see [here|https://github.com/apache/spark/pull/16241/files]. I think grouping by algorithm family is very reasonable, so I would like to use the names {{mllib-glm.R, mllib-gbt.R, mllib-randomForest.R, etc}}; what do you think? Thanks. > Split SparkR mllib.R into multiple files > > > Key: SPARK-18862 > URL: https://issues.apache.org/jira/browse/SPARK-18862 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Yanbo Liang > > SparkR mllib.R is getting bigger as we add more ML wrappers; I'd like to > split it into multiple files to make it easier to maintain: > * mllibClassification.R > * mllibRegression.R > * mllibClustering.R > * mllibFeature.R > or: > * mllib/classification.R > * mllib/regression.R > * mllib/clustering.R > * mllib/features.R > By R convention, the first way is preferred. And I'm not sure whether R > supports the second layout (will check later). Please let me know your > preference. I think the start of a new release cycle is a good opportunity to > do this, since it will involve fewer conflicts. If this proposal is > approved, I can work on it. > cc [~felixcheung] [~josephkb] [~mengxr]
[jira] [Resolved] (SPARK-18869) Add TreeNode.p that returns BaseType
[ https://issues.apache.org/jira/browse/SPARK-18869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18869. - Resolution: Fixed Fix Version/s: 2.1.0 2.0.3 > Add TreeNode.p that returns BaseType > > > Key: SPARK-18869 > URL: https://issues.apache.org/jira/browse/SPARK-18869 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.3, 2.1.0 > > > After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_] > rather than a more specific type. It would be easier for interactive > debugging to introduce a function that returns the BaseType.
[jira] [Updated] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects
[ https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17822: Fix Version/s: 2.0.3 > JVMObjectTracker.objMap may leak JVM objects > > > Key: SPARK-17822 > URL: https://issues.apache.org/jira/browse/SPARK-17822 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Yin Huai >Assignee: Xiangrui Meng > Fix For: 2.0.3, 2.1.0 > > Attachments: screenshot-1.png > > > JVMObjectTracker.objMap is used to track JVM objects for SparkR. However, we > observed that JVM objects that are no longer used are still trapped in this > map, which prevents those objects from being GCed. > It seems to make sense to use weak references (like persistentRdds in > SparkContext).
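The weak-reference approach suggested in the description, tracking objects so that the map alone does not keep them alive, can be illustrated with Python's weakref.WeakValueDictionary. This is only a sketch of the idea, not the Scala JVMObjectTracker code; the Tracked class is a hypothetical stand-in for a tracked JVM object.

```python
import gc
import weakref


class Tracked:
    """Hypothetical stand-in for a tracked JVM object."""
    def __init__(self, name):
        self.name = name


# A strong-reference map keeps every entry alive forever (the leak):
strong_map = {}
# A weak-value map drops an entry once the map holds the only remaining
# reference, so unused objects can be garbage collected:
weak_map = weakref.WeakValueDictionary()

a = Tracked("a")
b = Tracked("b")
weak_map["a"] = a            # tracked only weakly
strong_map["b"] = b          # also held strongly elsewhere
weak_map["b"] = b

del a                        # last external reference to "a" is gone
del b                        # "b" is still pinned by strong_map
gc.collect()                 # force a collection pass
```

After collection, "a" has disappeared from weak_map, while "b" survives in both maps because a strong reference still exists, which is exactly the behavior the issue asks for.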
[jira] [Updated] (SPARK-18793) SparkR vignette update: random forest
[ https://issues.apache.org/jira/browse/SPARK-18793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18793: Fix Version/s: (was: 2.1.1) (was: 2.2.0) 2.1.0 > SparkR vignette update: random forest > - > > Key: SPARK-18793 > URL: https://issues.apache.org/jira/browse/SPARK-18793 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > Fix For: 2.1.0 > > > Update vignettes to cover randomForest
[jira] [Updated] (SPARK-18865) SparkR vignettes MLP and LDA updates
[ https://issues.apache.org/jira/browse/SPARK-18865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18865: Fix Version/s: (was: 2.1.1) (was: 2.2.0) 2.1.0 > SparkR vignettes MLP and LDA updates > > > Key: SPARK-18865 > URL: https://issues.apache.org/jira/browse/SPARK-18865 > Project: Spark > Issue Type: Documentation > Components: SparkR >Reporter: Miao Wang >Assignee: Miao Wang > Fix For: 2.1.0 > > > spark.mlp doesn't provide an example. spark.lda and spark.mlp have repeated > content. > The spark.lda document is missing default values for some parameters.
[jira] [Updated] (SPARK-18751) Deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext
[ https://issues.apache.org/jira/browse/SPARK-18751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18751: Fix Version/s: (was: 2.1.1) (was: 2.2.0) 2.1.0 > Deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext > > > Key: SPARK-18751 > URL: https://issues.apache.org/jira/browse/SPARK-18751 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.0.2, 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.1.0 > > > When SparkContext.stop is called in Utils.tryOrStopSparkContext (in the > following three places), it will cause a deadlock, because the stop method needs > to wait for the thread that is running stop to exit. > - ContextCleaner.keepCleaning > - LiveListenerBus.listenerThread.run > - TaskSchedulerImpl.start
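The deadlock pattern described above, a stop() that waits on a thread which may itself be the caller, can be sketched in Python. The class and method names here are hypothetical; the guard at the end of stop() shows the kind of fix that avoids waiting on the current thread (in a JVM, joining yourself blocks forever; Python's join would raise "cannot join current thread").

```python
import threading


class MiniContext:
    """Toy context whose stop() waits for its listener thread, mirroring
    the SparkContext.stop / listener-thread interaction (names are made up)."""

    def __init__(self):
        self._stop_event = threading.Event()
        self.listener = threading.Thread(target=self._run, daemon=True)
        self.listener.start()

    def _run(self):
        # Stand-in for LiveListenerBus.listenerThread.run: block until stopped.
        self._stop_event.wait()

    def stop(self):
        self._stop_event.set()
        # Guard against the deadlock: only wait for the listener thread when
        # stop() is NOT being called from the listener thread itself.
        if threading.current_thread() is not self.listener:
            self.listener.join()
```

Without the guard, a stop() invoked from inside _run (as happens via Utils.tryOrStopSparkContext) would wait on a thread that can never exit until stop() returns.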
[jira] [Updated] (SPARK-18840) HDFSCredentialProvider throws exception in non-HDFS security environment
[ https://issues.apache.org/jira/browse/SPARK-18840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18840: Fix Version/s: (was: 2.1.1) (was: 2.2.0) 2.1.0 > HDFSCredentialProvider throws exception in non-HDFS security environment > > > Key: SPARK-18840 > URL: https://issues.apache.org/jira/browse/SPARK-18840 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.3, 2.1.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > Fix For: 2.1.0 > > > Currently in {{HDFSCredentialProvider}}, the code assumes an HDFS delegation > token exists. This is fine for an HDFS environment, but in some cloud > environments, like Azure, HDFS is not required, so it will throw an exception: > {code} > java.util.NoSuchElementException: head of empty list > at scala.collection.immutable.Nil$.head(List.scala:337) > at scala.collection.immutable.Nil$.head(List.scala:334) > at > org.apache.spark.deploy.yarn.Client.getTokenRenewalInterval(Client.scala:627) > {code} > We should also consider this situation.
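The failure above comes from calling head on an empty token list. A minimal Python sketch (the token shape and function name are assumptions for illustration, not Spark's actual API) shows the headOption-style handling that avoids the exception when no HDFS delegation token exists:

```python
def renewal_interval(tokens):
    """Return the renewal interval derived from the first HDFS delegation
    token, or None when no such token exists (e.g. on Azure, where HDFS
    is not required). Illustrates replacing an unconditional `head` with
    a headOption-style lookup; the token dict shape is hypothetical."""
    hdfs_tokens = [t for t in tokens if t.get("kind") == "HDFS_DELEGATION_TOKEN"]
    first = hdfs_tokens[0] if hdfs_tokens else None  # headOption analogue
    if first is None:
        return None  # no HDFS token: skip renewal instead of crashing
    return first["maxDate"] - first["issueDate"]
```

The caller then treats None as "no renewal needed" rather than propagating a NoSuchElementException.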
[jira] [Updated] (SPARK-18835) Do not expose shaded types in JavaTypeInference API
[ https://issues.apache.org/jira/browse/SPARK-18835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18835: Fix Version/s: (was: 2.1.1) (was: 2.2.0) 2.1.0 > Do not expose shaded types in JavaTypeInference API > --- > > Key: SPARK-18835 > URL: https://issues.apache.org/jira/browse/SPARK-18835 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.1.0 > > > Currently, {{inferDataType(TypeToken)}} is called from a different maven > module, and because we shade Guava, that sometimes leads to errors (e.g. when > running tests using maven): > {noformat} > udf3Test(test.org.apache.spark.sql.JavaUDFSuite) Time elapsed: 0.084 sec > <<< ERROR! > java.lang.NoSuchMethodError: > org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(Lcom/google/common/reflect/TypeToken;)Lscala/Tuple2; > at > test.org.apache.spark.sql.JavaUDFSuite.udf3Test(JavaUDFSuite.java:107) > Results : > Tests in error: > JavaUDFSuite.udf3Test:107 ยป NoSuchMethod > org.apache.spark.sql.catalyst.JavaTyp... > {noformat} > Instead, we shouldn't expose Guava types in these APIs.
[jira] [Updated] (SPARK-18349) Update R API documentation on ml model summary
[ https://issues.apache.org/jira/browse/SPARK-18349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18349: Fix Version/s: (was: 2.1.1) 2.1.0 > Update R API documentation on ml model summary > -- > > Key: SPARK-18349 > URL: https://issues.apache.org/jira/browse/SPARK-18349 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Miao Wang > Fix For: 2.1.0 > > > It has been discovered that there is a fair bit of inconsistency in the > documentation of summary functions, eg. > {code} > #' @return \code{summary} returns a summary object of the fitted model, a > list of components > #' including formula, number of features, list of features, feature > importances, number of > #' trees, and tree weights > setMethod("summary", signature(object = "GBTRegressionModel") > {code} > For instance, what should be listed for the return value? Should it be a name > or a phrase, or should it be a list of items; and should there be a longer > description of what they mean, or a reference link to the Scala doc? > We will need to review this for all model summary implementations in mllib.R
[jira] [Updated] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects
[ https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17822: Fix Version/s: (was: 2.1.1) (was: 2.0.3) (was: 2.2.0) 2.1.0 > JVMObjectTracker.objMap may leak JVM objects > > > Key: SPARK-17822 > URL: https://issues.apache.org/jira/browse/SPARK-17822 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Yin Huai >Assignee: Xiangrui Meng > Fix For: 2.1.0 > > Attachments: screenshot-1.png > > > JVMObjectTracker.objMap is used to track JVM objects for SparkR. However, we > observed that JVM objects that are no longer used are still trapped in this > map, which prevents those objects from being GCed. > It seems to make sense to use weak references (like persistentRdds in > SparkContext).
[jira] [Updated] (SPARK-18681) Throw Filtering is supported only on partition keys of type string exception
[ https://issues.apache.org/jira/browse/SPARK-18681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18681: Fix Version/s: (was: 2.1.1) 2.1.0 > Throw Filtering is supported only on partition keys of type string exception > > > Key: SPARK-18681 > URL: https://issues.apache.org/jira/browse/SPARK-18681 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang > Fix For: 2.1.0 > > > Cloudera puts > {{/var/run/cloudera-scm-agent/process/15000-hive-HIVEMETASTORE/hive-site.xml}} > as the configuration file for the Hive Metastore Server, where > {{hive.metastore.try.direct.sql=false}}. But Spark reads the gateway > configuration file and gets the default value > {{hive.metastore.try.direct.sql=true}}. We should use the {{getMetaConf}} or > {{getMSC.getConfigValue}} method to obtain the original configuration from > the Hive Metastore Server. > {noformat} > spark-sql> CREATE TABLE test (value INT) PARTITIONED BY (part INT); > Time taken: 0.221 seconds > spark-sql> select * from test where part=1 limit 10; > 16/12/02 08:33:45 ERROR thriftserver.SparkSQLDriver: Failed in [select * from > test where part=1 limit 10] > java.lang.RuntimeException: Caught Hive MetaException attempting to get > partition metadata by filter from Hive. You can set the Spark configuration > setting spark.sql.hive.manageFilesourcePartitions to false to work around > this problem, however this will result in degraded performance. 
Please report > a bug: https://issues.apache.org/jira/browse/SPARK > at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:610) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:549) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:547) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:547) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:954) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:938) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:938) > at > org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:156) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150) > at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2435) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) > at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:295) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:134) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:133) > at >
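The exception message quoted above names a workaround; as a sketch in spark-sql it would look like the following (a config fragment taken from the message itself, reusing the example table from the report):

```sql
-- Workaround from the error message: disable Spark-managed file source
-- partitions, at the cost of degraded partition-pruning performance.
SET spark.sql.hive.manageFilesourcePartitions=false;
SELECT * FROM test WHERE part = 1 LIMIT 10;
```

This only trades the MetaException for slower planning; the proper fix proposed in the report is to read the original configuration from the Hive Metastore Server.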
[jira] [Updated] (SPARK-18797) Update spark.logit in sparkr-vignettes
[ https://issues.apache.org/jira/browse/SPARK-18797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18797: Fix Version/s: (was: 2.1.1) (was: 2.2.0) 2.1.0 > Update spark.logit in sparkr-vignettes > -- > > Key: SPARK-18797 > URL: https://issues.apache.org/jira/browse/SPARK-18797 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Miao Wang >Assignee: Miao Wang > Fix For: 2.1.0 > > > spark.logit was added in 2.1. We need to update sparkr-vignettes to reflect > the changes. This is part of the SparkR QA work.
[jira] [Updated] (SPARK-18812) Clarify "Spark ML"
[ https://issues.apache.org/jira/browse/SPARK-18812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18812: Fix Version/s: (was: 2.1.1) (was: 2.2.0) 2.1.0 > Clarify "Spark ML" > -- > > Key: SPARK-18812 > URL: https://issues.apache.org/jira/browse/SPARK-18812 > Project: Spark > Issue Type: Documentation > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 2.1.0 > > > It is useful to add an FAQ entry to explain "Spark ML" and reduce confusion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched
[ https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18816: Fix Version/s: (was: 2.1.1) (was: 2.2.0) 2.1.0 > executor page fails to show log links if executors are added after an app is > launched > - > > Key: SPARK-18816 > URL: https://issues.apache.org/jira/browse/SPARK-18816 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: Yin Huai >Assignee: Alex Bozarth >Priority: Blocker > Fix For: 2.1.0 > > Attachments: screenshot-1.png > > > How to reproduce with standalone mode: > 1. Launch a spark master > 2. Launch a spark shell. At this point, there is no executor associated with > this application. > 3. Launch a slave. Now, there is an executor assigned to the spark shell. > However, there is no link to stdout/stderr on the executor page (please see > https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18811) Stream Source resolution should happen in StreamExecution thread, not main thread
[ https://issues.apache.org/jira/browse/SPARK-18811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18811: Fix Version/s: (was: 2.1.1) (was: 2.2.0) 2.1.0 > Stream Source resolution should happen in StreamExecution thread, not main > thread > - > > Key: SPARK-18811 > URL: https://issues.apache.org/jira/browse/SPARK-18811 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz > Fix For: 2.1.0 > > > When you start a stream, resolving its source (for example, resolving > partition columns) can take a long time. That resolution should not block the > main thread on which `query.start()` was called. It should happen on the > stream-execution thread, possibly before starting any triggers.
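The idea described above, resolving the source on the stream's own execution thread so that `query.start()` returns immediately, can be sketched in Python (illustrative names only, not Spark's actual classes):

```python
import threading
import time

class StreamQuery:
    """Sketch of deferring slow source resolution off the caller's thread."""
    def __init__(self, resolve_source):
        self._resolve_source = resolve_source
        self.source = None
        self._ready = threading.Event()

    def start(self):
        # Resolution (e.g. discovering partition columns) may be slow, so it
        # runs on the stream-execution thread; start() returns immediately.
        threading.Thread(target=self._run, daemon=True).start()
        return self

    def _run(self):
        self.source = self._resolve_source()
        self._ready.set()
        # ... the trigger loop would begin here ...

    def await_source(self, timeout=None):
        self._ready.wait(timeout)
        return self.source

def slow_resolution():
    time.sleep(0.5)  # stand-in for listing files / resolving partitions
    return "resolved-source"

t0 = time.time()
q = StreamQuery(slow_resolution).start()
startup = time.time() - t0  # far less than the 0.5 s the resolution takes
print(q.await_source())     # blocks until the worker thread finishes
```

The caller observes a near-instant `start()`, while anything that actually needs the resolved source waits on the event set by the execution thread.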
[jira] [Updated] (SPARK-18760) Provide consistent format output for all file formats
[ https://issues.apache.org/jira/browse/SPARK-18760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18760: Fix Version/s: (was: 2.1.1) (was: 2.2.0) 2.1.0 > Provide consistent format output for all file formats > - > > Key: SPARK-18760 > URL: https://issues.apache.org/jira/browse/SPARK-18760 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.1.0 > > > We currently rely on FileFormat implementations to override toString in order > to get a proper explain output. It'd be better to just depend on shortName > for those. > Before: > {noformat} > scala> spark.read.text("test.text").explain() > == Physical Plan == > *FileScan text [value#15] Batched: false, Format: > org.apache.spark.sql.execution.datasources.text.TextFileFormat@xyz, Location: > InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > {noformat} > After: > {noformat} > scala> spark.read.text("test.text").explain() > == Physical Plan == > *FileScan text [value#15] Batched: false, Format: text, Location: > InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18325) SparkR 2.1 QA: Check for new R APIs requiring example code
[ https://issues.apache.org/jira/browse/SPARK-18325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18325: Fix Version/s: (was: 2.1.1) (was: 2.2.0) 2.1.0 > SparkR 2.1 QA: Check for new R APIs requiring example code > -- > > Key: SPARK-18325 > URL: https://issues.apache.org/jira/browse/SPARK-18325 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > Fix For: 2.1.0 > > > Audit list of new features added to MLlib's R API, and see which major items > are missing example code (in the examples folder). We do not need examples > for everything, only for major items such as new algorithms. > For any such items: > * Create a JIRA for that feature, and assign it to the author of the feature > (or yourself if interested). > * Link it to (a) the original JIRA which introduced that feature ("related > to") and (b) to this JIRA ("requires"). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18590) R - Include package vignettes and help pages, build source package in Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-18590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18590: Fix Version/s: (was: 2.1.1) 2.1.0 > R - Include package vignettes and help pages, build source package in Spark > distribution > > > Key: SPARK-18590 > URL: https://issues.apache.org/jira/browse/SPARK-18590 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung > Fix For: 2.1.0 > > > We should include in Spark distribution the built source package for SparkR. > This will enable help and vignettes when the package is used. Also this > source package is what we would release to CRAN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18815) NPE when collecting column stats for string/binary column having only null values
[ https://issues.apache.org/jira/browse/SPARK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18815: Fix Version/s: (was: 2.1.1) (was: 2.2.0) 2.1.0 > NPE when collecting column stats for string/binary column having only null > values > - > > Key: SPARK-18815 > URL: https://issues.apache.org/jira/browse/SPARK-18815 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.1 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18794) SparkR vignette update: gbt
[ https://issues.apache.org/jira/browse/SPARK-18794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18794: Fix Version/s: (was: 2.1.1) (was: 2.2.0) 2.1.0 > SparkR vignette update: gbt > --- > > Key: SPARK-18794 > URL: https://issues.apache.org/jira/browse/SPARK-18794 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > Fix For: 2.1.0 > > > Update vignettes to cover gradient boosted trees -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18795) SparkR vignette update: ksTest
[ https://issues.apache.org/jira/browse/SPARK-18795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18795: Fix Version/s: (was: 2.1.1) (was: 2.2.0) 2.1.0 > SparkR vignette update: ksTest > -- > > Key: SPARK-18795 > URL: https://issues.apache.org/jira/browse/SPARK-18795 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > Fix For: 2.1.0 > > > Update vignettes to cover ksTest -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18807) Should suppress output print for calls to JVM methods with void return values
[ https://issues.apache.org/jira/browse/SPARK-18807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18807: Fix Version/s: (was: 2.1.1) 2.1.0 > Should suppress output print for calls to JVM methods with void return values > - > > Key: SPARK-18807 > URL: https://issues.apache.org/jira/browse/SPARK-18807 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0, 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Minor > Fix For: 2.1.0 > > > Several SparkR APIs that call into JVM methods with void return values get > their NULL result printed out, especially when running in a REPL or IDE. > example: > > setLogLevel("WARN") > NULL > We should fix this to make the output clearer.
[jira] [Updated] (SPARK-18628) Update handle invalid documentation string
[ https://issues.apache.org/jira/browse/SPARK-18628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18628: Fix Version/s: (was: 2.1.1) (was: 2.2.0) 2.1.0 > Update handle invalid documentation string > -- > > Key: SPARK-18628 > URL: https://issues.apache.org/jira/browse/SPARK-18628 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Assignee: Krishna Kalyan >Priority: Trivial > Labels: starter > Fix For: 2.1.0 > > > The handleInvalid parameter documentation string currently doesn't have > quotes around the options. After SPARK-18366 is in, it would be good to > update both the Scala param and the Python param to have quotes around the > options, making it easier for users to read.
[jira] [Updated] (SPARK-18810) SparkR install.spark does not work for RCs, snapshots
[ https://issues.apache.org/jira/browse/SPARK-18810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18810: Fix Version/s: (was: 2.1.1) (was: 2.2.0) 2.1.0 > SparkR install.spark does not work for RCs, snapshots > - > > Key: SPARK-18810 > URL: https://issues.apache.org/jira/browse/SPARK-18810 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.2, 2.1.0 >Reporter: Shivaram Venkataraman >Assignee: Felix Cheung > Fix For: 2.1.0 > > > We publish source archives of the SparkR package now in RCs and in nightly > snapshot builds. One of the problems that still remains is that > `install.spark` does not work for these as it looks for the final Spark > version to be present in the apache download mirrors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18774) Ignore non-existing files when ignoreCorruptFiles is enabled
[ https://issues.apache.org/jira/browse/SPARK-18774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18774: Fix Version/s: (was: 2.1.1) (was: 2.2.0) 2.1.0 > Ignore non-existing files when ignoreCorruptFiles is enabled > > > Key: SPARK-18774 > URL: https://issues.apache.org/jira/browse/SPARK-18774 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18790) Keep a general offset history of stream batches
[ https://issues.apache.org/jira/browse/SPARK-18790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18790: Fix Version/s: (was: 2.1.1) 2.1.0 > Keep a general offset history of stream batches > --- > > Key: SPARK-18790 > URL: https://issues.apache.org/jira/browse/SPARK-18790 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Tyson Condie >Assignee: Tyson Condie > Fix For: 2.0.3, 2.1.0 > > > Instead of only keeping the minimum number of offsets around, we should keep > enough information to allow us to roll back n batches and re-execute the > stream starting from a given point. In particular, we should create a config > in SQLConf, {{spark.sql.streaming.retainedBatches}}, that defaults to 100, and > ensure that we keep enough log files in the following places to roll back the > specified number of batches: > * the offsets that are present in each batch > * versions of the state store > * the file lists stored for the FileStreamSource > * the metadata log stored by the FileStreamSink
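The retention policy described above can be sketched in Python (a minimal illustration only; `retained_batches` mirrors the proposed `spark.sql.streaming.retainedBatches`, and the metadata dict stands in for the offset, state-store, and file logs):

```python
class OffsetLog:
    """Keeps metadata for the most recent N batches, purging older entries."""
    def __init__(self, retained_batches=100):
        self.retained_batches = retained_batches
        self.entries = {}  # batch_id -> offsets / state-store version / file list

    def add(self, batch_id, metadata):
        self.entries[batch_id] = metadata
        # Purge everything older than (latest - retained_batches), so we can
        # still roll back up to `retained_batches` batches and re-execute.
        threshold = batch_id - self.retained_batches
        for old_id in [b for b in self.entries if b < threshold]:
            del self.entries[old_id]

log = OffsetLog(retained_batches=3)
for batch in range(10):
    log.add(batch, {"offsets": batch * 100})

print(sorted(log.entries))  # [6, 7, 8, 9]: the current batch plus the last 3
```

Each of the four logs listed above would apply the same threshold, so that any batch within the retained window can be replayed end to end.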
[jira] [Updated] (SPARK-18745) java.lang.IndexOutOfBoundsException running query 68 Spark SQL on (100TB)
[ https://issues.apache.org/jira/browse/SPARK-18745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18745: Fix Version/s: (was: 2.1.1) 2.1.0 > java.lang.IndexOutOfBoundsException running query 68 Spark SQL on (100TB) > - > > Key: SPARK-18745 > URL: https://issues.apache.org/jira/browse/SPARK-18745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: JESSE CHEN >Assignee: Kazuaki Ishizaki >Priority: Critical > Fix For: 2.0.3, 2.1.0 > > > Running query 68 with decreased executor memory (using 12GB executors instead > of 24GB) on 100TB parquet database using the Spark master dated 11/04 gave > IndexOutOfBoundsException. > The query is as follows: > {noformat} > [select c_last_name >,c_first_name >,ca_city >,bought_city >,ss_ticket_number >,extended_price >,extended_tax >,list_price > from (select ss_ticket_number > ,ss_customer_sk > ,ca_city bought_city > ,sum(ss_ext_sales_price) extended_price > ,sum(ss_ext_list_price) list_price > ,sum(ss_ext_tax) extended_tax >from store_sales >,date_dim >,store >,household_demographics >,customer_address >where store_sales.ss_sold_date_sk = date_dim.d_date_sk > and store_sales.ss_store_sk = store.s_store_sk > and store_sales.ss_hdemo_sk = household_demographics.hd_demo_sk > and store_sales.ss_addr_sk = customer_address.ca_address_sk > and date_dim.d_dom between 1 and 2 > and (household_demographics.hd_dep_count = 8 or > household_demographics.hd_vehicle_count= -1) > and date_dim.d_year in (2000,2000+1,2000+2) > and store.s_city in ('Plainview','Rogers') >group by ss_ticket_number >,ss_customer_sk >,ss_addr_sk,ca_city) dn > ,customer > ,customer_address current_addr > where ss_customer_sk = c_customer_sk >and customer.c_current_addr_sk = current_addr.ca_address_sk >and current_addr.ca_city <> bought_city > order by c_last_name > ,ss_ticket_number > limit 100] > {noformat} > Spark output that showed the exception: > {noformat} > 
org.apache.spark.SparkException: Exception thrown in awaitResult: > at > org.apache.spark.util.ThreadUtils$.awaitResultInForkJoinSafely(ThreadUtils.scala:215) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:123) > at > org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecuteBroadcast(Exchange.scala:61) > at > org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:231) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:123) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82) > at > 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:36) > at > org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:68) > at >
[jira] [Updated] (SPARK-16589) Chained cartesian produces incorrect number of records
[ https://issues.apache.org/jira/browse/SPARK-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16589: Fix Version/s: (was: 2.1.1) 2.1.0 > Chained cartesian produces incorrect number of records > -- > > Key: SPARK-16589 > URL: https://issues.apache.org/jira/browse/SPARK-16589 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz >Assignee: Andrew Ray > Labels: correctness > Fix For: 2.0.3, 2.1.0 > > > Chaining cartesian calls in PySpark yields fewer records than expected. It > can be reproduced as follows: > {code} > rdd = sc.parallelize(range(10), 1) > rdd.cartesian(rdd).cartesian(rdd).count() > ## 355 > rdd.cartesian(rdd).cartesian(rdd).distinct().count() > ## 251 > {code} > It looks like it is related to serialization. If we reserialize after the > initial cartesian: > {code} > rdd.cartesian(rdd)._reserialize(BatchedSerializer(PickleSerializer(), > 1)).cartesian(rdd).count() > ## 1000 > {code} > or insert an identity map: > {code} > rdd.cartesian(rdd).map(lambda x: x).cartesian(rdd).count() > ## 1000 > {code} > it yields correct results.
[jira] [Updated] (SPARK-18843) Fix timeout in awaitResultInForkJoinSafely
[ https://issues.apache.org/jira/browse/SPARK-18843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18843: Fix Version/s: (was: 2.1.1) 2.1.0 > Fix timeout in awaitResultInForkJoinSafely > -- > > Key: SPARK-18843 > URL: https://issues.apache.org/jira/browse/SPARK-18843 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2, 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.3, 2.1.0 > > > Master has the fix in https://github.com/apache/spark/pull/16230. However, > since we don't merge this PR into master because it's too risky, we should at > least fix the timeout value for 2.0 and 2.1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18856) Newly created catalog table assumed to have 0 rows and 0 bytes
[ https://issues.apache.org/jira/browse/SPARK-18856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18856. - Resolution: Fixed Assignee: Wenchen Fan Fix Version/s: 2.1.0 > Newly created catalog table assumed to have 0 rows and 0 bytes > -- > > Key: SPARK-18856 > URL: https://issues.apache.org/jira/browse/SPARK-18856 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin >Assignee: Wenchen Fan >Priority: Blocker > Fix For: 2.1.0 > > > {code} > scala> spark.range(100).selectExpr("id % 10 p", > "id").write.partitionBy("p").format("json").saveAsTable("testjson") > scala> spark.table("testjson").queryExecution.optimizedPlan.statistics > res6: org.apache.spark.sql.catalyst.plans.logical.Statistics = > Statistics(sizeInBytes=0, isBroadcastable=false) > {code} > It shouldn't be 0. The issue is that in DataSource.scala, we do: > {code} > val fileCatalog = if > (sparkSession.sqlContext.conf.manageFilesourcePartitions && > catalogTable.isDefined && > catalogTable.get.tracksPartitionsInCatalog) { > new CatalogFileIndex( > sparkSession, > catalogTable.get, > catalogTable.get.stats.map(_.sizeInBytes.toLong).getOrElse(0L)) > } else { > new InMemoryFileIndex(sparkSession, globbedPaths, options, > Some(partitionSchema)) > } > {code} > We shouldn't use 0L as the fallback. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
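The direction of the fix, falling back to something conservative instead of 0L when a table has no statistics, can be sketched in Python (a hypothetical illustration, not Spark's actual code; `default_size_in_bytes` plays the role a config such as `spark.sql.defaultSizeInBytes` would):

```python
def table_size_in_bytes(stats, default_size_in_bytes):
    """Size estimate used by the planner, e.g. for broadcast-join decisions.

    Returning 0 for a table with no statistics would make every such table
    look broadcastable; falling back to a large default is the safe choice.
    """
    if stats is not None and stats.get("sizeInBytes") is not None:
        return stats["sizeInBytes"]
    return default_size_in_bytes

DEFAULT = 2 ** 63 - 1  # "size unknown, assume huge"

print(table_size_in_bytes({"sizeInBytes": 4096}, DEFAULT))  # 4096
print(table_size_in_bytes(None, DEFAULT) > 0)               # True
```

The key point is that an absent statistic should degrade to "unknown and large", never to "empty", so the optimizer errs toward shuffle joins instead of broadcasting an arbitrarily big table.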
[jira] [Assigned] (SPARK-18875) Fix R API doc generation by adding `DESCRIPTION` file
[ https://issues.apache.org/jira/browse/SPARK-18875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18875: Assignee: Apache Spark > Fix R API doc generation by adding `DESCRIPTION` file > - > > Key: SPARK-18875 > URL: https://issues.apache.org/jira/browse/SPARK-18875 > Project: Spark > Issue Type: Bug > Components: Documentation, SparkR >Affects Versions: 1.6.3, 2.0.2 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > Since 1.4.0, R API document index page has a broken link on `DESCRIPTION > file`. This issue aims to fix that. > * Official Latest Website: > http://spark.apache.org/docs/latest/api/R/index.html > * Apache Spark 2.1.0-rc2: > http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18875) Fix R API doc generation by adding `DESCRIPTION` file
[ https://issues.apache.org/jira/browse/SPARK-18875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18875: Assignee: (was: Apache Spark) > Fix R API doc generation by adding `DESCRIPTION` file > - > > Key: SPARK-18875 > URL: https://issues.apache.org/jira/browse/SPARK-18875 > Project: Spark > Issue Type: Bug > Components: Documentation, SparkR >Affects Versions: 1.6.3, 2.0.2 >Reporter: Dongjoon Hyun >Priority: Minor > > Since 1.4.0, R API document index page has a broken link on `DESCRIPTION > file`. This issue aims to fix that. > * Official Latest Website: > http://spark.apache.org/docs/latest/api/R/index.html > * Apache Spark 2.1.0-rc2: > http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18875) Fix R API doc generation by adding `DESCRIPTION` file
[ https://issues.apache.org/jira/browse/SPARK-18875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750404#comment-15750404 ] Apache Spark commented on SPARK-18875: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/16292 > Fix R API doc generation by adding `DESCRIPTION` file > - > > Key: SPARK-18875 > URL: https://issues.apache.org/jira/browse/SPARK-18875 > Project: Spark > Issue Type: Bug > Components: Documentation, SparkR >Affects Versions: 1.6.3, 2.0.2 >Reporter: Dongjoon Hyun >Priority: Minor > > Since 1.4.0, R API document index page has a broken link on `DESCRIPTION > file`. This issue aims to fix that. > * Official Latest Website: > http://spark.apache.org/docs/latest/api/R/index.html > * Apache Spark 2.1.0-rc2: > http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18875) Fix R API doc generation by adding `DESCRIPTION` file
[ https://issues.apache.org/jira/browse/SPARK-18875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-18875: -- Description: Since 1.4.0, R API document index page has a broken link on `DESCRIPTION file`. This issue aims to fix that. * Official Latest Website: http://spark.apache.org/docs/latest/api/R/index.html * Apache Spark 2.1.0-rc2: http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html was: Currently, R API document index page has a broken link on `DESCRIPTION file`. This issue aims to fix that. * Official Latest Website: http://spark.apache.org/docs/latest/api/R/index.html * Apache Spark 2.1.0-rc2: http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html > Fix R API doc generation by adding `DESCRIPTION` file > - > > Key: SPARK-18875 > URL: https://issues.apache.org/jira/browse/SPARK-18875 > Project: Spark > Issue Type: Bug > Components: Documentation, SparkR >Affects Versions: 1.6.3, 2.0.2 >Reporter: Dongjoon Hyun >Priority: Minor > > Since 1.4.0, R API document index page has a broken link on `DESCRIPTION > file`. This issue aims to fix that. > * Official Latest Website: > http://spark.apache.org/docs/latest/api/R/index.html > * Apache Spark 2.1.0-rc2: > http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18875) Fix R API doc generation by adding `DESCRIPTION` file
Dongjoon Hyun created SPARK-18875: - Summary: Fix R API doc generation by adding `DESCRIPTION` file Key: SPARK-18875 URL: https://issues.apache.org/jira/browse/SPARK-18875 Project: Spark Issue Type: Bug Components: Documentation, SparkR Affects Versions: 2.0.2, 1.6.3 Reporter: Dongjoon Hyun Priority: Minor Currently, R API document index page has a broken link on `DESCRIPTION file`. This issue aims to fix that. * Official Latest Website: http://spark.apache.org/docs/latest/api/R/index.html * Apache Spark 2.1.0-rc2: http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18281) toLocalIterator yields time out error on pyspark2
[ https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750281#comment-15750281 ] Liang-Chi Hsieh commented on SPARK-18281: - [~mwdus...@us.ibm.com] BTW, I updated the fixing and if you have time to test it again, that would be great. Thank you. > toLocalIterator yields time out error on pyspark2 > - > > Key: SPARK-18281 > URL: https://issues.apache.org/jira/browse/SPARK-18281 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.1 > Environment: Ubuntu 14.04.5 LTS > Driver: AWS M4.XLARGE > Slaves: AWS M4.4.XLARGE > mesos 1.0.1 > spark 2.0.1 > pyspark >Reporter: Luke Miner > > I run the example straight out of the api docs for toLocalIterator and it > gives a time out exception: > {code} > from pyspark import SparkContext > sc = SparkContext() > rdd = sc.parallelize(range(10)) > [x for x in rdd.toLocalIterator()] > {code} > conf file: > spark.driver.maxResultSize 6G > spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G > -XX:+HeapDumpOnOutOfMemoryError > spark.executor.memory 16G > spark.executor.uri foo/spark-2.0.1-bin-hadoop2.7.tgz > spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem > spark.hadoop.fs.s3a.buffer.dir /raid0/spark > spark.hadoop.fs.s3n.buffer.dir /raid0/spark > spark.hadoop.fs.s3a.connection.timeout 50 > spark.hadoop.fs.s3n.multipart.uploads.enabled true > spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2 > spark.hadoop.parquet.block.size 2147483648 > spark.hadoop.parquet.enable.summary-metadatafalse > spark.jars.packages > com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34 > spark.local.dir /raid0/spark > spark.mesos.coarse false > spark.mesos.constraints priority:1 > spark.network.timeout 600 > spark.rpc.message.maxSize500 > spark.speculation false > spark.sql.parquet.mergeSchema false > spark.sql.planner.externalSort true > spark.submit.deployMode client > spark.task.cpus 1 > Exception here: 
> {code} > --- > timeout Traceback (most recent call last) > in () > 2 sc = SparkContext() > 3 rdd = sc.parallelize(range(10)) > > 4 [x for x in rdd.toLocalIterator()] > /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in > _load_from_socket(port, serializer) > 140 try: > 141 rf = sock.makefile("rb", 65536) > --> 142 for item in serializer.load_stream(rf): > 143 yield item > 144 finally: > /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in > load_stream(self, stream) > 137 while True: > 138 try: > --> 139 yield self._read_with_length(stream) > 140 except EOFError: > 141 return > /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in > _read_with_length(self, stream) > 154 > 155 def _read_with_length(self, stream): > --> 156 length = read_int(stream) > 157 if length == SpecialLengths.END_OF_DATA_SECTION: > 158 raise EOFError > /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in > read_int(stream) > 541 > 542 def read_int(stream): > --> 543 length = stream.read(4) > 544 if not length: > 545 raise EOFError > /usr/lib/python2.7/socket.pyc in read(self, size) > 378 # fragmentation issues on many platforms. > 379 try: > --> 380 data = self._sock.recv(left) > 381 except error, e: > 382 if e.args[0] == EINTR: > timeout: timed out > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18281) toLocalIterator yields time out error on pyspark2
[ https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750246#comment-15750246 ] Liang-Chi Hsieh commented on SPARK-18281: - [~mwdus...@us.ibm.com] Thanks for this test case! It is useful to me. However I need to increase the partition number to 1000 to reproduce this issue. The additional partitions will increase the time to materialize RDD elements and so cause timeout. I think we can't set a timeout to the socket reading operation like currently doing as the RDD materialization time is unpredictable. I will keep the connection timeout untouched but unset timeout for socket reading. > toLocalIterator yields time out error on pyspark2 > - > > Key: SPARK-18281 > URL: https://issues.apache.org/jira/browse/SPARK-18281 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.1 > Environment: Ubuntu 14.04.5 LTS > Driver: AWS M4.XLARGE > Slaves: AWS M4.4.XLARGE > mesos 1.0.1 > spark 2.0.1 > pyspark >Reporter: Luke Miner > > I run the example straight out of the api docs for toLocalIterator and it > gives a time out exception: > {code} > from pyspark import SparkContext > sc = SparkContext() > rdd = sc.parallelize(range(10)) > [x for x in rdd.toLocalIterator()] > {code} > conf file: > spark.driver.maxResultSize 6G > spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G > -XX:+HeapDumpOnOutOfMemoryError > spark.executor.memory 16G > spark.executor.uri foo/spark-2.0.1-bin-hadoop2.7.tgz > spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem > spark.hadoop.fs.s3a.buffer.dir /raid0/spark > spark.hadoop.fs.s3n.buffer.dir /raid0/spark > spark.hadoop.fs.s3a.connection.timeout 50 > spark.hadoop.fs.s3n.multipart.uploads.enabled true > spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2 > spark.hadoop.parquet.block.size 2147483648 > spark.hadoop.parquet.enable.summary-metadatafalse > spark.jars.packages > 
com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34 > spark.local.dir /raid0/spark > spark.mesos.coarse false > spark.mesos.constraints priority:1 > spark.network.timeout 600 > spark.rpc.message.maxSize500 > spark.speculation false > spark.sql.parquet.mergeSchema false > spark.sql.planner.externalSort true > spark.submit.deployMode client > spark.task.cpus 1 > Exception here: > {code} > --- > timeout Traceback (most recent call last) > in () > 2 sc = SparkContext() > 3 rdd = sc.parallelize(range(10)) > > 4 [x for x in rdd.toLocalIterator()] > /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in > _load_from_socket(port, serializer) > 140 try: > 141 rf = sock.makefile("rb", 65536) > --> 142 for item in serializer.load_stream(rf): > 143 yield item > 144 finally: > /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in > load_stream(self, stream) > 137 while True: > 138 try: > --> 139 yield self._read_with_length(stream) > 140 except EOFError: > 141 return > /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in > _read_with_length(self, stream) > 154 > 155 def _read_with_length(self, stream): > --> 156 length = read_int(stream) > 157 if length == SpecialLengths.END_OF_DATA_SECTION: > 158 raise EOFError > /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in > read_int(stream) > 541 > 542 def read_int(stream): > --> 543 length = stream.read(4) > 544 if not length: > 545 raise EOFError > /usr/lib/python2.7/socket.pyc in read(self, size) > 378 # fragmentation issues on many platforms. > 379 try: > --> 380 data = self._sock.recv(left) > 381 except error, e: > 382 if e.args[0] == EINTR: > timeout: timed out > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
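The fix Liang-Chi describes above (keep the connection timeout, but unset the timeout for socket reads, since RDD materialization time is unpredictable) can be sketched in plain Python. `_load_from_socket` and `read_stream` here are illustrative stand-ins, not PySpark's actual internals:

```python
import socket

def _load_from_socket(port, read_stream, connect_timeout=15):
    # Hypothetical sketch of the proposed behavior: time-bound the
    # *connect*, then block indefinitely while *reading*, because how
    # long the JVM side takes to materialize partitions is unpredictable.
    sock = socket.create_connection(("127.0.0.1", port), timeout=connect_timeout)
    sock.settimeout(None)  # unset the read timeout after connecting
    try:
        rf = sock.makefile("rb", 65536)
        for item in read_stream(rf):
            yield item
    finally:
        sock.close()
```

With `settimeout(None)` the read blocks as long as needed, so a slow upstream producer no longer raises `timeout: timed out` mid-stream.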
[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs
[ https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750217#comment-15750217 ] Sital Kedia commented on SPARK-18838: - [~zsxwing] - Its not only the ExecutorAllocationManager, other critical listeners like HeartbeatReceiver also depend on it. In addition to that there might be some latency sensitive user added listener. Making the event processing faster by multi-threading will fix all theses issues. I have an initial version of the PR for this, would appreciate if you can take a look and give feedback on the overall design. > High latency of event processing for large jobs > --- > > Key: SPARK-18838 > URL: https://issues.apache.org/jira/browse/SPARK-18838 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > Currently we are observing the issue of very high event processing delay in > driver's `ListenerBus` for large jobs with many tasks. Many critical > component of the scheduler like `ExecutorAllocationManager`, > `HeartbeatReceiver` depend on the `ListenerBus` events and these delay is > causing job failure. For example, a significant delay in receiving the > `SparkListenerTaskStart` might cause `ExecutorAllocationManager` manager to > remove an executor which is not idle. The event processor in `ListenerBus` > is a single thread which loops through all the Listeners for each event and > processes each event synchronously > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94. > > The single threaded processor often becomes the bottleneck for large jobs. > In addition to that, if one of the Listener is very slow, all the listeners > will pay the price of delay incurred by the slow listener. > To solve the above problems, we plan to have a per listener single threaded > executor service and separate event queue. 
That way we are not bottlenecked > by the single-threaded event processor, and critical listeners will not > be penalized by slow listeners. The downside of this approach is that a separate > event queue per listener will increase the driver's memory footprint. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
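The per-listener design described in the issue (one single-threaded worker and one queue per listener) can be sketched generically. This is a minimal Python stand-in for the proposed Scala change; the class and method names are invented, not Spark's actual API:

```python
import queue
import threading

class PerListenerBus:
    """Each listener gets its own queue and worker thread, so a slow
    listener delays only itself, never the other listeners."""

    def __init__(self):
        self._workers = []

    def add_listener(self, listener):
        q = queue.Queue()
        t = threading.Thread(target=self._drain, args=(q, listener), daemon=True)
        t.start()
        self._workers.append((q, t))

    def _drain(self, q, listener):
        while True:
            event = q.get()
            if event is None:  # poison pill: stop this worker
                return
            listener(event)

    def post(self, event):
        # Posting is an O(#listeners) enqueue; each listener then
        # processes the event at its own pace on its own thread.
        for q, _ in self._workers:
            q.put(event)

    def stop(self):
        for q, _ in self._workers:
            q.put(None)
        for _, t in self._workers:
            t.join()
```

The memory-footprint downside mentioned above is visible here: every in-flight event is buffered once per listener queue rather than once globally.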
[jira] [Assigned] (SPARK-18838) High latency of event processing for large jobs
[ https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18838: Assignee: (was: Apache Spark) > High latency of event processing for large jobs > --- > > Key: SPARK-18838 > URL: https://issues.apache.org/jira/browse/SPARK-18838 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > Currently we are observing the issue of very high event processing delay in > driver's `ListenerBus` for large jobs with many tasks. Many critical > component of the scheduler like `ExecutorAllocationManager`, > `HeartbeatReceiver` depend on the `ListenerBus` events and these delay is > causing job failure. For example, a significant delay in receiving the > `SparkListenerTaskStart` might cause `ExecutorAllocationManager` manager to > remove an executor which is not idle. The event processor in `ListenerBus` > is a single thread which loops through all the Listeners for each event and > processes each event synchronously > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94. > > The single threaded processor often becomes the bottleneck for large jobs. > In addition to that, if one of the Listener is very slow, all the listeners > will pay the price of delay incurred by the slow listener. > To solve the above problems, we plan to have a per listener single threaded > executor service and separate event queue. That way we are not bottlenecked > by the single threaded event processor and also critical listeners will not > be penalized by the slow listeners. The downside of this approach is separate > event queue per listener will increase the driver memory footprint. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs
[ https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750209#comment-15750209 ] Apache Spark commented on SPARK-18838: -- User 'sitalkedia' has created a pull request for this issue: https://github.com/apache/spark/pull/16291 > High latency of event processing for large jobs > --- > > Key: SPARK-18838 > URL: https://issues.apache.org/jira/browse/SPARK-18838 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > Currently we are observing the issue of very high event processing delay in > driver's `ListenerBus` for large jobs with many tasks. Many critical > component of the scheduler like `ExecutorAllocationManager`, > `HeartbeatReceiver` depend on the `ListenerBus` events and these delay is > causing job failure. For example, a significant delay in receiving the > `SparkListenerTaskStart` might cause `ExecutorAllocationManager` manager to > remove an executor which is not idle. The event processor in `ListenerBus` > is a single thread which loops through all the Listeners for each event and > processes each event synchronously > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94. > > The single threaded processor often becomes the bottleneck for large jobs. > In addition to that, if one of the Listener is very slow, all the listeners > will pay the price of delay incurred by the slow listener. > To solve the above problems, we plan to have a per listener single threaded > executor service and separate event queue. That way we are not bottlenecked > by the single threaded event processor and also critical listeners will not > be penalized by the slow listeners. The downside of this approach is separate > event queue per listener will increase the driver memory footprint. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-18838) High latency of event processing for large jobs
[ https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18838: Assignee: Apache Spark > High latency of event processing for large jobs > --- > > Key: SPARK-18838 > URL: https://issues.apache.org/jira/browse/SPARK-18838 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Sital Kedia >Assignee: Apache Spark > > Currently we are observing the issue of very high event processing delay in > driver's `ListenerBus` for large jobs with many tasks. Many critical > component of the scheduler like `ExecutorAllocationManager`, > `HeartbeatReceiver` depend on the `ListenerBus` events and these delay is > causing job failure. For example, a significant delay in receiving the > `SparkListenerTaskStart` might cause `ExecutorAllocationManager` manager to > remove an executor which is not idle. The event processor in `ListenerBus` > is a single thread which loops through all the Listeners for each event and > processes each event synchronously > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94. > > The single threaded processor often becomes the bottleneck for large jobs. > In addition to that, if one of the Listener is very slow, all the listeners > will pay the price of delay incurred by the slow listener. > To solve the above problems, we plan to have a per listener single threaded > executor service and separate event queue. That way we are not bottlenecked > by the single threaded event processor and also critical listeners will not > be penalized by the slow listeners. The downside of this approach is separate > event queue per listener will increase the driver memory footprint. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18874) First phase: Deferring the correlated predicate pull up to Optimizer phase
Nattavut Sutyanyong created SPARK-18874: --- Summary: First phase: Deferring the correlated predicate pull up to Optimizer phase Key: SPARK-18874 URL: https://issues.apache.org/jira/browse/SPARK-18874 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Nattavut Sutyanyong This JIRA implements the first phase of SPARK-18455 by deferring the correlated predicate pull up from Analyzer to Optimizer. The goal is to preserve the current functionality of subquery in Spark 2.0 (if it works, it continues to work after this JIRA, if it does not, it won't). The performance of subquery processing is expected to be at par with Spark 2.0. The representation of the LogicalPlan after Analyzer will be different after this JIRA that it will preserve the original positions of correlated predicates in a subquery. This new representation is a preparation work for the second phase of extending the support of correlated subquery to cases Spark 2.0 does not support such as deep correlation, outer references in SELECT clause. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18873) New test cases for scalar subquery
Nattavut Sutyanyong created SPARK-18873: --- Summary: New test cases for scalar subquery Key: SPARK-18873 URL: https://issues.apache.org/jira/browse/SPARK-18873 Project: Spark Issue Type: Sub-task Components: SQL, Tests Reporter: Nattavut Sutyanyong This JIRA is for submitting a PR for new test cases on scalar subquery. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18872) New test cases for EXISTS subquery
Nattavut Sutyanyong created SPARK-18872: --- Summary: New test cases for EXISTS subquery Key: SPARK-18872 URL: https://issues.apache.org/jira/browse/SPARK-18872 Project: Spark Issue Type: Sub-task Components: SQL, Tests Reporter: Nattavut Sutyanyong This JIRA is for submitting a PR for new EXISTS/NOT EXISTS subquery test cases. It follows the same idea as the IN subquery test cases which contain simple patterns, then build more complex constructs in both parent and subquery sides. This batch of test cases are mostly, if not all, positive test cases that do not return any syntax errors or unsupported functionality. We make effort to have test cases returning rows in the result set so that they can indirectly detect incorrect result problems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
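As an illustration of the kind of positive EXISTS/NOT EXISTS cases described, with rows in the result set so that incorrect results are detectable: the example below uses SQLite as a stand-in engine (not the actual `SQLQueryTestSuite` files), and shows that EXISTS and NOT EXISTS partition the parent rows cleanly even in the presence of NULLs, unlike IN/NOT IN:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp(id INTEGER, dept INTEGER);
    CREATE TABLE dept(id INTEGER);
    INSERT INTO emp VALUES (1, 10), (2, 20), (3, NULL);
    INSERT INTO dept VALUES (10), (30);
""")

# Correlated EXISTS: keep employees whose department exists.
exists_rows = conn.execute("""
    SELECT id FROM emp e
    WHERE EXISTS (SELECT 1 FROM dept d WHERE d.id = e.dept)
    ORDER BY id
""").fetchall()

# Correlated NOT EXISTS: the complement. A NULL dept never matches,
# so the subquery is empty and NOT EXISTS is true for that row.
not_exists_rows = conn.execute("""
    SELECT id FROM emp e
    WHERE NOT EXISTS (SELECT 1 FROM dept d WHERE d.id = e.dept)
    ORDER BY id
""").fetchall()
```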
[jira] [Commented] (SPARK-18699) Spark CSV parsing types other than String throws exception when malformed
[ https://issues.apache.org/jira/browse/SPARK-18699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750163#comment-15750163 ] Rishi Kamaleswaran commented on SPARK-18699: Thanks for the reply! Unfortunately neither of those options work in my case. > Spark CSV parsing types other than String throws exception when malformed > - > > Key: SPARK-18699 > URL: https://issues.apache.org/jira/browse/SPARK-18699 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Jakub Nowacki > > If CSV is read and the schema contains any other type than String, exception > is thrown when the string value in CSV is malformed; e.g. if the timestamp > does not match the defined one, an exception is thrown: > {code} > Caused by: java.lang.IllegalArgumentException > at java.sql.Date.valueOf(Date.java:143) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272) > at scala.util.Try.getOrElse(Try.scala:79) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269) > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116) > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127) > at 
scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258) > ... 8 more > {code} > It behaves similarly with Integer and Long types, from what I've seen. > To my understanding modes PERMISSIVE and DROPMALFORMED should just null the > value or drop the line, but instead they kill the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
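The behavior the reporter expects from PERMISSIVE and DROPMALFORMED (null the bad field, or drop the record, instead of killing the job) can be sketched outside Spark. `cast_row` is a hypothetical helper, not Spark's actual CSV code path:

```python
from datetime import datetime

def cast_row(values, casters, mode="PERMISSIVE"):
    # Sketch of the *expected* malformed-record handling: one caster
    # per column; a failed cast is handled according to the parse mode.
    out = []
    for value, cast in zip(values, casters):
        try:
            out.append(cast(value))
        except ValueError:
            if mode == "PERMISSIVE":
                out.append(None)   # null the malformed field
            elif mode == "DROPMALFORMED":
                return None        # drop the whole record
            else:                  # FAILFAST: propagate the error
                raise
    return out
```

The bug in the report is that the real CSV cast for non-String types raises regardless of mode, i.e. every mode behaves like FAILFAST above.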
[jira] [Commented] (SPARK-18281) toLocalIterator yields time out error on pyspark2
[ https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750158#comment-15750158 ] Liang-Chi Hsieh commented on SPARK-18281: - Hi [~holdenk], what you meant for "we immediately do a foreach on the Scala iterator which is somewhat strange."? > toLocalIterator yields time out error on pyspark2 > - > > Key: SPARK-18281 > URL: https://issues.apache.org/jira/browse/SPARK-18281 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.1 > Environment: Ubuntu 14.04.5 LTS > Driver: AWS M4.XLARGE > Slaves: AWS M4.4.XLARGE > mesos 1.0.1 > spark 2.0.1 > pyspark >Reporter: Luke Miner > > I run the example straight out of the api docs for toLocalIterator and it > gives a time out exception: > {code} > from pyspark import SparkContext > sc = SparkContext() > rdd = sc.parallelize(range(10)) > [x for x in rdd.toLocalIterator()] > {code} > conf file: > spark.driver.maxResultSize 6G > spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G > -XX:+HeapDumpOnOutOfMemoryError > spark.executor.memory 16G > spark.executor.uri foo/spark-2.0.1-bin-hadoop2.7.tgz > spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem > spark.hadoop.fs.s3a.buffer.dir /raid0/spark > spark.hadoop.fs.s3n.buffer.dir /raid0/spark > spark.hadoop.fs.s3a.connection.timeout 50 > spark.hadoop.fs.s3n.multipart.uploads.enabled true > spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2 > spark.hadoop.parquet.block.size 2147483648 > spark.hadoop.parquet.enable.summary-metadatafalse > spark.jars.packages > com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34 > spark.local.dir /raid0/spark > spark.mesos.coarse false > spark.mesos.constraints priority:1 > spark.network.timeout 600 > spark.rpc.message.maxSize500 > spark.speculation false > spark.sql.parquet.mergeSchema false > spark.sql.planner.externalSort true > spark.submit.deployMode client > spark.task.cpus 1 > Exception here: > 
{code} > --- > timeout Traceback (most recent call last) > in () > 2 sc = SparkContext() > 3 rdd = sc.parallelize(range(10)) > > 4 [x for x in rdd.toLocalIterator()] > /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in > _load_from_socket(port, serializer) > 140 try: > 141 rf = sock.makefile("rb", 65536) > --> 142 for item in serializer.load_stream(rf): > 143 yield item > 144 finally: > /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in > load_stream(self, stream) > 137 while True: > 138 try: > --> 139 yield self._read_with_length(stream) > 140 except EOFError: > 141 return > /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in > _read_with_length(self, stream) > 154 > 155 def _read_with_length(self, stream): > --> 156 length = read_int(stream) > 157 if length == SpecialLengths.END_OF_DATA_SECTION: > 158 raise EOFError > /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in > read_int(stream) > 541 > 542 def read_int(stream): > --> 543 length = stream.read(4) > 544 if not length: > 545 raise EOFError > /usr/lib/python2.7/socket.pyc in read(self, size) > 378 # fragmentation issues on many platforms. > 379 try: > --> 380 data = self._sock.recv(left) > 381 except error, e: > 382 if e.args[0] == EINTR: > timeout: timed out > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18871) New test cases for IN subquery
Nattavut Sutyanyong created SPARK-18871: --- Summary: New test cases for IN subquery Key: SPARK-18871 URL: https://issues.apache.org/jira/browse/SPARK-18871 Project: Spark Issue Type: Sub-task Components: SQL, Tests Reporter: Nattavut Sutyanyong This JIRA is open for submitting a PR for new test cases for IN/NOT IN subquery. We plan to put approximately 100+ test cases under `SQLQueryTestSuite`. The test cases range from IN/NOT IN subqueries with simple SELECT in both parent and subquery to subqueries with more complex constructs in both sides (joins, aggregates, etc.) Test data include null value, and duplicate values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
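Null values in the test data matter here because NOT IN follows SQL's three-valued logic, which is a common source of incorrect-result bugs. A self-contained illustration (SQLite standing in for Spark SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent(a INTEGER);
    CREATE TABLE child(b INTEGER);
    INSERT INTO parent VALUES (1), (2), (3);
    INSERT INTO child VALUES (1), (NULL);
""")

# IN returns only rows with a known match; comparisons against the
# NULL in child evaluate to unknown, so 2 and 3 are filtered out.
in_rows = conn.execute(
    "SELECT a FROM parent WHERE a IN (SELECT b FROM child)").fetchall()

# NOT IN against a set containing NULL returns no rows at all:
# NOT(unknown) is still unknown, so no row satisfies the predicate.
not_in_rows = conn.execute(
    "SELECT a FROM parent WHERE a NOT IN (SELECT b FROM child)").fetchall()
```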
[jira] [Commented] (SPARK-18455) General support for correlated subquery processing
[ https://issues.apache.org/jira/browse/SPARK-18455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750145#comment-15750145 ] Nattavut Sutyanyong commented on SPARK-18455: - Quantified predicate is not planned for this work. It would be a future work. Equality predicates with [ANY | ALL] could be transformed to other currently supported forms but inequality predicates make the transformation more complex. Null values may not be a main hurdle as comparison operators (=, >, >=, <, <=, !=) are null-tolerant operators. > General support for correlated subquery processing > -- > > Key: SPARK-18455 > URL: https://issues.apache.org/jira/browse/SPARK-18455 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Nattavut Sutyanyong > Attachments: SPARK-18455-scoping-doc.pdf > > > Subquery support has been introduced in Spark 2.0. The initial implementation > covers the most common subquery use case: the ones used in TPC queries for > instance. > Spark currently supports the following subqueries: > * Uncorrelated Scalar Subqueries. All cases are supported. > * Correlated Scalar Subqueries. We only allow subqueries that are aggregated > and use equality predicates. > * Predicate Subqueries. IN or Exists type of queries. We allow most > predicates, except when they are pulled from under an Aggregate or Window > operator. In that case we only support equality predicates. > However this does not cover the full range of possible subqueries. This, in > part, has to do with the fact that we currently rewrite all correlated > subqueries into a (LEFT/LEFT SEMI/LEFT ANTI) join. > We currently lack supports for the following use cases: > * The use of predicate subqueries in a projection. > * The use of non-equality predicates below Aggregates and or Window operators. > * The use of non-Aggregate subqueries for correlated scalar subqueries. > This JIRA aims to lift these current limitations in subquery processing. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
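One classic rewrite of a quantified predicate into a currently supported form, with the caveats the comment alludes to, is turning `x > ALL (subquery)` into a comparison against an aggregate. The rewrite is only equivalent when the subquery result is non-empty and null-free (an empty subquery makes `> ALL` true but the MAX form returns nothing), which is part of why the general transformation is complex. SQLite is used below as a stand-in, since like Spark it has no `ALL` quantifier syntax:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t(x INTEGER);
    CREATE TABLE s(y INTEGER);
    INSERT INTO t VALUES (5), (10), (20);
    INSERT INTO s VALUES (5), (10);
""")

# Rewrite:  x > ALL (SELECT y FROM s)  ==>  x > (SELECT MAX(y) FROM s)
# Valid here because s is non-empty and contains no NULLs.
rows = conn.execute(
    "SELECT x FROM t WHERE x > (SELECT MAX(y) FROM s)").fetchall()
```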
[jira] [Comment Edited] (SPARK-18455) General support for correlated subquery processing
[ https://issues.apache.org/jira/browse/SPARK-18455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750128#comment-15750128 ] Nattavut Sutyanyong edited comment on SPARK-18455 at 12/15/16 2:17 AM: --- I have attached a scoping document of this work to the record. was (Author: nsyca): Scoping document > General support for correlated subquery processing > -- > > Key: SPARK-18455 > URL: https://issues.apache.org/jira/browse/SPARK-18455 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Nattavut Sutyanyong > Attachments: SPARK-18455-scoping-doc.pdf > > > Subquery support has been introduced in Spark 2.0. The initial implementation > covers the most common subquery use case: the ones used in TPC queries for > instance. > Spark currently supports the following subqueries: > * Uncorrelated Scalar Subqueries. All cases are supported. > * Correlated Scalar Subqueries. We only allow subqueries that are aggregated > and use equality predicates. > * Predicate Subqueries. IN or Exists type of queries. We allow most > predicates, except when they are pulled from under an Aggregate or Window > operator. In that case we only support equality predicates. > However this does not cover the full range of possible subqueries. This, in > part, has to do with the fact that we currently rewrite all correlated > subqueries into a (LEFT/LEFT SEMI/LEFT ANTI) join. > We currently lack supports for the following use cases: > * The use of predicate subqueries in a projection. > * The use of non-equality predicates below Aggregates and or Window operators. > * The use of non-Aggregate subqueries for correlated scalar subqueries. > This JIRA aims to lift these current limitations in subquery processing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18455) General support for correlated subquery processing
[ https://issues.apache.org/jira/browse/SPARK-18455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nattavut Sutyanyong updated SPARK-18455: Attachment: SPARK-18455-scoping-doc.pdf Scoping document > General support for correlated subquery processing > -- > > Key: SPARK-18455 > URL: https://issues.apache.org/jira/browse/SPARK-18455 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Nattavut Sutyanyong > Attachments: SPARK-18455-scoping-doc.pdf > > > Subquery support has been introduced in Spark 2.0. The initial implementation > covers the most common subquery use case: the ones used in TPC queries for > instance. > Spark currently supports the following subqueries: > * Uncorrelated Scalar Subqueries. All cases are supported. > * Correlated Scalar Subqueries. We only allow subqueries that are aggregated > and use equality predicates. > * Predicate Subqueries. IN or Exists type of queries. We allow most > predicates, except when they are pulled from under an Aggregate or Window > operator. In that case we only support equality predicates. > However this does not cover the full range of possible subqueries. This, in > part, has to do with the fact that we currently rewrite all correlated > subqueries into a (LEFT/LEFT SEMI/LEFT ANTI) join. > We currently lack supports for the following use cases: > * The use of predicate subqueries in a projection. > * The use of non-equality predicates below Aggregates and or Window operators. > * The use of non-Aggregate subqueries for correlated scalar subqueries. > This JIRA aims to lift these current limitations in subquery processing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18861) Spark-SQL inconsistent behavior with "struct" expressions
[ https://issues.apache.org/jira/browse/SPARK-18861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-18861. -- Resolution: Not A Problem I see. This seems actually a behaviour documented in https://github.com/apache/spark/blob/f830bb9170f6b853565d9dd30ca7418b93a54fe3/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1196-L1200 The cases except for the first case seem working fine as expected in the current master {code} scala> Seq((1, 2, 3), (2, 3, 4)).toDF("a", "b", "c").createOrReplaceTempView("t1") scala> sql("SELECT case when a>b then struct(cast(a as int), cast(b as int)) else struct(cast(c as int), cast(c as int)) end from t1").show() +-+ |CASE WHEN (a > b) THEN named_struct(col1, CAST(a AS INT), col2, CAST(b AS INT)) ELSE named_struct(col1, CAST(c AS INT), col2, CAST(c AS INT)) END| +-+ | [3,3]| | [4,4]| +-+ {code} I am resolving this {{Not A Problem}} as the issue seems obsolete to me. Please reopen this if anyone feels this is an inappropriate action. > Spark-SQL unconsistent behavior with "struct" expressions > - > > Key: SPARK-18861 > URL: https://issues.apache.org/jira/browse/SPARK-18861 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ohad Raviv > > We are getting strangly inconsistent behavior with expressions involving > "struct". 
Let's start with this simple table: > {quote} > Seq((1, 2, 3), (2, 3, 4)).toDF("a", "b", "c").createOrReplaceTempView("t1") > sql("desc t1").show() > {quote} > Then we get this DF: > {quote} > |col_name|data_type|comment| > | a| int| | > | b| int| | > | c| int| | > {quote} > Now, although we can clearly see that all the fields are of type int, when we > run: > {quote} > sql("SELECT case when a>b then struct(a,b) else struct(c,c) end from t1") > {quote} > we get this error: > {quote} > org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (t1.`a` > > t1.`b`) THEN struct(t1.`a`, t1.`b`) ELSE struct(t1.`c`, t1.`c`) END' due to > data type mismatch: THEN and ELSE expressions should all be same type or > coercible to a common type; line 1 pos 7 > {quote} > if we try this: > {quote} > sql("SELECT case when a>b then struct(cast(a as int), cast(b as int)) else > struct(cast(c as int), cast(c as int)) end from t1") > {quote} > we get another exception: > {quote} > requirement failed: Unresolved attributes found when constructing > LocalRelation. > java.lang.IllegalArgumentException: requirement failed: Unresolved attributes > found when constructing LocalRelation. > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.sql.catalyst.plans.logical.LocalRelation.(LocalRelation.scala:49) > {quote} > However, these do work: > {quote} > sql("SELECT case when a>b then struct(cast(a as double), cast(b as double)) > else struct(cast(c as double), cast(c as double)) end from t1") > sql("SELECT case when a>b then struct(cast(a as string), cast(b as string)) > else struct(cast(c as string), cast(c as string)) end from t1") > {quote} > Any ideas? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
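The documented behaviour Hyukjin points at — {{struct()}} naming each field after the originating column — explains the original mismatch: the THEN and ELSE branches build structs whose field names differ, so the struct types differ even though every field is an int. A toy model of that type check (illustrative only, not Spark's resolver code):

```python
# Toy model of why `struct(a, b)` vs `struct(c, c)` fails type-checking:
# struct() names each field after the input column, so the two CASE branches
# produce *different* struct types even though every field is int. Wrapping
# the columns in cast() replaces the bare column reference, so fields fall
# back to generic col1, col2, ... names and the branch types line up.

def struct_type(exprs):
    """(name, datatype) pairs; a plain column keeps its own name,
    any other expression gets a positional colN name."""
    return tuple(
        (e["name"] if e["kind"] == "column" else f"col{i + 1}", e["type"])
        for i, e in enumerate(exprs)
    )

col = lambda n: {"kind": "column", "name": n, "type": "int"}
cast = lambda n, t: {"kind": "cast", "name": n, "type": t}

then_branch = struct_type([col("a"), col("b")])   # (('a','int'), ('b','int'))
else_branch = struct_type([col("c"), col("c")])   # (('c','int'), ('c','int'))
print(then_branch == else_branch)                 # False -> AnalysisException

then_cast = struct_type([cast("a", "int"), cast("b", "int")])
else_cast = struct_type([cast("c", "int"), cast("c", "int")])
print(then_cast == else_cast)                     # True -> resolves fine
```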
[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default
[ https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750076#comment-15750076 ] Apache Spark commented on SPARK-18817: -- User 'shivaram' has created a pull request for this issue: https://github.com/apache/spark/pull/16290 > Ensure nothing is written outside R's tempdir() by default > -- > > Key: SPARK-18817 > URL: https://issues.apache.org/jira/browse/SPARK-18817 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Brendan Dwyer >Priority: Critical > > Per CRAN policies > https://cran.r-project.org/web/packages/policies.html > {quote} > - Packages should not write in the usersโ home filespace, nor anywhere else > on the file system apart from the R sessionโs temporary directory (or during > installation in the location pointed to by TMPDIR: and such usage should be > cleaned up). Installing into the systemโs R installation (e.g., scripts to > its bin directory) is not allowed. > Limited exceptions may be allowed in interactive sessions if the package > obtains confirmation from the user. > - Packages should not modify the global environment (userโs workspace). > {quote} > Currently "spark-warehouse" gets created in the working directory when > sparkR.session() is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18870) Distinct aggregates give incorrect answers on streaming dataframes
[ https://issues.apache.org/jira/browse/SPARK-18870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18870: Assignee: Tathagata Das (was: Apache Spark) > Distinct aggregates give incorrect answers on streaming dataframes > -- > > Key: SPARK-18870 > URL: https://issues.apache.org/jira/browse/SPARK-18870 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Blocker > > Unsupported operations checking dont check whether AggregationExpression have > isDistinct=true. So `streamingDf.groupBy().agg(countDistinct("key")) ` gives > incorrect results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18870) Distinct aggregates give incorrect answers on streaming dataframes
[ https://issues.apache.org/jira/browse/SPARK-18870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18870: Assignee: Apache Spark (was: Tathagata Das) > Distinct aggregates give incorrect answers on streaming dataframes > -- > > Key: SPARK-18870 > URL: https://issues.apache.org/jira/browse/SPARK-18870 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2 >Reporter: Tathagata Das >Assignee: Apache Spark >Priority: Blocker > > Unsupported operations checking dont check whether AggregationExpression have > isDistinct=true. So `streamingDf.groupBy().agg(countDistinct("key")) ` gives > incorrect results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18870) Distinct aggregates give incorrect answers on streaming dataframes
[ https://issues.apache.org/jira/browse/SPARK-18870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750061#comment-15750061 ] Apache Spark commented on SPARK-18870: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/16289 > Distinct aggregates give incorrect answers on streaming dataframes > -- > > Key: SPARK-18870 > URL: https://issues.apache.org/jira/browse/SPARK-18870 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Blocker > > Unsupported operations checking dont check whether AggregationExpression have > isDistinct=true. So `streamingDf.groupBy().agg(countDistinct("key")) ` gives > incorrect results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-18783) ML StringIndexer does not work with nested fields
[ https://issues.apache.org/jira/browse/SPARK-18783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-18783. - Resolution: Won't Fix > ML StringIndexer does not work with nested fields > - > > Key: SPARK-18783 > URL: https://issues.apache.org/jira/browse/SPARK-18783 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.0 >Reporter: manuel garrido > > Using StringIndexer.transform with a nested field (from parsing json data) > results in the output dataframe not having the new column. > {code} > sample = [ > {'city': u'', > 'device': {u'make': u'HTC', >u'os': u'Android'} > }, > {'city': u'Bangalore', > 'device': {u'make': u'Xiaomi', >u'os': u'Android'} > }, > {'city': u'Overpelt', > 'device': {u'make': u'Samsung', >u'os': u'Android'} > } > ] > sample_df = sc.parallelize(sample).toDF() > # First we use a StringIndexer with a non nested field > city_indexer = StringIndexer(inputCol="city", outputCol="cityIndex", > handleInvalid="skip") > city_indexed = city_indexer.fit(sample_df).transform(sample_df) > print([i.asDict() for i in city_indexed.collect()]) > >>>[{'device': {u'make': u'HTC', u'os': u'Android'}, 'city': u'', > >>>'cityIndex': 0.0}, {'device': {u'make': u'Xiaomi', u'os': u'Android'}, > >>>'city': u'Bangalore', 'cityIndex': 2.0}, {'device': {u'make': u'Samsung', > >>>u'os': u'Android'}, 'city': u'Overpelt', 'cityIndex': 1.0}] > # Now we try with a nested field > os_indexer = StringIndexer(inputCol="device.os", outputCol="osIndex", > handleInvalid="skip") > os_indexed = os_indexer.fit(sample_df).transform(sample_df) > print([i.asDict() for i in os_indexed.collect()]) > >>>[{'device': {u'make': u'HTC', u'os': u'Android'}, 'city': u''}, {'device': > >>>{u'make': u'Xiaomi', u'os': u'Android'}, 'city': u'Bangalore'}, {'device': > >>>{u'make': u'Samsung', u'os': u'Android'}, 'city': u'Overpelt'}] #===> we > >>>see the field osIndex is not showing up > #If we rename the same field device.os as a 
flat field it works as expected > os_indexer = StringIndexer(inputCol="device_os", outputCol="osIndex", > handleInvalid="skip") > os_indexed = os_indexer.fit( > sample_df.withColumn('device_os', col('device.os')) > ).transform( > sample_df.withColumn('device_os', col('device.os')) > ) > print([i.asDict() for i in os_indexed.collect()]) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18783) ML StringIndexer does not work with nested fields
[ https://issues.apache.org/jira/browse/SPARK-18783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750057#comment-15750057 ] Joseph K. Bradley commented on SPARK-18783: --- I'd separate this into 2 issues: 1. nested fields (new feature) 2. silent failure during transform For issue #1: I doubt we'll support nested fields soon, though it would be neat to have in the future. One related issue is multi-column support: [SPARK-8418]. For issue #2: This is because of a hack we did to allow PipelineModel.transform() to work without a label column. During fitting, the StringIndexerModel would index the label. But during prediction/transform, there would not be a label. It's documented here: [http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.StringIndexerModel] We ran into this issue here: [SPARK-8051]. Long term, we should think about adding a Param to PipelineStage to turn the stage on/off during fit/transform. That's a pretty awkward API, though, so we'll have to discuss it. I'm going to close this since I don't think we'll add nesting in the near future, but we can continue the conversation as needed. Thanks! > ML StringIndexer does not work with nested fields > - > > Key: SPARK-18783 > URL: https://issues.apache.org/jira/browse/SPARK-18783 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.0 >Reporter: manuel garrido > > Using StringIndexer.transform with a nested field (from parsing json data) > results in the output dataframe not having the new column. 
> {code} > sample = [ > {'city': u'', > 'device': {u'make': u'HTC', >u'os': u'Android'} > }, > {'city': u'Bangalore', > 'device': {u'make': u'Xiaomi', >u'os': u'Android'} > }, > {'city': u'Overpelt', > 'device': {u'make': u'Samsung', >u'os': u'Android'} > } > ] > sample_df = sc.parallelize(sample).toDF() > # First we use a StringIndexer with a non nested field > city_indexer = StringIndexer(inputCol="city", outputCol="cityIndex", > handleInvalid="skip") > city_indexed = city_indexer.fit(sample_df).transform(sample_df) > print([i.asDict() for i in city_indexed.collect()]) > >>>[{'device': {u'make': u'HTC', u'os': u'Android'}, 'city': u'', > >>>'cityIndex': 0.0}, {'device': {u'make': u'Xiaomi', u'os': u'Android'}, > >>>'city': u'Bangalore', 'cityIndex': 2.0}, {'device': {u'make': u'Samsung', > >>>u'os': u'Android'}, 'city': u'Overpelt', 'cityIndex': 1.0}] > # Now we try with a nested field > os_indexer = StringIndexer(inputCol="device.os", outputCol="osIndex", > handleInvalid="skip") > os_indexed = os_indexer.fit(sample_df).transform(sample_df) > print([i.asDict() for i in os_indexed.collect()]) > >>>[{'device': {u'make': u'HTC', u'os': u'Android'}, 'city': u''}, {'device': > >>>{u'make': u'Xiaomi', u'os': u'Android'}, 'city': u'Bangalore'}, {'device': > >>>{u'make': u'Samsung', u'os': u'Android'}, 'city': u'Overpelt'}] #===> we > >>>see the field osIndex is not showing up > #If we rename the same field device.os as a flat field it works as expected > os_indexer = StringIndexer(inputCol="device_os", outputCol="osIndex", > handleInvalid="skip") > os_indexed = os_indexer.fit( > sample_df.withColumn('device_os', col('device.os')) > ).transform( > sample_df.withColumn('device_os', col('device.os')) > ) > print([i.asDict() for i in os_indexed.collect()]) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
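The workaround shown in the report — materialising {{device.os}} as a flat top-level column before fitting — can be mimicked without Spark. The flattener below is a toy stand-in for {{withColumn('device_os', col('device.os'))}}; the function name is hypothetical.

```python
# Toy stand-in for the report's workaround: resolve a dotted path like
# "device.os" into a flat top-level column before handing the data to a
# transformer (such as StringIndexer) that only resolves flat column names.

def flatten(rows, path, out_col):
    """Copy the value found at a dotted path into a new flat column."""
    for row in rows:
        value = row
        for part in path.split("."):
            value = value[part]   # descend one nesting level per path part
        row[out_col] = value
    return rows

sample = [
    {"city": "", "device": {"make": "HTC", "os": "Android"}},
    {"city": "Bangalore", "device": {"make": "Xiaomi", "os": "Android"}},
]
flatten(sample, "device.os", "device_os")
print([r["device_os"] for r in sample])  # ['Android', 'Android']
```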
[jira] [Commented] (SPARK-18699) Spark CSV parsing types other than String throws exception when malformed
[ https://issues.apache.org/jira/browse/SPARK-18699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750050#comment-15750050 ] Hyukjin Kwon commented on SPARK-18699: -- BTW, maybe you could try to set {{nullValue}} to {{" "}} or set {{ignoreLeadingWhiteSpace}} and {{ignoreTrailingWhiteSpace}} to {{true}} for now. > Spark CSV parsing types other than String throws exception when malformed > - > > Key: SPARK-18699 > URL: https://issues.apache.org/jira/browse/SPARK-18699 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Jakub Nowacki > > If CSV is read and the schema contains any other type than String, exception > is thrown when the string value in CSV is malformed; e.g. if the timestamp > does not match the defined one, an exception is thrown: > {code} > Caused by: java.lang.IllegalArgumentException > at java.sql.Date.valueOf(Date.java:143) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272) > at scala.util.Try.getOrElse(Try.scala:79) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269) > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116) > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128) > at > 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258) > ... 8 more > {code} > It behaves similarly with Integer and Long types, from what I've seen. > To my understanding modes PERMISSIVE and DROPMALFORMED should just null the > value or drop the line, but instead they kill the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
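The behaviour the reporter expects from the parse modes can be sketched in a few lines. This is a toy parser illustrating the intended semantics, not the CSV data source itself:

```python
# Toy sketch of the parse-mode semantics the reporter expects: PERMISSIVE
# nulls out a malformed typed field, DROPMALFORMED drops the whole record,
# and only FAILFAST propagates the exception that would kill the job.

def parse_rows(lines, caster, mode="PERMISSIVE"):
    out = []
    for raw in lines:
        try:
            out.append(caster(raw))
        except ValueError:
            if mode == "PERMISSIVE":
                out.append(None)      # keep the row, null the bad field
            elif mode == "DROPMALFORMED":
                continue              # silently drop the malformed row
            else:                     # FAILFAST
                raise
    return out

lines = ["1", "oops", "3"]
print(parse_rows(lines, int, "PERMISSIVE"))     # [1, None, 3]
print(parse_rows(lines, int, "DROPMALFORMED"))  # [1, 3]
```

The bug report amounts to saying that the real cast path behaved like FAILFAST regardless of the configured mode.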
[jira] [Commented] (SPARK-18795) SparkR vignette update: ksTest
[ https://issues.apache.org/jira/browse/SPARK-18795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750043#comment-15750043 ] Joseph K. Bradley commented on SPARK-18795: --- No problem, thanks for understanding. > SparkR vignette update: ksTest > -- > > Key: SPARK-18795 > URL: https://issues.apache.org/jira/browse/SPARK-18795 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > Fix For: 2.1.1, 2.2.0 > > > Update vignettes to cover ksTest -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18374) Incorrect words in StopWords/english.txt
[ https://issues.apache.org/jira/browse/SPARK-18374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750041#comment-15750041 ] Joseph K. Bradley commented on SPARK-18374: --- Oh nice, I didn't realize that was in use. I'll start doing that. > Incorrect words in StopWords/english.txt > > > Key: SPARK-18374 > URL: https://issues.apache.org/jira/browse/SPARK-18374 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.1 >Reporter: nirav patel >Assignee: yuhao yang >Priority: Minor > Labels: releasenotes > Fix For: 2.2.0 > > > I was just double-checking english.txt's list of stop words because I felt it was > removing valid tokens like 'won'. I think the issue is that the english.txt list is > missing the apostrophe character and every character after the apostrophe. So "won't" > became "won" in that list, and "wouldn't" became "wouldn". > Here are some incorrect tokens in this list: > won > wouldn > ma > mightn > mustn > needn > shan > shouldn > wasn > weren > I think the ideal list should have both styles, i.e. both won't and wont should be > part of english.txt, since some tokenizers might remove special characters. But > 'won' obviously shouldn't be in this list. > Here's the list of Snowball English stop words: > http://snowball.tartarus.org/algorithms/english/stop.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
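The reporter's point — the list should carry both the apostrophised form and the apostrophe-stripped form, but never the bare truncation — can be illustrated with a toy tokenizer. The regex and word choices here are illustrative, not Spark's StopWordsRemover:

```python
import re

# Toy illustration of the reporter's point: a tokenizer that strips
# non-letters turns "won't" into "wont", so a stop-word list needs both
# the apostrophised and the stripped spellings -- but never the bare
# truncation "won", which is a legitimate token in its own right.

def tokenize(text):
    # Strip everything but lowercase letters from each word (illustrative).
    return [re.sub(r"[^a-z]", "", w) for w in text.lower().split()]

stop_words = {"won't", "wont", "wouldn't", "wouldnt"}  # both spellings kept

tokens = tokenize("They won't stop; she won the race")
kept = [t for t in tokens if t not in stop_words]
print(kept)  # ['they', 'stop', 'she', 'won', 'the', 'race']
```

With the truncated list from the report, "won" would sit in `stop_words` and the legitimate verb "won" would be filtered out too, which is exactly the reported bug.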
[jira] [Updated] (SPARK-18849) Vignettes final checks for Spark 2.1
[ https://issues.apache.org/jira/browse/SPARK-18849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18849: -- Target Version/s: 2.1.0 > Vignettes final checks for Spark 2.1 > > > Key: SPARK-18849 > URL: https://issues.apache.org/jira/browse/SPARK-18849 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Xiangrui Meng >Assignee: Felix Cheung > > Make a final pass over the vignettes and ensure the content is consistent. > * remove "since version" because is not that useful for vignettes > * re-order/group the list of ML algorithms so there exists a logical ordering > * check for warning or error in output message > * anything else that seems out of place -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18703) Insertion/CTAS against Hive Tables: Staging Directories and Data Files Not Dropped Until Normal Termination of JVM
[ https://issues.apache.org/jira/browse/SPARK-18703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-18703. - Resolution: Fixed Assignee: Xiao Li Fix Version/s: 2.2.0 > Insertion/CTAS against Hive Tables: Staging Directories and Data Files Not > Dropped Until Normal Termination of JVM > -- > > Key: SPARK-18703 > URL: https://issues.apache.org/jira/browse/SPARK-18703 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Below are the files/directories generated for three inserts againsts a Hive > table: > {noformat} > /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1 > /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1 > /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/._SUCCESS.crc > /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/.part-0.crc > /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/_SUCCESS > /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/part-0 > /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1 > 
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1 > /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/._SUCCESS.crc > /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/.part-0.crc > /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/_SUCCESS > /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/part-0 > /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1 > /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1 > /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/._SUCCESS.crc > /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/.part-0.crc > /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/_SUCCESS > /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/part-0 > 
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-0.crc > /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-0 > {noformat} > The first 18 files are temporary. We do not drop it until the end of JVM > termination. If JVM does not appropriately terminate, these temporary > files/directories will not be dropped. > Only the last two files are needed, as shown below. > {noformat} > /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-0.crc >
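The direction the ticket implies — dropping each {{.hive-staging_*}} directory as soon as its insert commits, rather than registering it for deletion at JVM shutdown — can be sketched generically. This is a pure-Python stand-in with hypothetical function names, not Spark's Hive write path:

```python
import shutil
import tempfile
from pathlib import Path

# Generic sketch of the fix the ticket implies: delete each staging
# directory as soon as its write commits, instead of deferring cleanup to
# JVM shutdown (which never runs if the process terminates abnormally).

def insert_with_staging(table_dir, data):
    staging = Path(tempfile.mkdtemp(prefix=".hive-staging_", dir=table_dir))
    try:
        tmp = staging / "part-0"
        tmp.write_text(data)                      # write to staging first
        tmp.replace(Path(table_dir) / "part-0")   # commit into the table
    finally:
        shutil.rmtree(staging, ignore_errors=True)  # eager cleanup

table = tempfile.mkdtemp()
insert_with_staging(table, "rows")
leftovers = sorted(p.name for p in Path(table).iterdir())
print(leftovers)  # ['part-0'] -- no staging directory left behind
```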
[jira] [Updated] (SPARK-18870) Distinct aggregates give incorrect answers on streaming dataframes
[ https://issues.apache.org/jira/browse/SPARK-18870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-18870: -- Affects Version/s: 2.0.2 Target Version/s: 2.1.0 Component/s: Structured Streaming > Distinct aggregates give incorrect answers on streaming dataframes > -- > > Key: SPARK-18870 > URL: https://issues.apache.org/jira/browse/SPARK-18870 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Blocker > > Unsupported operations checking dont check whether AggregationExpression have > isDistinct=true. So `streamingDf.groupBy().agg(countDistinct("key")) ` gives > incorrect results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18870) Distinct aggregates give incorrect answers on streaming dataframes
Tathagata Das created SPARK-18870: - Summary: Distinct aggregates give incorrect answers on streaming dataframes Key: SPARK-18870 URL: https://issues.apache.org/jira/browse/SPARK-18870 Project: Spark Issue Type: Bug Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Unsupported operations checking doesn't check whether AggregationExpressions have isDistinct=true, so `streamingDf.groupBy().agg(countDistinct("key"))` gives incorrect results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
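The missing check the ticket describes — walking the aggregate expressions of a streaming plan and rejecting any with isDistinct set — can be modelled in a few lines. The class and function names below are hypothetical, not Spark's UnsupportedOperationChecker:

```python
# Toy model of the missing check: the unsupported-operations pass must walk
# every aggregate expression and reject isDistinct=True on a streaming
# DataFrame, since distinct aggregation needs state the engine doesn't keep.

class AggExpr:
    def __init__(self, name, is_distinct=False):
        self.name = name
        self.is_distinct = is_distinct

def check_streaming_aggregates(agg_exprs):
    """Return the names of aggregates that are unsupported on a stream."""
    return [e.name for e in agg_exprs if e.is_distinct]

plan = [AggExpr("count"), AggExpr("countDistinct", is_distinct=True)]
unsupported = check_streaming_aggregates(plan)
print(unsupported)  # ['countDistinct'] -- should raise instead of running
```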
[jira] [Updated] (SPARK-18865) SparkR vignettes MLP and LDA updates
[ https://issues.apache.org/jira/browse/SPARK-18865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18865: -- Target Version/s: 2.1.0 (was: 2.1.1, 2.2.0) > SparkR vignettes MLP and LDA updates > > > Key: SPARK-18865 > URL: https://issues.apache.org/jira/browse/SPARK-18865 > Project: Spark > Issue Type: Documentation > Components: SparkR >Reporter: Miao Wang >Assignee: Miao Wang > Fix For: 2.1.1, 2.2.0 > > > spark.mlp doesn't provide an example. spark.lda and spark.mlp have repeated > content. > spark.lda document misses default values for some parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18865) SparkR vignettes MLP and LDA updates
[ https://issues.apache.org/jira/browse/SPARK-18865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18865: -- Fix Version/s: 2.2.0 2.1.1 > SparkR vignettes MLP and LDA updates > > > Key: SPARK-18865 > URL: https://issues.apache.org/jira/browse/SPARK-18865 > Project: Spark > Issue Type: Documentation > Components: SparkR >Reporter: Miao Wang >Assignee: Miao Wang > Fix For: 2.1.1, 2.2.0 > > > spark.mlp doesn't provide an example. spark.lda and spark.mlp have repeated > content. > spark.lda document misses default values for some parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18865) SparkR vignettes MLP and LDA updates
[ https://issues.apache.org/jira/browse/SPARK-18865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18865: -- Issue Type: Documentation (was: Bug) > SparkR vignettes MLP and LDA updates > > > Key: SPARK-18865 > URL: https://issues.apache.org/jira/browse/SPARK-18865 > Project: Spark > Issue Type: Documentation > Components: SparkR >Reporter: Miao Wang >Assignee: Miao Wang > > spark.mlp doesn't provide an example. spark.lda and spark.mlp have repeated > content. > spark.lda document misses default values for some parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18865) SparkR vignettes MLP and LDA updates
[ https://issues.apache.org/jira/browse/SPARK-18865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-18865. -- Resolution: Fixed Assignee: Miao Wang Target Version/s: 2.1.1, 2.2.0 > SparkR vignettes MLP and LDA updates > > > Key: SPARK-18865 > URL: https://issues.apache.org/jira/browse/SPARK-18865 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Miao Wang >Assignee: Miao Wang > > spark.mlp doesn't provide an example. spark.lda and spark.mlp have repeated > content. > spark.lda document misses default values for some parameters.
[jira] [Updated] (SPARK-18869) Add TreeNode.p that returns BaseType
[ https://issues.apache.org/jira/browse/SPARK-18869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18869: Description: After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_] rather than a more specific type. It would be easier for interactive debugging to introduce a function that returns the BaseType. (was: After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_] rather than a more specific type. It would be easier for interactive debugging to introduce lp that returns LogicalPlan, and pp that returns SparkPlan. ) > Add TreeNode.p that returns BaseType > > > Key: SPARK-18869 > URL: https://issues.apache.org/jira/browse/SPARK-18869 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_] > rather than a more specific type. It would be easier for interactive > debugging to introduce a function that returns the BaseType.
[jira] [Updated] (SPARK-18869) Add TreeNode.p that returns BaseType
[ https://issues.apache.org/jira/browse/SPARK-18869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18869: Summary: Add TreeNode.p that returns BaseType (was: Add lp and pp to plan nodes for getting logical plans and physical plans) > Add TreeNode.p that returns BaseType > > > Key: SPARK-18869 > URL: https://issues.apache.org/jira/browse/SPARK-18869 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_] > rather than a more specific type. It would be easier for interactive > debugging to introduce lp that returns LogicalPlan, and pp that returns > SparkPlan.
[jira] [Assigned] (SPARK-18869) Add lp and pp to plan nodes for getting logical plans and physical plans
[ https://issues.apache.org/jira/browse/SPARK-18869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18869: Assignee: Apache Spark (was: Reynold Xin) > Add lp and pp to plan nodes for getting logical plans and physical plans > > > Key: SPARK-18869 > URL: https://issues.apache.org/jira/browse/SPARK-18869 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_] > rather than a more specific type. It would be easier for interactive > debugging to introduce lp that returns LogicalPlan, and pp that returns > SparkPlan.
[jira] [Commented] (SPARK-18869) Add lp and pp to plan nodes for getting logical plans and physical plans
[ https://issues.apache.org/jira/browse/SPARK-18869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15749968#comment-15749968 ] Apache Spark commented on SPARK-18869: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/16288 > Add lp and pp to plan nodes for getting logical plans and physical plans > > > Key: SPARK-18869 > URL: https://issues.apache.org/jira/browse/SPARK-18869 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_] > rather than a more specific type. It would be easier for interactive > debugging to introduce lp that returns LogicalPlan, and pp that returns > SparkPlan.
[jira] [Assigned] (SPARK-18869) Add lp and pp to plan nodes for getting logical plans and physical plans
[ https://issues.apache.org/jira/browse/SPARK-18869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18869: Assignee: Reynold Xin (was: Apache Spark) > Add lp and pp to plan nodes for getting logical plans and physical plans > > > Key: SPARK-18869 > URL: https://issues.apache.org/jira/browse/SPARK-18869 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_] > rather than a more specific type. It would be easier for interactive > debugging to introduce lp that returns LogicalPlan, and pp that returns > SparkPlan.
[jira] [Created] (SPARK-18869) Add lp and pp to plan nodes for getting logical plans and physical plans
Reynold Xin created SPARK-18869: --- Summary: Add lp and pp to plan nodes for getting logical plans and physical plans Key: SPARK-18869 URL: https://issues.apache.org/jira/browse/SPARK-18869 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_] rather than a more specific type. It would be easier for interactive debugging to introduce lp that returns LogicalPlan, and pp that returns SparkPlan.
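[Editor's note] The typed-lookup idea behind SPARK-18869 can be illustrated with a minimal, self-contained sketch. This is not Spark's actual TreeNode (Catalyst's version is far richer); the names `Node`, `kids`, and `preOrderSeq` are invented for illustration, and `p` is hypothetical, modeled on the proposal: the same index lookup as `apply`, but cast back to `BaseType` so interactive chaining works.

```scala
// Toy F-bounded tree, *not* Spark's TreeNode. Shows why an index-based
// apply returning the wildcard TreeNode[_] loses the concrete node type,
// and how a casting helper like the proposed `p` recovers it.
abstract class TreeNode[BaseType <: TreeNode[BaseType]] { self: BaseType =>
  def children: Seq[BaseType]

  // Pre-order sequence of nodes (the order numberedTreeString would print).
  def preOrderSeq: Seq[BaseType] = this +: children.flatMap(_.preOrderSeq)

  // Index lookup; the wildcard return type mirrors the post-SPARK-18854
  // signature that is awkward for interactive debugging.
  def apply(number: Int): TreeNode[_] = preOrderSeq(number)

  // Hypothetical helper in the spirit of TreeNode.p: same lookup,
  // cast back to BaseType.
  def p(number: Int): BaseType = apply(number).asInstanceOf[BaseType]
}

final case class Node(name: String, kids: Node*) extends TreeNode[Node] {
  def children: Seq[Node] = kids
}

object Demo {
  def main(args: Array[String]): Unit = {
    val t = Node("root", Node("a", Node("b")), Node("c"))
    val untyped: TreeNode[_] = t(1) // static type hides Node-specific members
    val typed: Node = t.p(1)        // cast restores them for chaining
    println(typed.name)
  }
}
```

Under this sketch, `t(1)` and `t.p(1)` find the same node; only the static type differs, which is exactly the ergonomics gap the issue describes.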
[jira] [Assigned] (SPARK-18868) Flaky Test: StreamingQueryListenerSuite
[ https://issues.apache.org/jira/browse/SPARK-18868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18868: Assignee: Apache Spark > Flaky Test: StreamingQueryListenerSuite > --- > > Key: SPARK-18868 > URL: https://issues.apache.org/jira/browse/SPARK-18868 > Project: Spark > Issue Type: Test > Components: Structured Streaming >Reporter: Burak Yavuz >Assignee: Apache Spark > > Example: > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3496/consoleFull
[jira] [Assigned] (SPARK-18868) Flaky Test: StreamingQueryListenerSuite
[ https://issues.apache.org/jira/browse/SPARK-18868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18868: Assignee: (was: Apache Spark) > Flaky Test: StreamingQueryListenerSuite > --- > > Key: SPARK-18868 > URL: https://issues.apache.org/jira/browse/SPARK-18868 > Project: Spark > Issue Type: Test > Components: Structured Streaming >Reporter: Burak Yavuz > > Example: > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3496/consoleFull
[jira] [Commented] (SPARK-18868) Flaky Test: StreamingQueryListenerSuite
[ https://issues.apache.org/jira/browse/SPARK-18868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15749915#comment-15749915 ] Apache Spark commented on SPARK-18868: -- User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/16287 > Flaky Test: StreamingQueryListenerSuite > --- > > Key: SPARK-18868 > URL: https://issues.apache.org/jira/browse/SPARK-18868 > Project: Spark > Issue Type: Test > Components: Structured Streaming >Reporter: Burak Yavuz > > Example: > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3496/consoleFull
[jira] [Assigned] (SPARK-18849) Vignettes final checks for Spark 2.1
[ https://issues.apache.org/jira/browse/SPARK-18849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18849: Assignee: Apache Spark (was: Felix Cheung) > Vignettes final checks for Spark 2.1 > > > Key: SPARK-18849 > URL: https://issues.apache.org/jira/browse/SPARK-18849 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Xiangrui Meng >Assignee: Apache Spark > > Make a final pass over the vignettes and ensure the content is consistent. > * remove "since version" because it is not that useful for vignettes > * re-order/group the list of ML algorithms so there exists a logical ordering > * check for warnings or errors in output messages > * anything else that seems out of place
[jira] [Assigned] (SPARK-18849) Vignettes final checks for Spark 2.1
[ https://issues.apache.org/jira/browse/SPARK-18849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18849: Assignee: Felix Cheung (was: Apache Spark) > Vignettes final checks for Spark 2.1 > > > Key: SPARK-18849 > URL: https://issues.apache.org/jira/browse/SPARK-18849 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Xiangrui Meng >Assignee: Felix Cheung > > Make a final pass over the vignettes and ensure the content is consistent. > * remove "since version" because it is not that useful for vignettes > * re-order/group the list of ML algorithms so there exists a logical ordering > * check for warnings or errors in output messages > * anything else that seems out of place
[jira] [Commented] (SPARK-18849) Vignettes final checks for Spark 2.1
[ https://issues.apache.org/jira/browse/SPARK-18849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15749906#comment-15749906 ] Apache Spark commented on SPARK-18849: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/16286 > Vignettes final checks for Spark 2.1 > > > Key: SPARK-18849 > URL: https://issues.apache.org/jira/browse/SPARK-18849 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Xiangrui Meng >Assignee: Felix Cheung > > Make a final pass over the vignettes and ensure the content is consistent. > * remove "since version" because it is not that useful for vignettes > * re-order/group the list of ML algorithms so there exists a logical ordering > * check for warnings or errors in output messages > * anything else that seems out of place
[jira] [Assigned] (SPARK-18854) getNodeNumbered and generateTreeString are not consistent
[ https://issues.apache.org/jira/browse/SPARK-18854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reassigned SPARK-18854: --- Assignee: Reynold Xin > getNodeNumbered and generateTreeString are not consistent > - > > Key: SPARK-18854 > URL: https://issues.apache.org/jira/browse/SPARK-18854 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.3, 2.1.0 > > > This is a bug introduced by subquery handling. generateTreeString numbers > trees including innerChildren (used to print subqueries), but getNodeNumbered > ignores that. As a result, getNodeNumbered is not always correct. > Repro: > {code} > val df = sql("select * from range(10) where id not in " + > "(select id from range(2) union all select id from range(2))") > println("---") > println(df.queryExecution.analyzed.numberedTreeString) > println("---") > println("---") > println(df.queryExecution.analyzed(3)) > println("---") > {code} > Output looks like > {noformat} > --- > 00 Project [id#1L] > 01 +- Filter NOT predicate-subquery#0 [(id#1L = id#2L)] > 02: +- Union > 03: :- Project [id#2L] > 04: : +- Range (0, 2, step=1, splits=None) > 05: +- Project [id#3L] > 06:+- Range (0, 2, step=1, splits=None) > 07+- Range (0, 10, step=1, splits=None) > --- > --- > null > --- > {noformat} > Note that 3 should be the Project node, but getNodeNumbered ignores > innerChild and as a result returns the wrong one.
[jira] [Resolved] (SPARK-18854) getNodeNumbered and generateTreeString are not consistent
[ https://issues.apache.org/jira/browse/SPARK-18854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18854. - Resolution: Fixed Fix Version/s: 2.1.0 2.0.3 Target Version/s: 2.0.3, 2.1.0 (was: 2.0.3, 2.1.1, 2.2.0) > getNodeNumbered and generateTreeString are not consistent > - > > Key: SPARK-18854 > URL: https://issues.apache.org/jira/browse/SPARK-18854 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin > Fix For: 2.0.3, 2.1.0 > > > This is a bug introduced by subquery handling. generateTreeString numbers > trees including innerChildren (used to print subqueries), but getNodeNumbered > ignores that. As a result, getNodeNumbered is not always correct. > Repro: > {code} > val df = sql("select * from range(10) where id not in " + > "(select id from range(2) union all select id from range(2))") > println("---") > println(df.queryExecution.analyzed.numberedTreeString) > println("---") > println("---") > println(df.queryExecution.analyzed(3)) > println("---") > {code} > Output looks like > {noformat} > --- > 00 Project [id#1L] > 01 +- Filter NOT predicate-subquery#0 [(id#1L = id#2L)] > 02: +- Union > 03: :- Project [id#2L] > 04: : +- Range (0, 2, step=1, splits=None) > 05: +- Project [id#3L] > 06:+- Range (0, 2, step=1, splits=None) > 07+- Range (0, 10, step=1, splits=None) > --- > --- > null > --- > {noformat} > Note that 3 should be the Project node, but getNodeNumbered ignores > innerChild and as a result returns the wrong one.
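[Editor's note] The inconsistency in SPARK-18854 boils down to the printer and the index lookup walking different node sequences. A toy model (not Spark's code; `PlanNode` and its members are invented for illustration) shows the invariant the fix needs: number and look up over one shared pre-order traversal that includes innerChildren, placed before ordinary children as in the repro output above.

```scala
// Toy plan node, *not* Spark's TreeNode: printing and lookup share one
// pre-order traversal that visits innerChildren (e.g. subqueries) before
// ordinary children, so numbers and lookups can never drift apart.
final case class PlanNode(
    name: String,
    children: Seq[PlanNode] = Nil,
    innerChildren: Seq[PlanNode] = Nil) {

  // The single traversal order used by both methods below.
  def preOrder: Seq[PlanNode] =
    this +: (innerChildren ++ children).flatMap(_.preOrder)

  // Prints "NN name" per node, numbering innerChildren too.
  def numberedTreeString: String =
    preOrder.zipWithIndex
      .map { case (n, i) => f"$i%02d ${n.name}" }
      .mkString("\n")

  // Consistent with numberedTreeString by construction.
  def getNodeNumbered(i: Int): Option[PlanNode] = preOrder.lift(i)
}
```

For example, a `Filter` whose innerChildren hold a `Union` subquery numbers the subquery's nodes before the `Filter`'s own child, matching the ordering shown in the issue's `numberedTreeString` output, and `getNodeNumbered` returns the same node the printout labels.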
[jira] [Assigned] (SPARK-18867) Throw cause if IsolatedClientLoad can't create client
[ https://issues.apache.org/jira/browse/SPARK-18867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18867: Assignee: Apache Spark > Throw cause if IsolatedClientLoad can't create client > - > > Key: SPARK-18867 > URL: https://issues.apache.org/jira/browse/SPARK-18867 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0, 2.0.0 > Environment: RStudio 1.0.44 + SparkR (Spark 2.0.2) >Reporter: Wei-Chiu Chuang >Assignee: Apache Spark >Priority: Minor > > If IsolatedClientLoader can't instantiate a class object, it throws > {{InvocationTargetException}}. But the caller doesn't need to know this > exception. Instead, it should throw the exception that causes the > {{InvocationTargetException}}, so that the caller may be able to handle it. > This exception is reproducible if I run the following code snippet in two > RStudio consoles without cleaning sessions. (This is an RStudio issue after > all, but in general it may be exhibited in other ways) > {code} > Sys.setenv(SPARK_HOME="/Users/weichiu/Downloads/spark-2.0.2-bin-hadoop2.7") > library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) > sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = > "2g")) > df <- as.DataFrame(faithful) > sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") > {code}
[jira] [Assigned] (SPARK-18867) Throw cause if IsolatedClientLoad can't create client
[ https://issues.apache.org/jira/browse/SPARK-18867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18867: Assignee: (was: Apache Spark) > Throw cause if IsolatedClientLoad can't create client > - > > Key: SPARK-18867 > URL: https://issues.apache.org/jira/browse/SPARK-18867 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0, 2.0.0 > Environment: RStudio 1.0.44 + SparkR (Spark 2.0.2) >Reporter: Wei-Chiu Chuang >Priority: Minor > > If IsolatedClientLoader can't instantiate a class object, it throws > {{InvocationTargetException}}. But the caller doesn't need to know this > exception. Instead, it should throw the exception that causes the > {{InvocationTargetException}}, so that the caller may be able to handle it. > This exception is reproducible if I run the following code snippet in two > RStudio consoles without cleaning sessions. (This is an RStudio issue after > all, but in general it may be exhibited in other ways) > {code} > Sys.setenv(SPARK_HOME="/Users/weichiu/Downloads/spark-2.0.2-bin-hadoop2.7") > library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) > sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = > "2g")) > df <- as.DataFrame(faithful) > sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") > {code}
[jira] [Commented] (SPARK-18867) Throw cause if IsolatedClientLoad can't create client
[ https://issues.apache.org/jira/browse/SPARK-18867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15749893#comment-15749893 ] Apache Spark commented on SPARK-18867: -- User 'jojochuang' has created a pull request for this issue: https://github.com/apache/spark/pull/16285 > Throw cause if IsolatedClientLoad can't create client > - > > Key: SPARK-18867 > URL: https://issues.apache.org/jira/browse/SPARK-18867 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0, 2.0.0 > Environment: RStudio 1.0.44 + SparkR (Spark 2.0.2) >Reporter: Wei-Chiu Chuang >Priority: Minor > > If IsolatedClientLoader can't instantiate a class object, it throws > {{InvocationTargetException}}. But the caller doesn't need to know this > exception. Instead, it should throw the exception that causes the > {{InvocationTargetException}}, so that the caller may be able to handle it. > This exception is reproducible if I run the following code snippet in two > RStudio consoles without cleaning sessions. (This is an RStudio issue after > all, but in general it may be exhibited in other ways) > {code} > Sys.setenv(SPARK_HOME="/Users/weichiu/Downloads/spark-2.0.2-bin-hadoop2.7") > library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) > sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = > "2g")) > df <- as.DataFrame(faithful) > sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") > {code}
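[Editor's note] The unwrapping idea behind SPARK-18867 can be sketched without any Spark code. `ReflectiveClient` and `FailingClient` below are invented names, not `IsolatedClientLoader`'s actual API: when reflective construction fails, `Constructor.newInstance` wraps the constructor's exception in `InvocationTargetException`, so rethrowing `getCause` lets callers match on the real failure.

```scala
import java.lang.reflect.InvocationTargetException

// Sketch of the proposed behavior, *not* Spark's IsolatedClientLoader:
// surface the underlying cause of a reflective construction failure
// instead of the reflection wrapper.
object ReflectiveClient {
  def instantiate[T](clazz: Class[T], args: AnyRef*): T =
    try {
      clazz.getConstructors.head.newInstance(args: _*).asInstanceOf[T]
    } catch {
      // Rethrow the root problem; leave the wrapper only if no cause is set.
      case e: InvocationTargetException if e.getCause != null =>
        throw e.getCause
    }
}

// Hypothetical client whose constructor fails, for demonstration only.
class FailingClient { throw new IllegalStateException("no metastore") }
```

A caller can then `catch` the concrete exception (here `IllegalStateException`) directly, rather than pattern-matching on `InvocationTargetException` and digging out the cause by hand.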