[jira] [Updated] (SPARK-6575) Add configuration to disable schema merging while converting metastore Parquet tables
[ https://issues.apache.org/jira/browse/SPARK-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-6575:
----------------------------
    Priority: Blocker  (was: Major)

> Add configuration to disable schema merging while converting metastore Parquet tables
> -------------------------------------------------------------------------------------
>
> Key: SPARK-6575
> URL: https://issues.apache.org/jira/browse/SPARK-6575
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.3.0
> Reporter: Cheng Lian
> Assignee: Cheng Lian
> Priority: Blocker
>
> Consider a metastore Parquet table that
> # doesn't have schema evolution issues
> # has lots of data files and/or partitions
>
> In this case, driver-side schema merging can be both slow and unnecessary. It would
> be good to have a configuration that lets the user disable schema merging when
> converting such a metastore Parquet table.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
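The cost the issue describes can be sketched in plain Python: driver-side merging reads one footer per file/partition and folds each file's schema into a running union, which is wasted work when every file already agrees. This is an illustrative sketch only; the helper names (`merge_two`, `merge_schemas`) and the dict-based schema representation are hypothetical, not Spark APIs.

```python
# Illustrative sketch of why driver-side schema merging is O(#files):
# every data file's footer contributes a schema that must be folded into
# a running union. Helper names here are hypothetical, not Spark APIs.

def merge_two(a, b):
    """Union two schemas, represented as {field_name: type} dicts."""
    merged = dict(a)
    for name, typ in b.items():
        if name in merged and merged[name] != typ:
            raise ValueError(f"incompatible types for {name!r}: {merged[name]} vs {typ}")
        merged[name] = typ
    return merged

def merge_schemas(file_schemas, merging_enabled=True):
    """With merging disabled, trust a single schema (e.g. the metastore's)."""
    if not merging_enabled:
        return dict(file_schemas[0])
    result = {}
    for schema in file_schemas:  # one footer read per file/partition
        result = merge_two(result, schema)
    return result

# A table with no schema evolution: merging buys nothing over trusting one file.
footers = [{"id": "int", "price": "double"}] * 10_000
assert merge_schemas(footers) == merge_schemas(footers, merging_enabled=False)
```

With a configuration flag like the one proposed, the fast path on the `merging_enabled=False` branch skips the per-file work entirely.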
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388128#comment-14388128 ]

Debasish Das commented on SPARK-5564:
-------------------------------------
[~sparks] We are trying to access the EC2 dataset, but it gives an error:

{code}
[ec2-user@ip-172-31-38-56 ~]$ aws s3 ls s3://files.sparks.requester.pays/enwiki_category_text/

A client error (AccessDenied) occurred when calling the ListObjects operation: Access Denied
{code}

Could you please take a look and check whether it is still available for use?

> Support sparse LDA solutions
> ----------------------------
>
> Key: SPARK-5564
> URL: https://issues.apache.org/jira/browse/SPARK-5564
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
>
> Latent Dirichlet Allocation (LDA) currently requires that the priors'
> concentration parameters be > 1.0. It should support values > 0.0, which
> should encourage sparser topics (phi) and document-topic distributions
> (theta).
>
> For EM, this will require adding a projection to the M-step, as in: Vorontsov
> and Potapenko. "Tutorial on Probabilistic Topic Modeling: Additive
> Regularization for Stochastic Matrix Factorization." 2014.
[jira] [Assigned] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form
[ https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-4550:
-----------------------------------
    Assignee: Apache Spark  (was: Sandy Ryza)

> In sort-based shuffle, store map outputs in serialized form
> -----------------------------------------------------------
>
> Key: SPARK-4550
> URL: https://issues.apache.org/jira/browse/SPARK-4550
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle, Spark Core
> Affects Versions: 1.2.0
> Reporter: Sandy Ryza
> Assignee: Apache Spark
> Priority: Critical
> Attachments: SPARK-4550-design-v1.pdf, kryo-flush-benchmark.scala
>
> One drawback of sort-based shuffle compared to hash-based shuffle is that
> it ends up storing many more Java objects in memory. If Spark could store
> map outputs in serialized form, it could:
> * spill less often, because the serialized form is more compact
> * reduce GC pressure
>
> This will only work when the serialized representations of objects are
> independent of each other and occupy contiguous segments of memory. E.g.
> when Kryo reference tracking is left on, objects may contain pointers to
> objects farther back in the stream, which means that the sort can't relocate
> objects without corrupting them.
[jira] [Assigned] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form
[ https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-4550:
-----------------------------------
    Assignee: Sandy Ryza  (was: Apache Spark)
[jira] [Commented] (SPARK-6627) Clean up of shuffle code and interfaces
[ https://issues.apache.org/jira/browse/SPARK-6627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388107#comment-14388107 ]

Apache Spark commented on SPARK-6627:
-------------------------------------
User 'pwendell' has created a pull request for this issue:
https://github.com/apache/spark/pull/5286

> Clean up of shuffle code and interfaces
> ---------------------------------------
>
> Key: SPARK-6627
> URL: https://issues.apache.org/jira/browse/SPARK-6627
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle, Spark Core
> Reporter: Patrick Wendell
> Assignee: Patrick Wendell
> Priority: Critical
>
> The shuffle code in Spark is somewhat messy and could use some interface
> clean-up, especially with some larger changes outstanding. This is a
> catch-all for what may be some small improvements in a few different PRs.
[jira] [Assigned] (SPARK-6627) Clean up of shuffle code and interfaces
[ https://issues.apache.org/jira/browse/SPARK-6627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-6627:
-----------------------------------
    Assignee: Apache Spark  (was: Patrick Wendell)
[jira] [Assigned] (SPARK-6627) Clean up of shuffle code and interfaces
[ https://issues.apache.org/jira/browse/SPARK-6627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-6627:
-----------------------------------
    Assignee: Patrick Wendell  (was: Apache Spark)
[jira] [Created] (SPARK-6627) Clean up of shuffle code and interfaces
Patrick Wendell created SPARK-6627:
-----------------------------------

    Summary: Clean up of shuffle code and interfaces
        Key: SPARK-6627
        URL: https://issues.apache.org/jira/browse/SPARK-6627
    Project: Spark
 Issue Type: Improvement
 Components: Shuffle, Spark Core
   Reporter: Patrick Wendell
   Assignee: Patrick Wendell
   Priority: Critical

The shuffle code in Spark is somewhat messy and could use some interface clean-up, especially with some larger changes outstanding. This is a catch-all for what may be some small improvements in a few different PRs.
[jira] [Commented] (SPARK-4514) SparkContext localProperties does not inherit property updates across thread reuse
[ https://issues.apache.org/jira/browse/SPARK-4514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388095#comment-14388095 ]

Josh Rosen commented on SPARK-4514:
-----------------------------------
I don't know that there's a good way to fix this for all of the arbitrary ways in which users might create or re-use threads. The inheritance behavior is somewhat more understandable in cases where users explicitly create child threads. Although our documentation doesn't explicitly promise that properties will be inherited, I think users may have come to rely on this behavior, so I don't think we can remove it at this point.

We can certainly fix it for the AsyncRDDActions case, though, because we can manually thread the properties through in the constructor. This pain could probably have been avoided if the original design had used something like Scala's {{DynamicVariable}}, where you're forced to explicitly consider the scope / lifecycle of the thread-local property.

I'm going to try to fix this for the AsyncRDDActions case and will try to improve the documentation to warn about this pitfall in the more general cases involving arbitrary user code. Let me know if you can spot another solution which won't break existing user code that relies on property inheritance in the non-thread-reuse cases.

> SparkContext localProperties does not inherit property updates across thread reuse
> ----------------------------------------------------------------------------------
>
> Key: SPARK-4514
> URL: https://issues.apache.org/jira/browse/SPARK-4514
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.1.0, 1.1.1, 1.2.0
> Reporter: Erik Erlandson
> Assignee: Josh Rosen
> Priority: Critical
>
> The current job group id of a Spark context is stored in the
> {{localProperties}} member value. This data structure is designed to be
> thread local, and its settings are not preserved when {{ComplexFutureAction}}
> instantiates a new {{Future}}.
>
> One consequence of this is that {{takeAsync()}} does not behave in the same
> way as other async actions, e.g. {{countAsync()}}. For example, this test
> (if copied into StatusTrackerSuite.scala) will fail, because
> {{"my-job-group2"}} is not propagated to the Future which actually
> instantiates the job:
> {code:java}
> test("getJobIdsForGroup() with takeAsync()") {
>   sc = new SparkContext("local", "test", new SparkConf(false))
>   sc.setJobGroup("my-job-group2", "description")
>   sc.statusTracker.getJobIdsForGroup("my-job-group2") should be (Seq.empty)
>
>   val firstJobFuture = sc.parallelize(1 to 1000, 1).takeAsync(1)
>   val firstJobId = eventually(timeout(10 seconds)) {
>     firstJobFuture.jobIds.head
>   }
>   eventually(timeout(10 seconds)) {
>     sc.statusTracker.getJobIdsForGroup("my-job-group2") should be (Seq(firstJobId))
>   }
> }
> {code}
> It also impacts the current PR for SPARK-1021, which involves additional uses
> of {{ComplexFutureAction}}.
[jira] [Updated] (SPARK-6625) Add common string filters to data sources
[ https://issues.apache.org/jira/browse/SPARK-6625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-6625:
------------------------------------
    Target Version/s: 1.3.1

> Add common string filters to data sources
> -----------------------------------------
>
> Key: SPARK-6625
> URL: https://issues.apache.org/jira/browse/SPARK-6625
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Reynold Xin
>
> Filters such as startsWith, endsWith, contains will be very useful for data
> sources that provide search functionality, e.g. Succinct, Elastic Search,
> Solr.
[jira] [Created] (SPARK-6626) TwitterUtils.createStream documentation error
Jayson Sunshine created SPARK-6626:
-----------------------------------

    Summary: TwitterUtils.createStream documentation error
        Key: SPARK-6626
        URL: https://issues.apache.org/jira/browse/SPARK-6626
    Project: Spark
 Issue Type: Documentation
 Components: Documentation
Affects Versions: 1.3.0
   Reporter: Jayson Sunshine
   Priority: Minor

At http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#input-dstreams-and-receivers, under 'Advanced Sources', the documentation gives the following call for Scala:

{code:java}
TwitterUtils.createStream(ssc)
{code}

However, the single-parameter overload of this method appears to require a jssc object, not an ssc object: http://spark.apache.org/docs/1.3.0/api/java/index.html?org/apache/spark/streaming/twitter/TwitterUtils.html

To make the above call work, one must instead also provide an Option argument, for example:

{code:java}
TwitterUtils.createStream(ssc, None)
{code}
[jira] [Updated] (SPARK-6625) Add common string filters to data sources
[ https://issues.apache.org/jira/browse/SPARK-6625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-6625:
-------------------------------
    Assignee: Reynold Xin
[jira] [Updated] (SPARK-6625) Add common string filters to data sources
[ https://issues.apache.org/jira/browse/SPARK-6625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-6625:
-------------------------------
    Description:
Filters such as startsWith, endsWith, contains will be very useful for data sources that provide search functionality, e.g. Succinct, Elastic Search, Solr.

  (was: Filters such as StartsWith, EndsWith, Contains will be very useful for search-like data sources such as Succinct, Elastic Search, Solr, etc.)
[jira] [Assigned] (SPARK-6625) Add common string filters to data sources
[ https://issues.apache.org/jira/browse/SPARK-6625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-6625:
-----------------------------------
    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-6625) Add common string filters to data sources
[ https://issues.apache.org/jira/browse/SPARK-6625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388039#comment-14388039 ]

Apache Spark commented on SPARK-6625:
-------------------------------------
User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5285
[jira] [Assigned] (SPARK-6625) Add common string filters to data sources
[ https://issues.apache.org/jira/browse/SPARK-6625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-6625:
-----------------------------------
    Assignee: Apache Spark
[jira] [Updated] (SPARK-6625) Add common string filters to data sources
[ https://issues.apache.org/jira/browse/SPARK-6625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-6625:
-------------------------------
    Description:
Filters such as StartsWith, EndsWith, Contains will be very useful for search-like data sources such as Succinct, Elastic Search, Solr, etc.

  (was: Filters such as StartsWith, EndsWith, Like (with regex) will be very useful for search-like data sources such as Succinct, Elastic Search, Solr, etc.)
[jira] [Assigned] (SPARK-6623) Alias DataFrame.na.fill/drop in Python
[ https://issues.apache.org/jira/browse/SPARK-6623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-6623:
-----------------------------------
    Assignee: (was: Apache Spark)

> Alias DataFrame.na.fill/drop in Python
> --------------------------------------
>
> Key: SPARK-6623
> URL: https://issues.apache.org/jira/browse/SPARK-6623
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
>
> To be more consistent with Scala.
[jira] [Commented] (SPARK-6623) Alias DataFrame.na.fill/drop in Python
[ https://issues.apache.org/jira/browse/SPARK-6623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388026#comment-14388026 ]

Apache Spark commented on SPARK-6623:
-------------------------------------
User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5284
[jira] [Assigned] (SPARK-6623) Alias DataFrame.na.fill/drop in Python
[ https://issues.apache.org/jira/browse/SPARK-6623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-6623:
-----------------------------------
    Assignee: Apache Spark
[jira] [Commented] (SPARK-6258) Python MLlib API missing items: Clustering
[ https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388003#comment-14388003 ]

Hrishikesh commented on SPARK-6258:
-----------------------------------
[~josephkb] Thank you for your response and valuable suggestions! I will send the PR as soon as possible.

> Python MLlib API missing items: Clustering
> ------------------------------------------
>
> Key: SPARK-6258
> URL: https://issues.apache.org/jira/browse/SPARK-6258
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib, PySpark
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
>
> This JIRA lists items missing in the Python API for this sub-package of MLlib.
> The list may be incomplete, so please check again when sending a PR to add
> these features to the Python API.
>
> Also, please check for major disparities in documentation; some parts of
> the Python API are less well documented than their Scala counterparts. Some
> items may be listed in the umbrella JIRA linked to this task.
>
> KMeans
> * setEpsilon
> * setInitializationSteps
>
> KMeansModel
> * computeCost
> * k
>
> GaussianMixture
> * setInitialModel
>
> GaussianMixtureModel
> * k
>
> Completely missing items, which should be fixed in separate JIRAs (which have
> been created and linked to the umbrella JIRA):
> * LDA
> * PowerIterationClustering
> * StreamingKMeans
[jira] [Commented] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388002#comment-14388002 ]

Apache Spark commented on SPARK-5124:
-------------------------------------
User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/5283

> Standardize internal RPC interface
> ----------------------------------
>
> Key: SPARK-5124
> URL: https://issues.apache.org/jira/browse/SPARK-5124
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Reporter: Reynold Xin
> Assignee: Shixiong Zhu
> Fix For: 1.4.0
> Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf
>
> In Spark we use Akka as the RPC layer. It would be great if we could
> standardize the internal RPC interface to facilitate testing. This would also
> provide the foundation for trying other RPC implementations in the future.
[jira] [Commented] (SPARK-6612) Python KMeans parity
[ https://issues.apache.org/jira/browse/SPARK-6612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388000#comment-14388000 ]

Hrishikesh commented on SPARK-6612:
-----------------------------------
Please assign this ticket to me.

> Python KMeans parity
> --------------------
>
> Key: SPARK-6612
> URL: https://issues.apache.org/jira/browse/SPARK-6612
> Project: Spark
> Issue Type: Improvement
> Components: MLlib, PySpark
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
> Priority: Minor
>
> This is a subtask of [SPARK-6258] for the Python API of KMeans. These items
> are missing:
>
> KMeans
> * setEpsilon
> * setInitializationSteps
>
> KMeansModel
> * computeCost
> * k
[jira] [Assigned] (SPARK-3454) Expose JSON representation of data shown in WebUI
[ https://issues.apache.org/jira/browse/SPARK-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-3454:
-----------------------------------
    Assignee: Imran Rashid  (was: Apache Spark)

> Expose JSON representation of data shown in WebUI
> -------------------------------------------------
>
> Key: SPARK-3454
> URL: https://issues.apache.org/jira/browse/SPARK-3454
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 1.1.0
> Reporter: Kousuke Saruta
> Assignee: Imran Rashid
> Attachments: sparkmonitoringjsondesign.pdf
>
> If the WebUI supported extracting data in JSON format, it would be helpful for
> users who want to analyse stage / task / executor information.
> Fortunately, the WebUI has a renderJson method, so we can implement the method
> in each subclass.
[jira] [Assigned] (SPARK-3454) Expose JSON representation of data shown in WebUI
[ https://issues.apache.org/jira/browse/SPARK-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-3454:
-----------------------------------
    Assignee: Apache Spark  (was: Imran Rashid)
[jira] [Created] (SPARK-6625) Add common string filters to data sources
Reynold Xin created SPARK-6625:
-------------------------------

    Summary: Add common string filters to data sources
        Key: SPARK-6625
        URL: https://issues.apache.org/jira/browse/SPARK-6625
    Project: Spark
 Issue Type: Sub-task
 Components: SQL
   Reporter: Reynold Xin

Filters such as StartsWith, EndsWith, Like (with regex) will be very useful for search-like data sources such as Succinct, Elastic Search, Solr, etc.
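Filters pushed to a data source are essentially predicates over row attributes. The sketch below shows what such string filters might look like as plain predicate objects; the class names follow the issue text, but the `eval` method and dict-based rows are hypothetical illustrations, not Spark's actual sources API.

```python
# Hedged sketch of source-level string filters as plain predicates.
# Class names follow the issue text; eval() and dict rows are illustrative.
from dataclasses import dataclass

@dataclass
class StringStartsWith:
    attribute: str
    value: str
    def eval(self, row):
        return row[self.attribute].startswith(self.value)

@dataclass
class StringEndsWith:
    attribute: str
    value: str
    def eval(self, row):
        return row[self.attribute].endswith(self.value)

@dataclass
class StringContains:
    attribute: str
    value: str
    def eval(self, row):
        return self.value in row[self.attribute]

rows = [{"name": "spark-sql"}, {"name": "solr"}, {"name": "succinct"}]
assert [r["name"] for r in rows if StringContains("name", "sql").eval(r)] == ["spark-sql"]
assert [r["name"] for r in rows if StringStartsWith("name", "s").eval(r)] == ["spark-sql", "solr", "succinct"]
```

A search-capable source (Succinct, Elasticsearch, Solr) would translate such predicates into its native query language instead of evaluating them row by row.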
[jira] [Created] (SPARK-6624) Convert filters into CNF for data sources
Reynold Xin created SPARK-6624:
-------------------------------

    Summary: Convert filters into CNF for data sources
        Key: SPARK-6624
        URL: https://issues.apache.org/jira/browse/SPARK-6624
    Project: Spark
 Issue Type: Sub-task
   Reporter: Reynold Xin

We should turn filters into conjunctive normal form (CNF) before we pass them to data sources. Otherwise, filters are not very useful if there is a single filter with a bunch of ORs.

Note that we already try to do some of this in BooleanSimplification, but I think we should just formalize it to use CNF.
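The core of a CNF rewrite is distributing OR over AND until every AND sits at the top of the tree, so each conjunct can be pushed to the source independently. Here is a minimal sketch over tuple-encoded expressions; it assumes negations are already at the leaves (NNF) and is illustrative, not Spark's optimizer code.

```python
# Minimal CNF rewrite by distributing OR over AND.
# Expressions: ('and', l, r) | ('or', l, r) | leaf (any non-tuple value).
# Assumes negations are already pushed to the leaves (NNF).

def to_cnf(expr):
    if not isinstance(expr, tuple):
        return expr
    op, l, r = expr
    l, r = to_cnf(l), to_cnf(r)
    if op == 'and':
        return ('and', l, r)
    # op == 'or': distribute over any AND child so ANDs bubble to the top.
    if isinstance(l, tuple) and l[0] == 'and':
        return ('and', to_cnf(('or', l[1], r)), to_cnf(('or', l[2], r)))
    if isinstance(r, tuple) and r[0] == 'and':
        return ('and', to_cnf(('or', l, r[1])), to_cnf(('or', l, r[2])))
    return ('or', l, r)

# (a AND b) OR c  ->  (a OR c) AND (b OR c): two independently pushable conjuncts.
assert to_cnf(('or', ('and', 'a', 'b'), 'c')) == ('and', ('or', 'a', 'c'), ('or', 'b', 'c'))
```

Note the usual caveat: naive distribution can blow up exponentially on deeply nested disjunctions, which is one reason to cap or formalize the rewrite rather than apply it blindly.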
[jira] [Created] (SPARK-6623) Alias DataFrame.na.fill/drop in Python
Reynold Xin created SPARK-6623:
-------------------------------

    Summary: Alias DataFrame.na.fill/drop in Python
        Key: SPARK-6623
        URL: https://issues.apache.org/jira/browse/SPARK-6623
    Project: Spark
 Issue Type: Sub-task
   Reporter: Reynold Xin

To be more consistent with Scala.
[jira] [Updated] (SPARK-6622) Spark SQL cannot communicate with Hive meta store
[ https://issues.apache.org/jira/browse/SPARK-6622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Deepak Kumar V updated SPARK-6622:
----------------------------------
    Description:

I have multiple tables (among them dw_bid) that were created through Apache Hive. I have data in Avro on HDFS that I want to join with the dw_bid table; this join needs to be done using Spark SQL. Spark SQL is unable to communicate with the Apache Hive metastore and fails with this exception:

{code}
org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:mysql://hostname.vip.company.com:3306/HDB, username = hiveuser. Terminating connection pool (set lazyInit to true if you expect to start your database after your app).
Original Exception: ------
java.sql.SQLException: No suitable driver found for jdbc:mysql://hostname.vip.company.com:3306/HDB
	at java.sql.DriverManager.getConnection(DriverManager.java:596)
{code}

Spark Submit Command:

{code}
./bin/spark-submit -v --master yarn-cluster --driver-class-path /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar --jars /apache/hadoop/lib/hadoop-lzo-0.6.0.jar,/home/dvasthimal/spark1.3/mysql-connector-java-5.1.35-bin.jar,/home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar,$SPARK_HOME/conf/hive-site.xml --num-executors 1 --driver-memory 4g --driver-java-options "-XX:MaxPermSize=2G" --executor-memory 2g --executor-cores 1 --queue hdmi-express --class com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar startDate=2015-02-16 endDate=2015-02-16 input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2
{code}

MySQL Java Connector versions tried:
* mysql-connector-java-5.0.8-bin.jar (picked from the Apache Hive installation lib folder)
* mysql-connector-java-5.1.34.jar
* mysql-connector-java-5.1.35.jar

Spark version: 1.3.0 - Prebuilt for Hadoop 2.4.x (http://d3kbcqa49mib13.cloudfront.net/spark-1.3.0-bin-hadoop2.4.tgz)

{code}
$ hive --version
Hive 0.13.0.2.1.3.6-2
Subversion git://ip-10-0-0-90.ec2.internal/grid/0/jenkins/workspace/BIGTOP-HDP_RPM_REPO-HDP-2.1.3.6-centos6/bigtop/build/hive/rpm/BUILD/hive-0.13.0.2.1.3.6 -r 87da9430050fb9cc429d79d95626d26ea382b96c
{code}
[jira] [Updated] (SPARK-6603) SQLContext.registerFunction -> SQLContext.udf.register
[ https://issues.apache.org/jira/browse/SPARK-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6603: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-6116 > SQLContext.registerFunction -> SQLContext.udf.register > -- > > Key: SPARK-6603 > URL: https://issues.apache.org/jira/browse/SPARK-6603 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Davies Liu > Fix For: 1.3.1, 1.4.0 > > > We didn't change the Python implementation to use that. Maybe the best > strategy is to deprecate SQLContext.registerFunction, and just add > SQLContext.udf.register. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
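The deprecation strategy proposed in the ticket can be sketched in plain Python (class and method names mirror the ticket, but this is illustrative, not PySpark's actual implementation):

```python
import warnings

class UDFRegistration:
    """Holds registered user-defined functions."""
    def __init__(self):
        self.funcs = {}

    def register(self, name, f):
        self.funcs[name] = f
        return f

class SQLContext:
    """Keep registerFunction as a deprecated alias that forwards to
    udf.register, per the strategy suggested above."""
    def __init__(self):
        self.udf = UDFRegistration()

    def registerFunction(self, name, f):
        warnings.warn("registerFunction is deprecated; use udf.register",
                      DeprecationWarning)
        return self.udf.register(name, f)
```

Existing callers keep working but get a deprecation warning, while new code registers through `udf.register` directly.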
[jira] [Updated] (SPARK-6622) Spark SQL cannot communicate with Hive meta store
[ https://issues.apache.org/jira/browse/SPARK-6622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Deepak Kumar V updated SPARK-6622: -- Attachment: exception.txt Full stack trace > Spark SQL cannot communicate with Hive meta store > - > > Key: SPARK-6622 > URL: https://issues.apache.org/jira/browse/SPARK-6622 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.3.0 >Reporter: Deepak Kumar V > Labels: Hive > Attachments: exception.txt > > > I have multiple tables (among them is dw_bid) that are created through Apache > Hive. I have data in avro on HDFS that i want to join with dw_bid table, > this join needs to be done using Spark SQL. > Spark SQL is unable to communicate with Apache Hive Meta store and fails with > exception > org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test > connection to the given database. JDBC url = > jdbc:mysql://hostname.vip.company.com:3306/HDB, username = hiveuser. > Terminating connection pool (set lazyInit to true if you expect to start your > database after your app). Original Exception: -- > java.sql.SQLException: No suitable driver found for > jdbc:mysql://hostname.vip. 
company.com:3306/HDB > at java.sql.DriverManager.getConnection(DriverManager.java:596) > Spark Submit Command > ./bin/spark-submit -v --master yarn-cluster --driver-class-path > /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar > --jars > /apache/hadoop/lib/hadoop-lzo-0.6.0.jar,/home/dvasthimal/spark1.3/mysql-connector-java-5.1.35-bin.jar,/home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar,$SPARK_HOME/conf/hive-site.xml > --num-executors 1 --driver-memory 4g --driver-java-options > "-XX:MaxPermSize=2G" --executor-memory 2g --executor-cores 1 --queue > hdmi-express --class com.ebay.ep.poc.spark.reporting.SparkApp > spark_reporting-1.0-SNAPSHOT.jar startDate=2015-02-16 endDate=2015-02-16 > input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro > subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2 > MySQL Java Conector Versions tried > mysql-connector-java-5.0.8-bin.jar (Picked from Apache Hive installation lib > folder) > mysql-connector-java-5.1.34.jar > mysql-connector-java-5.1.35.jar > Spark Version: 1.3.0 - Prebuilt for Hadoop 2.4.x > (http://d3kbcqa49mib13.cloudfront.net/spark-1.3.0-bin-hadoop2.4.tgz) > $ hive --version > Hive 0.13.0.2.1.3.6-2 > Subversion > git://ip-10-0-0-90.ec2.internal/grid/0/jenkins/workspace/BIGTOP-HDP_RPM_REPO-HDP-2.1.3.6-centos6/bigtop/build/hive/rpm/BUILD/hive-0.13.0.2.1.3.6 > -r 87da9430050fb9cc429d79d95626d26ea382b96c > $ > Code: > package com.ebay.ep.poc.spark.reporting.process.service > import com.ebay.ep.poc.spark.reporting.process.util.DateUtil._ > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.SparkContext._ 
> import collection.mutable.HashMap > import com.databricks.spark.avro._ > class HadoopSuccessEvents2Service extends ReportingService { > override def execute(arguments: HashMap[String, String], sc: SparkContext) { > val detail = "reporting.detail." + arguments.get("subcommand").get > val startDate = arguments.get("startDate").get > val endDate = arguments.get("endDate").get > val input = arguments.get("input").get > val output = arguments.get("output").get > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > val successDetail_S1 = sqlContext.avroFile(input) > successDetail_S1.registerTempTable("sojsuccessevents1") > > println("show tables") > sqlContext.sql("show tables") > println("show tables") > sqlContext.sql("CREATE TABLE `sojsuccessevents2_spark`( `guid` string > COMMENT 'from deserializer', `sessionkey` bigint COMMENT 'from deserializer', > `sessionstartdate` string COMMENT 'from deserializer', `sojdatadate` string > COMMENT 'from deserializer', `seqnum` int COMMENT 'from deserializer', > `eventtimestamp` string COMMENT 'from deserializer', `siteid` int COMMENT > 'from deserializer', `successeventtype` string COMMENT 'from deserializer', > `sourcetype` string COMMENT 'from deserializer', `itemid` bigint COMMENT > 'from deserializer', `shopcartid` bigint COMMENT 'from deserializer', > `transactionid` bigint COMMENT 'from deserializer', `offerid` bigint COMMENT > 'from deserializer', `userid` bigint COMMENT 'from deserializer', > `priorpage1seqnum` int COMMENT 'f
[jira] [Created] (SPARK-6622) Spark SQL cannot communicate with Hive meta store
Deepak Kumar V created SPARK-6622: - Summary: Spark SQL cannot communicate with Hive meta store Key: SPARK-6622 URL: https://issues.apache.org/jira/browse/SPARK-6622 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 1.3.0 Reporter: Deepak Kumar V I have multiple tables (among them is dw_bid) that are created through Apache Hive. I have data in avro on HDFS that i want to join with dw_bid table, this join needs to be done using Spark SQL. Spark SQL is unable to communicate with Apache Hive Meta store and fails with exception org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:mysql://hostname.vip.company.com:3306/HDB, username = hiveuser. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: -- java.sql.SQLException: No suitable driver found for jdbc:mysql://hostname.vip. company.com:3306/HDB at java.sql.DriverManager.getConnection(DriverManager.java:596) Spark Submit Command ./bin/spark-submit -v --master yarn-cluster --driver-class-path /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar --jars /apache/hadoop/lib/hadoop-lzo-0.6.0.jar,/home/dvasthimal/spark1.3/mysql-connector-java-5.1.35-bin.jar,/home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar,$SPARK_HOME/conf/hive-site.xml --num-executors 1 --driver-memory 4g --driver-java-options "-XX:MaxPermSize=2G" --executor-memory 2g --executor-cores 1 --queue hdmi-express --class com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar startDate=2015-02-16 endDate=2015-02-16 
input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2 MySQL Java Conector Versions tried mysql-connector-java-5.0.8-bin.jar (Picked from Apache Hive installation lib folder) mysql-connector-java-5.1.34.jar mysql-connector-java-5.1.35.jar Spark Version: 1.3.0 - Prebuilt for Hadoop 2.4.x (http://d3kbcqa49mib13.cloudfront.net/spark-1.3.0-bin-hadoop2.4.tgz) $ hive --version Hive 0.13.0.2.1.3.6-2 Subversion git://ip-10-0-0-90.ec2.internal/grid/0/jenkins/workspace/BIGTOP-HDP_RPM_REPO-HDP-2.1.3.6-centos6/bigtop/build/hive/rpm/BUILD/hive-0.13.0.2.1.3.6 -r 87da9430050fb9cc429d79d95626d26ea382b96c $ Code: package com.ebay.ep.poc.spark.reporting.process.service import com.ebay.ep.poc.spark.reporting.process.util.DateUtil._ import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import collection.mutable.HashMap import com.databricks.spark.avro._ class HadoopSuccessEvents2Service extends ReportingService { override def execute(arguments: HashMap[String, String], sc: SparkContext) { val detail = "reporting.detail." 
+ arguments.get("subcommand").get val startDate = arguments.get("startDate").get val endDate = arguments.get("endDate").get val input = arguments.get("input").get val output = arguments.get("output").get val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) val successDetail_S1 = sqlContext.avroFile(input) successDetail_S1.registerTempTable("sojsuccessevents1") println("show tables") sqlContext.sql("show tables") println("show tables") sqlContext.sql("CREATE TABLE `sojsuccessevents2_spark`( `guid` string COMMENT 'from deserializer', `sessionkey` bigint COMMENT 'from deserializer', `sessionstartdate` string COMMENT 'from deserializer', `sojdatadate` string COMMENT 'from deserializer', `seqnum` int COMMENT 'from deserializer', `eventtimestamp` string COMMENT 'from deserializer', `siteid` int COMMENT 'from deserializer', `successeventtype` string COMMENT 'from deserializer', `sourcetype` string COMMENT 'from deserializer', `itemid` bigint COMMENT 'from deserializer', `shopcartid` bigint COMMENT 'from deserializer', `transactionid` bigint COMMENT 'from deserializer', `offerid` bigint COMMENT 'from deserializer', `userid` bigint COMMENT 'from deserializer', `priorpage1seqnum` int COMMENT 'from deserializer', `priorpage1pageid` int COMMENT 'from deserializer', `exclwmsearchattemptseqnum` int COMMENT 'from deserializer', `exclpriorsearchpageid` int COMMENT 'from deserializer', `exclpriorsearchseqnum` int COMMENT 'from deserializer', `exclpriorsearchcategory` int COMMENT 'from deserializer', `exclpriorsearchl1` int COMMENT 'from deserializer', `exclpriorsearchl2` int COMMENT 'fro
[jira] [Commented] (SPARK-6562) DataFrame.na.replace value support
[ https://issues.apache.org/jira/browse/SPARK-6562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387933#comment-14387933 ] Apache Spark commented on SPARK-6562: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/5282 > DataFrame.na.replace value support > -- > > Key: SPARK-6562 > URL: https://issues.apache.org/jira/browse/SPARK-6562 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > Support replacing a set of values with another set of values (i.e. map join), > similar to Pandas' replace. > http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6562) DataFrame.na.replace value support
[ https://issues.apache.org/jira/browse/SPARK-6562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6562: --- Assignee: (was: Apache Spark) > DataFrame.na.replace value support > -- > > Key: SPARK-6562 > URL: https://issues.apache.org/jira/browse/SPARK-6562 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > Support replacing a set of values with another set of values (i.e. map join), > similar to Pandas' replace. > http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6562) DataFrame.na.replace value support
[ https://issues.apache.org/jira/browse/SPARK-6562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6562: --- Assignee: Apache Spark > DataFrame.na.replace value support > -- > > Key: SPARK-6562 > URL: https://issues.apache.org/jira/browse/SPARK-6562 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > Support replacing a set of values with another set of values (i.e. map join), > similar to Pandas' replace. > http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6562) DataFrame.na.replace value support
[ https://issues.apache.org/jira/browse/SPARK-6562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6562: --- Summary: DataFrame.na.replace value support (was: DataFrame.replace value support) > DataFrame.na.replace value support > -- > > Key: SPARK-6562 > URL: https://issues.apache.org/jira/browse/SPARK-6562 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > Support replacing a set of values with another set of values (i.e. map join), > similar to Pandas' replace. > http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
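The requested semantics — replacing one set of values with another, like a map join — can be sketched over plain tuples (illustrative only; the real API lives on `DataFrame.na`):

```python
def na_replace(rows, mapping):
    """Replace each cell that appears as a key in `mapping` with the mapped
    value, leaving all other cells untouched."""
    return [tuple(mapping.get(v, v) for v in row) for row in rows]

rows = [("UNKNOWN", 1), ("US", 2)]
print(na_replace(rows, {"UNKNOWN": None}))  # [(None, 1), ('US', 2)]
```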
[jira] [Assigned] (SPARK-6618) HiveMetastoreCatalog.lookupRelation should use fine-grained lock
[ https://issues.apache.org/jira/browse/SPARK-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6618: --- Assignee: Apache Spark (was: Yin Huai) > HiveMetastoreCatalog.lookupRelation should use fine-grained lock > > > Key: SPARK-6618 > URL: https://issues.apache.org/jira/browse/SPARK-6618 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Blocker > > Right now the entire method of HiveMetastoreCatalog.lookupRelation has a lock > (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L173) > and the scope of lock will cover resolving data source tables > (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L93). > So, lookupRelation can be extremely expensive when we are doing expensive > operations like parquet schema discovery. So, we should use fine-grained lock > for lookupRelation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6618) HiveMetastoreCatalog.lookupRelation should use fine-grained lock
[ https://issues.apache.org/jira/browse/SPARK-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6618: --- Assignee: Yin Huai (was: Apache Spark) > HiveMetastoreCatalog.lookupRelation should use fine-grained lock > > > Key: SPARK-6618 > URL: https://issues.apache.org/jira/browse/SPARK-6618 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > > Right now the entire method of HiveMetastoreCatalog.lookupRelation has a lock > (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L173) > and the scope of lock will cover resolving data source tables > (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L93). > So, lookupRelation can be extremely expensive when we are doing expensive > operations like parquet schema discovery. So, we should use fine-grained lock > for lookupRelation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6618) HiveMetastoreCatalog.lookupRelation should use fine-grained lock
[ https://issues.apache.org/jira/browse/SPARK-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387924#comment-14387924 ] Apache Spark commented on SPARK-6618: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/5281 > HiveMetastoreCatalog.lookupRelation should use fine-grained lock > > > Key: SPARK-6618 > URL: https://issues.apache.org/jira/browse/SPARK-6618 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > > Right now the entire method of HiveMetastoreCatalog.lookupRelation has a lock > (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L173) > and the scope of lock will cover resolving data source tables > (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L93). > So, lookupRelation can be extremely expensive when we are doing expensive > operations like parquet schema discovery. So, we should use fine-grained lock > for lookupRelation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
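The fine-grained locking idea can be sketched in plain Python (this is not Spark's actual `HiveMetastoreCatalog` code; names are illustrative): do the expensive resolution outside the lock, and hold the lock only for short cache reads and writes.

```python
import threading

class TableCache:
    """Cache of resolved relations with a fine-grained lock: the expensive
    resolve step runs without the lock held."""
    def __init__(self, resolve):
        self._resolve = resolve          # expensive, side-effect-free resolution
        self._cache = {}
        self._lock = threading.Lock()

    def lookup(self, name):
        with self._lock:                 # short critical section: cache read
            if name in self._cache:
                return self._cache[name]
        relation = self._resolve(name)   # expensive work, no lock held
        with self._lock:                 # short critical section: cache write
            return self._cache.setdefault(name, relation)

cache = TableCache(lambda name: ("relation", name))
print(cache.lookup("t"))  # ('relation', 't')
```

`setdefault` makes concurrent racers converge on one cached value; the cost is that two threads may resolve the same name once each, which is the usual trade for not serializing all lookups behind one coarse lock.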
[jira] [Assigned] (SPARK-6555) Override equals and hashCode in MetastoreRelation
[ https://issues.apache.org/jira/browse/SPARK-6555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian reassigned SPARK-6555: - Assignee: Cheng Lian > Override equals and hashCode in MetastoreRelation > - > > Key: SPARK-6555 > URL: https://issues.apache.org/jira/browse/SPARK-6555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > This is a follow-up of SPARK-6450. > As explained in [this > comment|https://issues.apache.org/jira/browse/SPARK-6450?focusedCommentId=14379499&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14379499] > of SPARK-6450, we resorted to a more surgical fix due to the upcoming 1.3.1 > release. But overriding {{equals}} and {{hashCode}} is the proper fix to that > problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6573) expect pandas null values as numpy.nan (not only as None)
[ https://issues.apache.org/jira/browse/SPARK-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387921#comment-14387921 ] Reynold Xin commented on SPARK-6573: Are numpy.nan turned into Double.NaN in the JVM? If yes, maybe we should consider all NaN numbers as null in the JVM. > expect pandas null values as numpy.nan (not only as None) > - > > Key: SPARK-6573 > URL: https://issues.apache.org/jira/browse/SPARK-6573 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.3.0 >Reporter: Fabian Boehnlein > > In pandas it is common to use numpy.nan as the null value, for missing data > or whatever. > http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions > http://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none > http://pandas.pydata.org/pandas-docs/dev/missing_data.html#filling-missing-values-fillna > createDataFrame however only works with None as null values, parsing them as > None in the RDD. > I suggest to add support for np.nan values in pandas DataFrames. 
> current stack trace when calling createDataFrame with object-type columns with > np.nan values (which are floats) > {code} > TypeError Traceback (most recent call last) > in () > > 1 sqldf = sqlCtx.createDataFrame(df_, schema=schema) > /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in > createDataFrame(self, data, schema, samplingRatio) > 339 schema = self._inferSchema(data.map(lambda r: > row_cls(*r)), samplingRatio) > 340 > --> 341 return self.applySchema(data, schema) > 342 > 343 def registerDataFrameAsTable(self, rdd, tableName): > /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in > applySchema(self, rdd, schema) > 246 > 247 for row in rows: > --> 248 _verify_type(row, schema) > 249 > 250 # convert python objects to sql data > /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in > _verify_type(obj, dataType) >1064 "length of fields (%d)" % (len(obj), > len(dataType.fields))) >1065 for v, f in zip(obj, dataType.fields): > -> 1066 _verify_type(v, f.dataType) >1067 >1068 _cached_cls = weakref.WeakValueDictionary() > /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in > _verify_type(obj, dataType) >1048 if type(obj) not in _acceptable_types[_type]: >1049 raise TypeError("%s can not accept object in type %s" > -> 1050 % (dataType, type(obj))) >1051 >1052 if isinstance(dataType, ArrayType): > TypeError: StringType can not accept object in type {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
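Until such support lands, the ticket's problem can be worked around by normalizing `np.nan` cells to `None` before calling `createDataFrame`. A minimal pure-Python sketch (using `math.isnan`, which matches `numpy.nan` since it is a plain float; the function name is illustrative):

```python
import math

def nan_to_none(rows):
    """Replace float NaN cells with None so they can be parsed as SQL nulls."""
    def fix(v):
        return None if isinstance(v, float) and math.isnan(v) else v
    return [tuple(fix(v) for v in row) for row in rows]

rows = [("a", 1.0), ("b", float("nan"))]
print(nan_to_none(rows))  # [('a', 1.0), ('b', None)]
```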
[jira] [Resolved] (SPARK-6119) DataFrame.dropna support
[ https://issues.apache.org/jira/browse/SPARK-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-6119. Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Assignee: Reynold Xin > DataFrame.dropna support > > > Key: SPARK-6119 > URL: https://issues.apache.org/jira/browse/SPARK-6119 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: DataFrame > Fix For: 1.3.1, 1.4.0 > > > Support dropping rows with null values (dropna). Similar to Pandas' dropna > http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.dropna.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
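The dropna semantics referenced in the ticket can be sketched over plain tuples (illustrative, not the DataFrame implementation): `how='any'` drops a row containing any null, `how='all'` drops only rows that are entirely null.

```python
def dropna(rows, how="any"):
    """Drop rows containing nulls, mirroring Pandas' DataFrame.dropna
    semantics for plain tuples."""
    if how == "any":
        return [r for r in rows if all(v is not None for v in r)]
    return [r for r in rows if any(v is not None for v in r)]

rows = [(1, 2), (1, None), (None, None)]
print(dropna(rows))              # [(1, 2)]
print(dropna(rows, how="all"))   # [(1, 2), (1, None)]
```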
[jira] [Resolved] (SPARK-6563) DataFrame.fillna
[ https://issues.apache.org/jira/browse/SPARK-6563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-6563. Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Assignee: Reynold Xin > DataFrame.fillna > > > Key: SPARK-6563 > URL: https://issues.apache.org/jira/browse/SPARK-6563 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.3.1, 1.4.0 > > > Support replacing all null value for a column (or all columns) with a fixed > value. > Similar to Pandas' fillna. > http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.fillna.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
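The fillna behavior described above, sketched over plain tuples (illustrative only):

```python
def fillna(rows, value):
    """Replace every null cell with a fixed value, as in Pandas' fillna."""
    return [tuple(value if v is None else v for v in row) for row in rows]

print(fillna([(1, None), (None, 2)], 0))  # [(1, 0), (0, 2)]
```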
[jira] [Assigned] (SPARK-6621) Calling EventLoop.stop in EventLoop.onReceive and EventLoop.onError should call onStop
[ https://issues.apache.org/jira/browse/SPARK-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6621: --- Assignee: Apache Spark > Calling EventLoop.stop in EventLoop.onReceive and EventLoop.onError should > call onStop > -- > > Key: SPARK-6621 > URL: https://issues.apache.org/jira/browse/SPARK-6621 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6621) Calling EventLoop.stop in EventLoop.onReceive and EventLoop.onError should call onStop
[ https://issues.apache.org/jira/browse/SPARK-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387911#comment-14387911 ] Apache Spark commented on SPARK-6621: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/5280 > Calling EventLoop.stop in EventLoop.onReceive and EventLoop.onError should > call onStop > -- > > Key: SPARK-6621 > URL: https://issues.apache.org/jira/browse/SPARK-6621 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6621) Calling EventLoop.stop in EventLoop.onReceive and EventLoop.onError should call onStop
[ https://issues.apache.org/jira/browse/SPARK-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6621: --- Assignee: (was: Apache Spark) > Calling EventLoop.stop in EventLoop.onReceive and EventLoop.onError should > call onStop > -- > > Key: SPARK-6621 > URL: https://issues.apache.org/jira/browse/SPARK-6621 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5456) Decimal Type comparison issue
[ https://issues.apache.org/jira/browse/SPARK-5456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387904#comment-14387904 ] Kuldeep commented on SPARK-5456: [~karthikg01] 1) Switch to the Hive context. I am not trying to deride the plain SQL context, but the Hive context is just better tested and has a well-defined syntax borrowed from Hive. 2) Even in the Hive context I have faced problems with BigDecimals, so like your workaround I also convert BigDecimals to a double (not an int). For all practical purposes it is more than enough; I have not seen many data sources with those types. An RDBMS maps the `NUMERIC` type to BigDecimal in JDBC, but you can always work around this by having a simple map transformation before you register the data with the SQL context. 2 cents. > Decimal Type comparison issue > - > > Key: SPARK-5456 > URL: https://issues.apache.org/jira/browse/SPARK-5456 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0, 1.3.0 >Reporter: Kuldeep > > Not quite able to figure this out but here is a JUnit test to reproduce this, > in JavaAPISuite.java > {code:title=DecimalBug.java} > @Test > public void decimalQueryTest() { > List<Row> decimalTable = new ArrayList<Row>(); > decimalTable.add(RowFactory.create(new BigDecimal("1"), new > BigDecimal("2"))); > decimalTable.add(RowFactory.create(new BigDecimal("3"), new > BigDecimal("4"))); > JavaRDD<Row> rows = sc.parallelize(decimalTable); > List<StructField> fields = new ArrayList<StructField>(7); > fields.add(DataTypes.createStructField("a", > DataTypes.createDecimalType(), true)); > fields.add(DataTypes.createStructField("b", > DataTypes.createDecimalType(), true)); > sqlContext.applySchema(rows.rdd(), > DataTypes.createStructType(fields)).registerTempTable("foo"); > Assert.assertEquals(sqlContext.sql("select * from foo where a > > 0").collectAsList(), decimalTable); > } > {code} > Fails with > java.lang.ClassCastException: java.math.BigDecimal cannot be cast to > org.apache.spark.sql.types.Decimal -- This 
message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
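The workaround described in the comment — mapping decimal columns to doubles before registering the data with the SQL context — looks roughly like this in plain Python (the function name is illustrative; in the JIRA's Java/Scala setting it would be a map over the RDD's rows):

```python
from decimal import Decimal

def decimals_to_double(row):
    """Map Decimal cells to floats so comparisons no longer hit the
    BigDecimal-vs-internal-Decimal cast problem described above."""
    return tuple(float(v) if isinstance(v, Decimal) else v for v in row)

print(decimals_to_double((Decimal("1"), Decimal("2"))))  # (1.0, 2.0)
```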
[jira] [Created] (SPARK-6621) Calling EventLoop.stop in EventLoop.onReceive and EventLoop.onError should call onStop
Shixiong Zhu created SPARK-6621: --- Summary: Calling EventLoop.stop in EventLoop.onReceive and EventLoop.onError should call onStop Key: SPARK-6621 URL: https://issues.apache.org/jira/browse/SPARK-6621 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
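The contract at stake can be sketched in plain Python (this is not Spark's Scala `EventLoop`; names are illustrative): `stop()` must still run the `onStop` hook even when it is invoked from inside `onReceive` or `onError`.

```python
class EventLoop:
    """Minimal sketch: stop() runs the on_stop hook exactly once, including
    when called re-entrantly from inside the receive handler."""
    def __init__(self):
        self._stopped = False
        self.on_stop_called = False

    def on_receive(self, event):
        if event == "shutdown":
            self.stop()       # stop() called from inside the handler

    def on_stop(self):
        self.on_stop_called = True

    def stop(self):
        if not self._stopped:  # idempotent: run cleanup only once
            self._stopped = True
            self.on_stop()

loop = EventLoop()
loop.on_receive("shutdown")
print(loop.on_stop_called)  # True
```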
[jira] [Commented] (SPARK-6620) Speed up toDF() and rdd() functions by constructing converters in ScalaReflection
[ https://issues.apache.org/jira/browse/SPARK-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387898#comment-14387898 ] Apache Spark commented on SPARK-6620: - User 'vlyubin' has created a pull request for this issue: https://github.com/apache/spark/pull/5279 > Speed up toDF() and rdd() functions by constructing converters in > ScalaReflection > - > > Key: SPARK-6620 > URL: https://issues.apache.org/jira/browse/SPARK-6620 > Project: Spark > Issue Type: Improvement >Reporter: Volodymyr Lyubinets > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6620) Speed up toDF() and rdd() functions by constructing converters in ScalaReflection
[ https://issues.apache.org/jira/browse/SPARK-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6620: --- Assignee: (was: Apache Spark) > Speed up toDF() and rdd() functions by constructing converters in > ScalaReflection > - > > Key: SPARK-6620 > URL: https://issues.apache.org/jira/browse/SPARK-6620 > Project: Spark > Issue Type: Improvement >Reporter: Volodymyr Lyubinets > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6620) Speed up toDF() and rdd() functions by constructing converters in ScalaReflection
[ https://issues.apache.org/jira/browse/SPARK-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6620: --- Assignee: Apache Spark > Speed up toDF() and rdd() functions by constructing converters in > ScalaReflection > - > > Key: SPARK-6620 > URL: https://issues.apache.org/jira/browse/SPARK-6620 > Project: Spark > Issue Type: Improvement >Reporter: Volodymyr Lyubinets >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6620) Speed up toDF() and rdd() functions by constructing converters in ScalaReflection
Volodymyr Lyubinets created SPARK-6620: -- Summary: Speed up toDF() and rdd() functions by constructing converters in ScalaReflection Key: SPARK-6620 URL: https://issues.apache.org/jira/browse/SPARK-6620 Project: Spark Issue Type: Improvement Reporter: Volodymyr Lyubinets -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6606) Accumulator deserialized twice because the NarrowCoGroupSplitDep contains rdd object.
[ https://issues.apache.org/jira/browse/SPARK-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SuYan closed SPARK-6606. Resolution: Duplicate Duplicate of SPARK-5360, see https://github.com/apache/spark/pull/4145 > Accumulator deserialized twice because the NarrowCoGroupSplitDep contains rdd > object. > - > > Key: SPARK-6606 > URL: https://issues.apache.org/jira/browse/SPARK-6606 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0, 1.3.0 >Reporter: SuYan > > 1. With code like the example below, the accumulator is found to be deserialized twice. > first: > {code} > task = ser.deserialize[Task[Any]](taskBytes, > Thread.currentThread.getContextClassLoader) > {code} > second: > {code} > val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])]( > ByteBuffer.wrap(taskBinary.value), > Thread.currentThread.getContextClassLoader) > {code} > The first deserialization is not what is expected, > because a ResultTask or ShuffleMapTask has a partition object. > In the class > {code} > CoGroupedRDD[K](@transient var rdds: Seq[RDD[_ <: Product2[K, _]]], part: > Partitioner) > {code}, the CoGroupPartition may contain a CoGroupSplitDep: > {code} > NarrowCoGroupSplitDep( > rdd: RDD[_], > splitIndex: Int, > var split: Partition > ) extends CoGroupSplitDep { > {code} > That *NarrowCoGroupSplitDep* pulls in the rdd object, which results > in the first deserialization. 
> example: > {code} >val acc1 = sc.accumulator(0, "test1") > val acc2 = sc.accumulator(0, "test2") > val rdd1 = sc.parallelize((1 to 10).toSeq, 3) > val rdd2 = sc.parallelize((1 to 10).toSeq, 3) > val combine1 = rdd1.map { case a => (a, 1)}.combineByKey(a => { > acc1 += 1 > a > }, (a: Int, b: Int) => { > a + b > }, > (a: Int, b: Int) => { > a + b > }, new HashPartitioner(3), mapSideCombine = false) > val combine2 = rdd2.map { case a => (a, 1)}.combineByKey( > a => { > acc2 += 1 > a > }, > (a: Int, b: Int) => { > a + b > }, > (a: Int, b: Int) => { > a + b > }, new HashPartitioner(3), mapSideCombine = false) > combine1.cogroup(combine2, new HashPartitioner(3)).count() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
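The root cause described above is generic to Java serialization: a small object that holds a reference to a large one drags the large object (and everything it references) into the stream, so deserializing the partition also deserializes the enclosing RDD a second time. A minimal, Spark-free sketch of that effect (all class names here are hypothetical stand-ins, not Spark classes):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Stand-in for an RDD: counts how many times it is deserialized.
class FakeRdd implements Serializable {
    static int deserializations = 0;
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        deserializations++;
    }
}

// Stand-in for NarrowCoGroupSplitDep: a "split" that references its RDD.
class SplitDep implements Serializable {
    final FakeRdd rdd;   // this field drags the whole RDD into the stream
    SplitDep(FakeRdd rdd) { this.rdd = rdd; }
}

class SerDemo {
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) { oos.writeObject(o); }
        return bos.toByteArray();
    }
    static Object deserialize(byte[] b) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(b))) {
            return ois.readObject();
        }
    }
}
```

Serializing the "task binary" and the partition separately therefore yields two independent copies of the RDD, and deserializing both runs the RDD's deserialization hooks twice.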
[jira] [Commented] (SPARK-5371) Failure to analyze query with UNION ALL and double aggregation
[ https://issues.apache.org/jira/browse/SPARK-5371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387847#comment-14387847 ] Apache Spark commented on SPARK-5371: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/5278 > Failure to analyze query with UNION ALL and double aggregation > -- > > Key: SPARK-5371 > URL: https://issues.apache.org/jira/browse/SPARK-5371 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0, 1.3.0 >Reporter: David Ross >Assignee: Michael Armbrust >Priority: Critical > > This SQL session: > {code} > DROP TABLE > test1; > DROP TABLE > test2; > CREATE TABLE > test1 > ( > c11 INT, > c12 INT, > c13 INT, > c14 INT > ); > CREATE TABLE > test2 > ( > c21 INT, > c22 INT, > c23 INT, > c24 INT > ); > SELECT > MIN(t3.c_1), > MIN(t3.c_2), > MIN(t3.c_3), > MIN(t3.c_4) > FROM > ( > SELECT > SUM(t1.c11) c_1, > NULL c_2, > NULL c_3, > NULL c_4 > FROM > test1 t1 > UNION ALL > SELECT > NULL c_1, > SUM(t2.c22) c_2, > SUM(t2.c23) c_3, > SUM(t2.c24) c_4 > FROM > test2 t2 ) t3; > {code} > Produces this error: > {code} > 15/01/23 00:25:21 INFO thriftserver.SparkExecuteStatementOperation: Running > query 'SELECT > MIN(t3.c_1), > MIN(t3.c_2), > MIN(t3.c_3), > MIN(t3.c_4) > FROM > ( > SELECT > SUM(t1.c11) c_1, > NULL c_2, > NULL c_3, > NULL c_4 > FROM > test1 t1 > UNION ALL > SELECT > NULL c_1, > SUM(t2.c22) c_2, > SUM(t2.c23) c_3, > SUM(t2.c24) c_4 > FROM > test2 t2 ) t3' > 15/01/23 00:25:21 INFO parse.ParseDriver: Parsing command: SELECT > MIN(t3.c_1), > MIN(t3.c_2), > MIN(t3.c_3), > MIN(t3.c_4) > FROM > ( > SELECT > SUM(t1.c11) c_1, > NULL c_2, > NULL c_3, > NULL c_4 > FROM > test1 t1 > UNION ALL > SELECT > NULL c_1, > SUM(t2.c22) c_2, > SUM(t2.c23) c_3, > SUM(t2.c24) c_4 > FROM > test2 t2 ) t3 > 15/01/23 00:25:21 INFO parse.ParseDriver: Parse Completed > 15/01/23 00:25:21 ERROR thriftserver.SparkExecuteStatementOperation: Error > executing query: > 
java.util.NoSuchElementException: key not found: c_2#23488 > at scala.collection.MapLike$class.default(MapLike.scala:228) > at > org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:29) > at scala.collection.MapLike$class.apply(MapLike.scala:141) > at > org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:29) > at > org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$1.applyOrElse(Optimizer.scala:77) > at > org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$1.applyOrElse(Optimizer.scala:76) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) > at > org.apache.spark.sql.catalyst.optimizer.UnionPushdown$.pushToRight(Optimizer.scala:76) > at > org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$apply$1$$anonfun$applyOrElse$6.apply(Optimizer.scala:98) > at > org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$apply$1$$anonfun$applyOrElse$6.apply(Optimizer.scala:98) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$apply$1.applyOrElse(Optimizer.scala:98) > at > org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$apply$1.applyOrElse(Optimizer.scala:85) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$ano
[jira] [Assigned] (SPARK-5371) Failure to analyze query with UNION ALL and double aggregation
[ https://issues.apache.org/jira/browse/SPARK-5371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5371: --- Assignee: Michael Armbrust (was: Apache Spark)
[jira] [Assigned] (SPARK-5371) Failure to analyze query with UNION ALL and double aggregation
[ https://issues.apache.org/jira/browse/SPARK-5371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5371: --- Assignee: Apache Spark (was: Michael Armbrust)
[jira] [Updated] (SPARK-5371) Failure to analyze query with UNION ALL and double aggregation
[ https://issues.apache.org/jira/browse/SPARK-5371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5371: Summary: Failure to analyze query with UNION ALL and double aggregation (was: SparkSQL Fails to analyze Query with UNION ALL in subquery)
[jira] [Assigned] (SPARK-5371) SparkSQL Fails to parse Query with UNION ALL in subquery
[ https://issues.apache.org/jira/browse/SPARK-5371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-5371: --- Assignee: Michael Armbrust
[jira] [Updated] (SPARK-5371) SparkSQL Fails to parse Query with UNION ALL in subquery
[ https://issues.apache.org/jira/browse/SPARK-5371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5371: Priority: Critical (was: Major) Target Version/s: 1.3.1 Affects Version/s: 1.2.0 1.3.0
[jira] [Updated] (SPARK-5371) SparkSQL Fails to analyze Query with UNION ALL in subquery
[ https://issues.apache.org/jira/browse/SPARK-5371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5371: Summary: SparkSQL Fails to analyze Query with UNION ALL in subquery (was: SparkSQL Fails to parse Query with UNION ALL in subquery)
[jira] [Resolved] (SPARK-6605) Same transformation in DStream leads to different result
[ https://issues.apache.org/jira/browse/SPARK-6605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SaintBacchus resolved SPARK-6605. - Resolution: Won't Fix {{reduceByKeyAndWindow}} has two implementations that produce different results when an empty window occurs, but we consider this a difference, not a problem. If a user wants to remove the empty keys produced by {{ReducedWindowedDStream}}, they can apply a {{filter}} function to remove them. > Same transformation in DStream leads to different result > > > Key: SPARK-6605 > URL: https://issues.apache.org/jira/browse/SPARK-6605 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.3.0 >Reporter: SaintBacchus > Fix For: 1.4.0 > > > The transformation *reduceByKeyAndWindow* has two implementations: one uses > *WindowedDStream* and the other uses *ReducedWindowedDStream*. > The results are always the same, except when an empty window occurs. > In a wordcount example, if a period of time (longer than the window duration) has no > data coming in, the first *reduceByKeyAndWindow* has no elements inside, but the > second has many elements with zero values inside. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
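The difference described above can be reproduced without Spark: a full recompute over an empty window yields no keys at all, while an incremental reduction (add the entering batch, subtract the leaving batch) keeps previously-seen keys with zero counts. A hedged sketch of the two strategies (a simplified model of the idea, not the actual DStream implementations):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WindowCount {
    // "WindowedDStream" style: recompute counts from the batches in the window.
    static Map<String, Integer> recompute(List<List<String>> windowBatches) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> batch : windowBatches)
            for (String w : batch) counts.merge(w, 1, Integer::sum);
        return counts;
    }

    // "ReducedWindowedDStream" style: previous counts + entering batch - leaving batch.
    static Map<String, Integer> incremental(Map<String, Integer> prev,
                                            List<String> entering, List<String> leaving) {
        Map<String, Integer> counts = new HashMap<>(prev);
        for (String w : entering) counts.merge(w, 1, Integer::sum);
        for (String w : leaving) counts.merge(w, -1, Integer::sum);
        return counts;  // keys that net to zero are kept, not removed
    }
}
```

When the window empties, `recompute` returns an empty map while `incremental` returns every old key mapped to 0, which is exactly the divergence the issue describes and why a trailing `filter` on the incremental result restores agreement.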
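The difference between the two reduceByKeyAndWindow code paths, and the suggested {{filter}} workaround, can be sketched in plain Python. This simulates the word-count semantics only; it is not the Spark API, and the function names are illustrative:

```python
# Plain-Python model of SPARK-6605: on an empty window, recomputing from
# scratch yields no keys, while the incremental (add/subtract) path keeps
# keys whose counts have dropped to zero.

def recompute_window(batches):
    """Naive path: re-reduce every batch currently in the window."""
    counts = {}
    for batch in batches:
        for word in batch:
            counts[word] = counts.get(word, 0) + 1
    return counts

def incremental_window(prev_counts, entering, leaving):
    """Incremental path: add entering words, subtract leaving ones.
    Keys whose count drops to zero are retained, mirroring the
    ReducedWindowedDStream behavior described above."""
    counts = dict(prev_counts)
    for word in entering:
        counts[word] = counts.get(word, 0) + 1
    for word in leaving:
        counts[word] = counts.get(word, 0) - 1
    return counts

prev = {"a": 2, "b": 1}
# The window slides until it is empty: nothing enters, everything leaves.
naive = recompute_window([])                              # no keys at all
incremental = incremental_window(prev, [], ["a", "a", "b"])  # zero-valued keys
# The workaround from the resolution: filter out the zero-valued keys.
filtered = {k: v for k, v in incremental.items() if v != 0}
```

In Spark itself the equivalent of the final step is passing a filter function so that zero-count entries are dropped from the incremental window state.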
[jira] [Commented] (SPARK-6619) Improve Jar caching on executors
[ https://issues.apache.org/jira/browse/SPARK-6619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387783#comment-14387783 ] Mingyu Kim commented on SPARK-6619: --- [~li-zhihui], [~joshrosen], since you worked on SPARK-2713: I'll prepare a PR in the next couple of days, but wanted to get your thoughts in the meantime. > Improve Jar caching on executors > > > Key: SPARK-6619 > URL: https://issues.apache.org/jira/browse/SPARK-6619 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Mingyu Kim > > Taking SPARK-2713 one step further so that > - The cached jars can be used by multiple applications. To do that, > I'm planning to use MD5 as the cache key as opposed to URL hash and timestamp. > - The cached jars are hard-linked to the work directory as opposed to being > copied. > Re: perf. Computing MD5 using "openssl" on my local MacBook Pro took 1.2s for > 158 jars with a total size of 56MB, and these take ~10s to ship to the > executor at start-up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6619) Improve Jar caching on executors
[ https://issues.apache.org/jira/browse/SPARK-6619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mingyu Kim updated SPARK-6619: -- Description: Taking SPARK-2713 one step further so that - The cached jars can be used by multiple applications. In order to do that, I'm planning to use MD5 as the cache key as opposed to url hash and timestamp. - The cached jars are hard-linked to the work directory as opposed to being copied. Re: perf. Computing MD5 using "openssl" on my local Macbook Pro took 1.2s for 158 jars with the total size of 56MB, and this takes ~10s to ship to the executor at the start-up. was: Taking SPARK-2713 one step further so that the cached jars can be used by multiple applications. In order to do that, I'm planning to use MD5 as the cache key as opposed to url hash and timestamp. Re: perf. Computing MD5 using "openssl" on my local Macbook Pro took 1.2s for 158 jars with the total size of 56MB, and this takes 5~10s to Summary: Improve Jar caching on executors (was: Jar cache on Executors should use file content hash as the key) > Improve Jar caching on executors > > > Key: SPARK-6619 > URL: https://issues.apache.org/jira/browse/SPARK-6619 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Mingyu Kim > > Taking SPARK-2713 one step further so that > - The cached jars can be used by multiple applications. In order to do that, > I'm planning to use MD5 as the cache key as opposed to url hash and timestamp. > - The cached jars are hard-linked to the work directory as opposed to being > copied. > Re: perf. Computing MD5 using "openssl" on my local Macbook Pro took 1.2s for > 158 jars with the total size of 56MB, and this takes ~10s to ship to the > executor at the start-up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6619) Jar cache on Executors should use file content hash as the key
Mingyu Kim created SPARK-6619: - Summary: Jar cache on Executors should use file content hash as the key Key: SPARK-6619 URL: https://issues.apache.org/jira/browse/SPARK-6619 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Mingyu Kim Taking SPARK-2713 one step further so that the cached jars can be used by multiple applications. In order to do that, I'm planning to use MD5 as the cache key as opposed to url hash and timestamp. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
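The content-hash cache key proposed here can be sketched in a few lines. This is a minimal Python illustration rather than Spark's Scala utilities, and the helper name is made up:

```python
import hashlib

# Sketch of a content-addressed jar cache key (the SPARK-6619 proposal):
# hash the jar's bytes (MD5, per the description) instead of keying on
# URL hash + timestamp, so byte-identical jars fetched by different
# applications map to the same cache entry.
def content_key(path, chunk_size=1 << 20):
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Stream in chunks so large jars are not read into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```

Two byte-identical jars produce the same key regardless of where or when they were fetched, which is what would let the cache be shared across applications.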
[jira] [Comment Edited] (SPARK-6239) Spark MLlib fpm#FPGrowth minSupport should use long instead
[ https://issues.apache.org/jira/browse/SPARK-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387748#comment-14387748 ] Littlestar edited comment on SPARK-6239 at 3/31/15 1:09 AM: >>>I would imagine a relative value is more usually useful. When recnum=12345678 and minsupport=0.003, recnum*minsupport is close to an integer, and results near that boundary can be lost because of double precision. was (Author: cnstar9988): >>If I want to set minCount=2, I must use .setMinSupport(1.99/(rdd.count())), >>because of double's precision. How can I reopen this PR and mark its relation to pull/5246? Thanks. > Spark MLlib fpm#FPGrowth minSupport should use long instead > --- > > Key: SPARK-6239 > URL: https://issues.apache.org/jira/browse/SPARK-6239 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Littlestar >Priority: Minor > > Spark MLlib fpm#FPGrowth minSupport should use long instead > == > val minCount = math.ceil(minSupport * count).toLong > because: > 1. [count] the number of records in the dataset is not known before reading. > 2. [minSupport] double precision. > from mahout#FPGrowthDriver.java > addOption("minSupport", "s", "(Optional) The minimum number of times a > co-occurrence must be present." > + " Default Value: 3", "3"); > I just want to set minCount=2 for a test. > Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
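The precision problem being discussed can be reproduced in a few lines of plain Python standing in for the Scala expression `math.ceil(minSupport * count).toLong`:

```python
import math

# Sketch of the SPARK-6239 complaint: FPGrowth derives an absolute
# minCount from a fractional minSupport via ceil(minSupport * count).
# Recovering an exact integer threshold through a double round-trip is
# fragile, which is why the reporter resorts to 1.99 instead of 2.0.
count = 12345678
naive = math.ceil((2.0 / count) * count)        # may round up to 3
workaround = math.ceil((1.99 / count) * count)  # lands safely below 2, ceil = 2
```

An API that accepted an absolute long minCount directly, as the issue title suggests, would avoid the double round-trip entirely.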
[jira] [Commented] (SPARK-6618) HiveMetastoreCatalog.lookupRelation should use fine-grained lock
[ https://issues.apache.org/jira/browse/SPARK-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387738#comment-14387738 ] Yin Huai commented on SPARK-6618: - cc [~marmbrus] and [~lian cheng]. I am going to address this today. > HiveMetastoreCatalog.lookupRelation should use fine-grained lock > > > Key: SPARK-6618 > URL: https://issues.apache.org/jira/browse/SPARK-6618 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > > Right now the entire method of HiveMetastoreCatalog.lookupRelation has a lock > (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L173) > and the scope of lock will cover resolving data source tables > (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L93). > So, lookupRelation can be extremely expensive when we are doing expensive > operations like parquet schema discovery. So, we should use fine-grained lock > for lookupRelation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6618) HiveMetastoreCatalog.lookupRelation should use fine-grained lock
[ https://issues.apache.org/jira/browse/SPARK-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6618: Target Version/s: 1.3.1 (was: 1.3.0) > HiveMetastoreCatalog.lookupRelation should use fine-grained lock > > > Key: SPARK-6618 > URL: https://issues.apache.org/jira/browse/SPARK-6618 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > > Right now the entire method of HiveMetastoreCatalog.lookupRelation has a lock > (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L173) > and the scope of lock will cover resolving data source tables > (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L93). > So, lookupRelation can be extremely expensive when we are doing expensive > operations like parquet schema discovery. So, we should use fine-grained lock > for lookupRelation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6618) HiveMetastoreCatalog.lookupRelation should use fine-grained lock
Yin Huai created SPARK-6618: --- Summary: HiveMetastoreCatalog.lookupRelation should use fine-grained lock Key: SPARK-6618 URL: https://issues.apache.org/jira/browse/SPARK-6618 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Right now the entire method of HiveMetastoreCatalog.lookupRelation has a lock (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L173) and the scope of lock will cover resolving data source tables (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L93). So, lookupRelation can be extremely expensive when we are doing expensive operations like parquet schema discovery. So, we should use fine-grained lock for lookupRelation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
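The locking change being proposed can be sketched abstractly: hold the lock only around shared-state access, not around the expensive resolution work. This is a plain-Python illustration of the pattern, with made-up names, not the actual HiveMetastoreCatalog code:

```python
import threading

# Coarse vs. fine-grained locking around an expensive lookup (the shape
# of the SPARK-6618 change). In the coarse version, every lookup of an
# unresolved table serializes all other lookups behind the expensive
# resolve step; in the fine version, only cache reads/writes are locked.
class Catalog:
    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}

    def lookup_relation_coarse(self, name, resolve):
        with self._lock:                 # lock held for the whole resolution
            if name not in self._cache:
                self._cache[name] = resolve(name)   # expensive, serialized
            return self._cache[name]

    def lookup_relation_fine(self, name, resolve):
        with self._lock:                 # fast cache check under the lock
            cached = self._cache.get(name)
        if cached is not None:
            return cached
        resolved = resolve(name)         # expensive work outside the lock
        with self._lock:                 # publish; another thread may have won
            return self._cache.setdefault(name, resolved)
```

The fine-grained version trades a possible duplicated resolution under contention for never blocking other lookups on expensive work such as Parquet schema discovery.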
[jira] [Updated] (SPARK-6617) Word2Vec is nondeterministic
[ https://issues.apache.org/jira/browse/SPARK-6617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6617: - Summary: Word2Vec is nondeterministic (was: Word2Vec is not deterministic) > Word2Vec is nondeterministic > > > Key: SPARK-6617 > URL: https://issues.apache.org/jira/browse/SPARK-6617 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Xiangrui Meng >Priority: Minor > > Word2Vec uses repartition: > https://github.com/apache/spark/blob/v1.3.0/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L291, > which doesn't provide deterministic ordering. This makes QA a little harder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6617) Word2Vec is not deterministic
Xiangrui Meng created SPARK-6617: Summary: Word2Vec is not deterministic Key: SPARK-6617 URL: https://issues.apache.org/jira/browse/SPARK-6617 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Xiangrui Meng Priority: Minor Word2Vec uses repartition: https://github.com/apache/spark/blob/v1.3.0/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L291, which doesn't provide deterministic ordering. This makes QA a little harder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
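Why repartition-style redistribution makes results order-dependent can be modeled in a few lines. This sketches the idea only (round-robin placement depends on arrival order, content-based placement does not); it is not Spark's shuffle implementation:

```python
# Model of the SPARK-6617 nondeterminism: round-robin assignment of
# elements to partitions depends on the order elements arrive, whereas a
# content-based (hash) assignment gives the same layout for any order.

def round_robin(items, n):
    parts = [[] for _ in range(n)]
    for i, x in enumerate(items):
        parts[i % n].append(x)       # placement depends on position i
    return parts

def hash_partition(items, n):
    parts = [[] for _ in range(n)]
    for x in items:
        parts[hash(x) % n].append(x)  # placement depends only on content
    return [sorted(p) for p in parts]

order_a = ["spark", "word", "vector", "model"]
order_b = list(reversed(order_a))
# Same data, different input order: round-robin layouts differ,
# hash-based layouts agree.
```

A deterministic redistribution (or a fixed ordering before training) is the kind of change that would make Word2Vec runs reproducible for QA.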
[jira] [Resolved] (SPARK-6369) InsertIntoHiveTable and Parquet Relation should use logic from SparkHadoopWriter
[ https://issues.apache.org/jira/browse/SPARK-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6369. --- Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Issue resolved by pull request 5139 [https://github.com/apache/spark/pull/5139] > InsertIntoHiveTable and Parquet Relation should use logic from > SparkHadoopWriter > > > Key: SPARK-6369 > URL: https://issues.apache.org/jira/browse/SPARK-6369 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Cheng Lian >Priority: Blocker > Fix For: 1.3.1, 1.4.0 > > > Right now it is possible that we will corrupt the output if there is a race > between competing speculative tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6616) IsStopped set to true before stop() is complete.
[ https://issues.apache.org/jira/browse/SPARK-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilya Ganelin updated SPARK-6616: Description: There are numerous instances throughout the code base of the following: {code} if (!stopped) { stopped = true ... } {code} In general, this is bad practice since it can cause an incomplete cleanup if there is an error during shutdown and not all code executes. Incomplete cleanup is harder to track down than a double cleanup that triggers some error. I propose fixing this throughout the code, starting with the cleanup sequence with {code}SparkContext.stop() {code}. A cursory examination reveals this in {code}SparkContext.stop(), SparkEnv.stop(), and ContextCleaner.stop() {code}. was: There are numerous instances throughout the code base of the following: {code} if (!stopped) { stopped = true ... } {code} In general, this is bad practice since it can cause an incomplete cleanup if there is an error during shutdown and not all code executes. Incomplete cleanup is harder to track down than a double cleanup that triggers some error. I propose fixing this throughout the code, starting with the cleanup sequence with {{code}}SparkContext.stop() {{code}}. A cursory examination reveals this in {{code}}SparkContext.stop(), SparkEnv.stop(), and ContextCleaner.stop() {{code}}. > IsStopped set to true in before stop() is complete. > --- > > Key: SPARK-6616 > URL: https://issues.apache.org/jira/browse/SPARK-6616 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Ilya Ganelin > > There are numerous instances throughout the code base of the following: > {code} > if (!stopped) { > stopped = true > ... > } > {code} > In general, this is bad practice since it can cause an incomplete cleanup if > there is an error during shutdown and not all code executes. Incomplete > cleanup is harder to track down than a double cleanup that triggers some > error. 
I propose fixing this throughout the code, starting with the cleanup > sequence in {{SparkContext.stop()}}. > A cursory examination reveals this pattern in {{SparkContext.stop()}}, > {{SparkEnv.stop()}}, and {{ContextCleaner.stop()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6616) IsStopped set to true before stop() is complete.
Ilya Ganelin created SPARK-6616: --- Summary: IsStopped set to true in before stop() is complete. Key: SPARK-6616 URL: https://issues.apache.org/jira/browse/SPARK-6616 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Ilya Ganelin There are numerous instances throughout the code base of the following: ``` if (!stopped) { stopped = true ... } ``` In general, this is bad practice since it can cause an incomplete cleanup if there is an error during shutdown and not all code executes. Incomplete cleanup is harder to track down than a double cleanup that triggers some error. I propose fixing this throughout the code, starting with the cleanup sequence with ```SparkContext.stop()```. A cursory examination reveals this in ```SparkContext.stop(), SparkEnv.stop(), and ContextCleaner.stop()```. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
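One safer shape for the pattern criticized above is to guard re-entry with a separate "stopping" latch and flip the stopped flag only after cleanup finishes. This is an illustrative plain-Python sketch of that idea, not Spark's actual stop() code:

```python
import threading

# SPARK-6616's concern: setting `stopped = true` before cleanup runs means
# a failure mid-shutdown leaves the object claiming to be stopped while
# resources leak. Here, a private `_stopping` latch makes stop() idempotent,
# and `stopped` becomes true only once every cleanup step has completed.
class Service:
    def __init__(self):
        self._lock = threading.Lock()
        self._stopping = False
        self.stopped = False
        self.cleaned = []

    def stop(self):
        with self._lock:
            if self._stopping:          # idempotent: a second call is a no-op
                return
            self._stopping = True
        self.cleaned.append("event_loop")  # cleanup steps run exactly once...
        self.cleaned.append("context")
        self.stopped = True                # ...and the flag flips only at the end
```

If a cleanup step throws, `stopped` stays false, so a retry (rather than a silent partial shutdown) is still possible.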
[jira] [Created] (SPARK-6615) Python API for Word2Vec
Kai Sasaki created SPARK-6615: - Summary: Python API for Word2Vec Key: SPARK-6615 URL: https://issues.apache.org/jira/browse/SPARK-6615 Project: Spark Issue Type: Task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Kai Sasaki Priority: Minor Fix For: 1.4.0 This is a sub-task of [SPARK-6254|https://issues.apache.org/jira/browse/SPARK-6254]. Wrap the missing methods for {{Word2Vec}} and {{Word2VecModel}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6492) SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies
[ https://issues.apache.org/jira/browse/SPARK-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387568#comment-14387568 ] Josh Rosen commented on SPARK-6492: --- Timeouts are one way to fix this, but I wonder if we could also try to remove the circular wait condition by modifying EventLoop so that {{stopped}} is set before we call {{onError}}. This would prevent calls to {{EventLoop.stop()}} from blocking while the event loop is in the process of shutting down, which should prevent this race. > SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies > --- > > Key: SPARK-6492 > URL: https://issues.apache.org/jira/browse/SPARK-6492 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0, 1.4.0 >Reporter: Josh Rosen >Priority: Critical > > A deadlock can occur when DAGScheduler death causes a SparkContext to be shut > down while user code is concurrently racing to stop the SparkContext in a > finally block. > For example: > {code} > try { > sc = new SparkContext("local", "test") > // start running a job that causes the DAGSchedulerEventProcessLoop to > crash > someRDD.doStuff() > } finally { > sc.stop() // stop the SparkContext once the failure in DAGScheduler causes > the above job to fail with an exception > } > {code} > This leads to a deadlock. 
The event processor thread tries to lock on the > {{SparkContext.SPARK_CONTEXT_CONSTRUCTOR_LOCK}} and becomes blocked because > the thread that holds that lock is waiting for the event processor thread to > join: > {code} > "dag-scheduler-event-loop" daemon prio=5 tid=0x7ffa69456000 nid=0x9403 > waiting for monitor entry [0x0001223ad000] >java.lang.Thread.State: BLOCKED (on object monitor) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1398) > - waiting to lock <0x0007f5037b08> (a java.lang.Object) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onError(DAGScheduler.scala:1412) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:52) > {code} > {code} > "pool-1-thread-1-ScalaTest-running-SparkContextSuite" prio=5 > tid=0x7ffa69864800 nid=0x5903 in Object.wait() [0x0001202dc000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x0007f4b28000> (a > org.apache.spark.util.EventLoop$$anon$1) > at java.lang.Thread.join(Thread.java:1281) > - locked <0x0007f4b28000> (a > org.apache.spark.util.EventLoop$$anon$1) > at java.lang.Thread.join(Thread.java:1355) > at org.apache.spark.util.EventLoop.stop(EventLoop.scala:79) > at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1352) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1405) > - locked <0x0007f5037b08> (a java.lang.Object) > [...] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
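The suggestion in the comment above — set {{stopped}} before invoking {{onError}} — can be sketched as follows. This is a simplified, hypothetical Java model (names like MiniEventLoop are invented, not Spark's EventLoop): because the event thread marks itself stopped before reporting the error, any stop() call made from inside onError() returns immediately instead of attempting to join the event thread, which removes the circular wait.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the proposed ordering: the event thread sets `stopped`
// BEFORE invoking the error callback, so a stop() issued from within
// onError() (or racing with it) is a no-op rather than a blocking
// join on a thread that is itself waiting -- the deadlock in the
// attached thread dumps.
abstract class MiniEventLoop {
    private final AtomicBoolean stopped = new AtomicBoolean(false);
    private final Thread eventThread = new Thread(() -> {
        try {
            runLoop();
        } catch (Throwable t) {
            stopped.set(true); // mark stopped first...
            onError(t);        // ...then report; onError may call stop() safely
        }
    });

    void start() { eventThread.start(); }

    void stop() throws InterruptedException {
        if (stopped.compareAndSet(false, true)) {
            eventThread.interrupt();
            eventThread.join(); // only the first external caller joins
        }
        // already stopped (possibly by the event thread itself):
        // no join, hence no self-join deadlock
    }

    protected abstract void runLoop() throws Exception;
    protected abstract void onError(Throwable t);
}
```

With the original ordering (onError first, stopped second), a stop() reached from onError would try to join the event thread from the event thread itself and block forever.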
[jira] [Commented] (SPARK-5886) Add LabelIndexer
[ https://issues.apache.org/jira/browse/SPARK-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387567#comment-14387567 ] Joseph K. Bradley commented on SPARK-5886: -- Also, should this index native types other than Strings? It would be a shame to need a separate class for each of the other native types, such as Double and Int. Maybe the two distinctions we need are: * This class indexes native types. * [SPARK-4081] indexes Vector and Array types. > Add LabelIndexer > > > Key: SPARK-5886 > URL: https://issues.apache.org/jira/browse/SPARK-5886 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > `LabelIndexer` takes a column of labels (raw categories) and outputs an > integer column with labels indexed by their frequency. > {code} > val li = new LabelIndexer() > .setInputCol("country") > .setOutputCol("countryIndex") > {code} > In the output column, we should store the label-to-index map as an ML > attribute. The index should be ordered by frequency, where the most frequent > label gets index 0, to enhance sparsity. > We can discuss whether this should index multiple columns at the same time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
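The frequency-ordered indexing that the issue describes can be sketched in a few lines. This is an illustrative Java sketch (class and method names are invented, not Spark's API): count each label, then assign index 0 to the most frequent label, index 1 to the next, and so on, which improves sparsity when the index is later one-hot encoded.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch of frequency-ordered label indexing:
// the most frequent label gets index 0, ties broken alphabetically
// for determinism (a detail the issue leaves open).
class LabelIndexSketch {
    static Map<String, Integer> fit(List<String> labels) {
        Map<String, Long> counts = labels.stream()
            .collect(Collectors.groupingBy(l -> l, Collectors.counting()));
        List<String> ordered = new ArrayList<>(counts.keySet());
        ordered.sort((a, b) -> {
            int c = Long.compare(counts.get(b), counts.get(a)); // descending count
            return c != 0 ? c : a.compareTo(b);                 // alphabetical tie-break
        });
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < ordered.size(); i++) {
            index.put(ordered.get(i), i);
        }
        return index;
    }
}
```

For a country column with counts US=3, FR=2, DE=1, this yields US→0, FR→1, DE→2, matching the "most frequent label gets index 0" requirement.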
[jira] [Assigned] (SPARK-6492) SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies
[ https://issues.apache.org/jira/browse/SPARK-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6492: --- Assignee: (was: Apache Spark) > SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies > --- > > Key: SPARK-6492 > URL: https://issues.apache.org/jira/browse/SPARK-6492 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0, 1.4.0 >Reporter: Josh Rosen >Priority: Critical > > A deadlock can occur when DAGScheduler death causes a SparkContext to be shut > down while user code is concurrently racing to stop the SparkContext in a > finally block. > For example: > {code} > try { > sc = new SparkContext("local", "test") > // start running a job that causes the DAGSchedulerEventProcessLoop to > crash > someRDD.doStuff() > } finally { > sc.stop() // stop the SparkContext once the failure in DAGScheduler causes > the above job to fail with an exception > } > {code} > This leads to a deadlock. 
The event processor thread tries to lock on the > {{SparkContext.SPARK_CONTEXT_CONSTRUCTOR_LOCK}} and becomes blocked because > the thread that holds that lock is waiting for the event processor thread to > join: > {code} > "dag-scheduler-event-loop" daemon prio=5 tid=0x7ffa69456000 nid=0x9403 > waiting for monitor entry [0x0001223ad000] >java.lang.Thread.State: BLOCKED (on object monitor) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1398) > - waiting to lock <0x0007f5037b08> (a java.lang.Object) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onError(DAGScheduler.scala:1412) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:52) > {code} > {code} > "pool-1-thread-1-ScalaTest-running-SparkContextSuite" prio=5 > tid=0x7ffa69864800 nid=0x5903 in Object.wait() [0x0001202dc000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x0007f4b28000> (a > org.apache.spark.util.EventLoop$$anon$1) > at java.lang.Thread.join(Thread.java:1281) > - locked <0x0007f4b28000> (a > org.apache.spark.util.EventLoop$$anon$1) > at java.lang.Thread.join(Thread.java:1355) > at org.apache.spark.util.EventLoop.stop(EventLoop.scala:79) > at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1352) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1405) > - locked <0x0007f5037b08> (a java.lang.Object) > [...] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5205) Inconsistent behaviour between Streaming job and others, when click kill link in WebUI
[ https://issues.apache.org/jira/browse/SPARK-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5205: --- Assignee: Apache Spark > Inconsistent behaviour between Streaming job and others, when click kill link > in WebUI > -- > > Key: SPARK-5205 > URL: https://issues.apache.org/jira/browse/SPARK-5205 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: uncleGen >Assignee: Apache Spark > > The "kill" link is used to kill a stage in a job. It works for every kind of > Spark job except Spark Streaming. To be specific, we can only kill the stage > that runs the "Receiver", but not kill the "Receivers" themselves. The > stage can be killed and cleaned from the UI, but the receivers are still > alive and receiving data. I think this does not fit with common sense. > IMHO, killing the "receiver" stage should mean killing the "receivers" and > stopping data reception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5205) Inconsistent behaviour between Streaming job and others, when click kill link in WebUI
[ https://issues.apache.org/jira/browse/SPARK-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5205: --- Assignee: (was: Apache Spark) > Inconsistent behaviour between Streaming job and others, when click kill link > in WebUI > -- > > Key: SPARK-5205 > URL: https://issues.apache.org/jira/browse/SPARK-5205 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: uncleGen > > The "kill" link is used to kill a stage in a job. It works for every kind of > Spark job except Spark Streaming. To be specific, we can only kill the stage > that runs the "Receiver", but not kill the "Receivers" themselves. The > stage can be killed and cleaned from the UI, but the receivers are still > alive and receiving data. I think this does not fit with common sense. > IMHO, killing the "receiver" stage should mean killing the "receivers" and > stopping data reception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6492) SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies
[ https://issues.apache.org/jira/browse/SPARK-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6492: --- Assignee: Apache Spark > SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies > --- > > Key: SPARK-6492 > URL: https://issues.apache.org/jira/browse/SPARK-6492 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0, 1.4.0 >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Critical > > A deadlock can occur when DAGScheduler death causes a SparkContext to be shut > down while user code is concurrently racing to stop the SparkContext in a > finally block. > For example: > {code} > try { > sc = new SparkContext("local", "test") > // start running a job that causes the DAGSchedulerEventProcessLoop to > crash > someRDD.doStuff() > } finally { > sc.stop() // stop the SparkContext once the failure in DAGScheduler causes > the above job to fail with an exception > } > {code} > This leads to a deadlock. 
The event processor thread tries to lock on the > {{SparkContext.SPARK_CONTEXT_CONSTRUCTOR_LOCK}} and becomes blocked because > the thread that holds that lock is waiting for the event processor thread to > join: > {code} > "dag-scheduler-event-loop" daemon prio=5 tid=0x7ffa69456000 nid=0x9403 > waiting for monitor entry [0x0001223ad000] >java.lang.Thread.State: BLOCKED (on object monitor) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1398) > - waiting to lock <0x0007f5037b08> (a java.lang.Object) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onError(DAGScheduler.scala:1412) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:52) > {code} > {code} > "pool-1-thread-1-ScalaTest-running-SparkContextSuite" prio=5 > tid=0x7ffa69864800 nid=0x5903 in Object.wait() [0x0001202dc000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x0007f4b28000> (a > org.apache.spark.util.EventLoop$$anon$1) > at java.lang.Thread.join(Thread.java:1281) > - locked <0x0007f4b28000> (a > org.apache.spark.util.EventLoop$$anon$1) > at java.lang.Thread.join(Thread.java:1355) > at org.apache.spark.util.EventLoop.stop(EventLoop.scala:79) > at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1352) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1405) > - locked <0x0007f5037b08> (a java.lang.Object) > [...] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6492) SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies
[ https://issues.apache.org/jira/browse/SPARK-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387558#comment-14387558 ] Apache Spark commented on SPARK-6492: - User 'ilganeli' has created a pull request for this issue: https://github.com/apache/spark/pull/5277 > SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies > --- > > Key: SPARK-6492 > URL: https://issues.apache.org/jira/browse/SPARK-6492 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0, 1.4.0 >Reporter: Josh Rosen >Priority: Critical > > A deadlock can occur when DAGScheduler death causes a SparkContext to be shut > down while user code is concurrently racing to stop the SparkContext in a > finally block. > For example: > {code} > try { > sc = new SparkContext("local", "test") > // start running a job that causes the DAGSchedulerEventProcessLoop to > crash > someRDD.doStuff() > } finally { > sc.stop() // stop the SparkContext once the failure in DAGScheduler causes > the above job to fail with an exception > } > {code} > This leads to a deadlock. 
The event processor thread tries to lock on the > {{SparkContext.SPARK_CONTEXT_CONSTRUCTOR_LOCK}} and becomes blocked because > the thread that holds that lock is waiting for the event processor thread to > join: > {code} > "dag-scheduler-event-loop" daemon prio=5 tid=0x7ffa69456000 nid=0x9403 > waiting for monitor entry [0x0001223ad000] >java.lang.Thread.State: BLOCKED (on object monitor) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1398) > - waiting to lock <0x0007f5037b08> (a java.lang.Object) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onError(DAGScheduler.scala:1412) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:52) > {code} > {code} > "pool-1-thread-1-ScalaTest-running-SparkContextSuite" prio=5 > tid=0x7ffa69864800 nid=0x5903 in Object.wait() [0x0001202dc000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x0007f4b28000> (a > org.apache.spark.util.EventLoop$$anon$1) > at java.lang.Thread.join(Thread.java:1281) > - locked <0x0007f4b28000> (a > org.apache.spark.util.EventLoop$$anon$1) > at java.lang.Thread.join(Thread.java:1355) > at org.apache.spark.util.EventLoop.stop(EventLoop.scala:79) > at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1352) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1405) > - locked <0x0007f5037b08> (a java.lang.Object) > [...] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5886) Add LabelIndexer
[ https://issues.apache.org/jira/browse/SPARK-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387554#comment-14387554 ] Joseph K. Bradley commented on SPARK-5886: -- Was there any discussion about this indexing multiple columns at the same time? I think it should be able to, since that sounds easier and more efficient when indexing many columns. > Add LabelIndexer > > > Key: SPARK-5886 > URL: https://issues.apache.org/jira/browse/SPARK-5886 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > `LabelIndexer` takes a column of labels (raw categories) and outputs an > integer column with labels indexed by their frequency. > {code} > val li = new LabelIndexer() > .setInputCol("country") > .setOutputCol("countryIndex") > {code} > In the output column, we should store the label-to-index map as an ML > attribute. The index should be ordered by frequency, where the most frequent > label gets index 0, to enhance sparsity. > We can discuss whether this should index multiple columns at the same time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6603) SQLContext.registerFunction -> SQLContext.udf.register
[ https://issues.apache.org/jira/browse/SPARK-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-6603. Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 > SQLContext.registerFunction -> SQLContext.udf.register > -- > > Key: SPARK-6603 > URL: https://issues.apache.org/jira/browse/SPARK-6603 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Davies Liu > Fix For: 1.3.1, 1.4.0 > > > We didn't change the Python implementation to use that. Maybe the best > strategy is to deprecate SQLContext.registerFunction, and just add > SQLContext.udf.register. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
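The API reorganization resolved above — moving a flat registerFunction onto a nested udf namespace — can be illustrated with a tiny sketch. This is a hypothetical Java model (MiniSQLContext and UDFRegistration are invented names, not Spark classes), showing only the namespacing idea: the old entry point stays as a deprecated forwarder while new registrations go through ctx.udf().register(...).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the API move: registrations are grouped under
// a `udf` namespace object instead of living directly on the context,
// and the old flat method forwards to it so existing callers keep working.
class MiniSQLContext {
    private final UDFRegistration udf = new UDFRegistration();

    UDFRegistration udf() { return udf; }

    /** Old entry point, kept for compatibility but deprecated. */
    @Deprecated
    void registerFunction(String name, Function<Object, Object> f) {
        udf.register(name, f); // forward to the new namespace
    }

    static class UDFRegistration {
        private final Map<String, Function<Object, Object>> fns = new HashMap<>();
        void register(String name, Function<Object, Object> f) { fns.put(name, f); }
        Function<Object, Object> lookup(String name) { return fns.get(name); }
    }
}
```

The forwarding approach means both code paths land in one registry, so deprecating the old method later costs nothing.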
[jira] [Commented] (SPARK-6251) Mark parts of LBFGS, GradientDescent as DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-6251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387541#comment-14387541 ] Joseph K. Bradley commented on SPARK-6251: -- I'm closing this since we need to revamp the optimization API anyways. > Mark parts of LBFGS, GradientDescent as DeveloperApi > > > Key: SPARK-6251 > URL: https://issues.apache.org/jira/browse/SPARK-6251 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Trivial > > Should be DeveloperApi: > * optimize > * setGradient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6251) Mark parts of LBFGS, GradientDescent as DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-6251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-6251. Resolution: Won't Fix > Mark parts of LBFGS, GradientDescent as DeveloperApi > > > Key: SPARK-6251 > URL: https://issues.apache.org/jira/browse/SPARK-6251 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Trivial > > Should be DeveloperApi: > * optimize > * setGradient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6251) Mark parts of LBFGS, GradientDescent as DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-6251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6251: --- Assignee: Joseph K. Bradley (was: Apache Spark) > Mark parts of LBFGS, GradientDescent as DeveloperApi > > > Key: SPARK-6251 > URL: https://issues.apache.org/jira/browse/SPARK-6251 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Trivial > > Should be DeveloperApi: > * optimize > * setGradient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6251) Mark parts of LBFGS, GradientDescent as DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-6251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6251: --- Assignee: Apache Spark (was: Joseph K. Bradley) > Mark parts of LBFGS, GradientDescent as DeveloperApi > > > Key: SPARK-6251 > URL: https://issues.apache.org/jira/browse/SPARK-6251 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Trivial > > Should be DeveloperApi: > * optimize > * setGradient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6614) OutputCommitCoordinator should clear authorized committers only after authorized committer fails, not after any failure
[ https://issues.apache.org/jira/browse/SPARK-6614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6614: --- Assignee: Apache Spark (was: Josh Rosen) > OutputCommitCoordinator should clear authorized committers only after > authorized committer fails, not after any failure > --- > > Key: SPARK-6614 > URL: https://issues.apache.org/jira/browse/SPARK-6614 > Project: Spark > Issue Type: Bug >Affects Versions: 1.3.0, 1.3.1, 1.4.0 >Reporter: Josh Rosen >Assignee: Apache Spark > > In OutputCommitCoordinator, there is some logic to clear the authorized > committer's lock on committing in case it fails. However, it looks like the > current code also clears this lock if _other_ tasks fail, which is an obvious > bug: > https://github.com/apache/spark/blob/df3550084c9975f999ed370dd9f7c495181a68ba/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala#L118. > In theory, it's possible that this could allow a new committer to start, > run to completion, and commit output before the authorized committer > finished, but it's unlikely that this race occurs often in practice due to > the complex combination of failure and timing conditions that would be > required to expose it. Still, we should fix this issue. > This was discovered by [~adav] while reading the OutputCommitCoordinator code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
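The corrected bookkeeping described in the issue can be sketched as follows. This is a simplified, hypothetical Java model (not Spark's OutputCommitCoordinator): the coordinator remembers which task attempt was authorized to commit each partition, and clears that authorization only when that specific attempt fails — failures of other attempts leave the authorized committer untouched.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the fix: track the authorized attempt per
// partition and clear it ONLY when the authorized attempt itself
// fails. The buggy version cleared the authorization on ANY task
// failure for that partition, letting a second committer race in.
class CommitCoordinatorSketch {
    // partition -> attempt number currently authorized to commit
    private final Map<Integer, Long> authorized = new HashMap<>();

    synchronized boolean canCommit(int partition, long attempt) {
        Long current = authorized.get(partition);
        if (current == null) {
            authorized.put(partition, attempt); // first asker wins
            return true;
        }
        return current == attempt; // only the authorized attempt may commit
    }

    synchronized void taskFailed(int partition, long attempt) {
        Long current = authorized.get(partition);
        if (current != null && current == attempt) {
            authorized.remove(partition); // free the slot for a new attempt
        }
        // an unrelated attempt's failure changes nothing
    }
}
```

With this version, an unrelated speculative attempt failing after the authorized committer was chosen no longer opens the window in which two attempts could both believe they may commit.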
[jira] [Assigned] (SPARK-6614) OutputCommitCoordinator should clear authorized committers only after authorized committer fails, not after any failure
[ https://issues.apache.org/jira/browse/SPARK-6614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6614: --- Assignee: Josh Rosen (was: Apache Spark) > OutputCommitCoordinator should clear authorized committers only after > authorized committer fails, not after any failure > --- > > Key: SPARK-6614 > URL: https://issues.apache.org/jira/browse/SPARK-6614 > Project: Spark > Issue Type: Bug >Affects Versions: 1.3.0, 1.3.1, 1.4.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > In OutputCommitCoordinator, there is some logic to clear the authorized > committer's lock on committing in case it fails. However, it looks like the > current code also clears this lock if _other_ tasks fail, which is an obvious > bug: > https://github.com/apache/spark/blob/df3550084c9975f999ed370dd9f7c495181a68ba/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala#L118. > In theory, it's possible that this could allow a new committer to start, > run to completion, and commit output before the authorized committer > finished, but it's unlikely that this race occurs often in practice due to > the complex combination of failure and timing conditions that would be > required to expose it. Still, we should fix this issue. > This was discovered by [~adav] while reading the OutputCommitCoordinator code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6614) OutputCommitCoordinator should clear authorized committers only after authorized committer fails, not after any failure
[ https://issues.apache.org/jira/browse/SPARK-6614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387534#comment-14387534 ] Apache Spark commented on SPARK-6614: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/5276 > OutputCommitCoordinator should clear authorized committers only after > authorized committer fails, not after any failure > --- > > Key: SPARK-6614 > URL: https://issues.apache.org/jira/browse/SPARK-6614 > Project: Spark > Issue Type: Bug >Affects Versions: 1.3.0, 1.3.1, 1.4.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > In OutputCommitCoordinator, there is some logic to clear the authorized > committer's lock on committing in case it fails. However, it looks like the > current code also clears this lock if _other_ tasks fail, which is an obvious > bug: > https://github.com/apache/spark/blob/df3550084c9975f999ed370dd9f7c495181a68ba/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala#L118. > In theory, it's possible that this could allow a new committer to start, > run to completion, and commit output before the authorized committer > finished, but it's unlikely that this race occurs often in practice due to > the complex combination of failure and timing conditions that would be > required to expose it. Still, we should fix this issue. > This was discovered by [~adav] while reading the OutputCommitCoordinator code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format
[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387532#comment-14387532 ] Apache Spark commented on SPARK-2883: - User 'zhzhan' has created a pull request for this issue: https://github.com/apache/spark/pull/5275 > Spark Support for ORCFile format > > > Key: SPARK-2883 > URL: https://issues.apache.org/jira/browse/SPARK-2883 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Reporter: Zhan Zhang >Priority: Blocker > Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 > pm jobtracker.png, orc.diff > > > Verify the support of OrcInputFormat in Spark, fix issues if any exist, and > add documentation of its usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6614) OutputCommitCoordinator should clear authorized committers only after authorized committer fails, not after any failure
[ https://issues.apache.org/jira/browse/SPARK-6614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-6614: -- Affects Version/s: 1.4.0 1.3.1 > OutputCommitCoordinator should clear authorized committers only after > authorized committer fails, not after any failure > --- > > Key: SPARK-6614 > URL: https://issues.apache.org/jira/browse/SPARK-6614 > Project: Spark > Issue Type: Bug >Affects Versions: 1.3.0, 1.3.1, 1.4.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > In OutputCommitCoordinator, there is some logic to clear the authorized > committer's lock on committing in case it fails. However, it looks like the > current code also clears this lock if _other_ tasks fail, which is an obvious > bug: > https://github.com/apache/spark/blob/df3550084c9975f999ed370dd9f7c495181a68ba/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala#L118. > In theory, it's possible that this could allow a new committer to start, > run to completion, and commit output before the authorized committer > finished, but it's unlikely that this race occurs often in practice due to > the complex combination of failure and timing conditions that would be > required to expose it. Still, we should fix this issue. > This was discovered by [~adav] while reading the OutputCommitCoordinator code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org