[jira] [Commented] (SPARK-11728) Replace example code in ml-ensembles.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-11728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005334#comment-15005334 ] Apache Spark commented on SPARK-11728: -- User 'yinxusen' has created a pull request for this issue: https://github.com/apache/spark/pull/9716 > Replace example code in ml-ensembles.md using include_example > - > > Key: SPARK-11728 > URL: https://issues.apache.org/jira/browse/SPARK-11728 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11728) Replace example code in ml-ensembles.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-11728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11728: Assignee: (was: Apache Spark) > Replace example code in ml-ensembles.md using include_example > - > > Key: SPARK-11728 > URL: https://issues.apache.org/jira/browse/SPARK-11728 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin > Labels: starter
[jira] [Commented] (SPARK-10673) spark.sql.hive.verifyPartitionPath Attempts to Verify Unregistered Partitions
[ https://issues.apache.org/jira/browse/SPARK-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005249#comment-15005249 ] Xin Wu commented on SPARK-10673: if the default is false, {code} if (!sc.conf.verifyPartitionPath) { partitionToDeserializer } {code} will not get into the code path you mentioned. The problem is that when the property is set to true, it enters the code path that potentially evaluates all partitions of the table that match the pathPatternStr. The pathPatternStr is computed as "/pathToTable/*/*/.." depending on the number of partition columns. Essentially, the desired partition path is validated against all existing partition paths, including nested directories, which may be numerous. To avoid this potential performance issue, I think we may be able to simplify the code in the else block of the function verifyPartitionPath(). I am working on a fix. > spark.sql.hive.verifyPartitionPath Attempts to Verify Unregistered Partitions > - > > Key: SPARK-10673 > URL: https://issues.apache.org/jira/browse/SPARK-10673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0, 1.5.0 >Reporter: Miklos Christine >Priority: Minor > > In Spark 1.4, spark.sql.hive.verifyPartitionPath was set to true by default. > In Spark 1.5, it is now set to false by default. > If a table has a lot of partitions in the underlying filesystem, the code > unnecessarily checks for all the underlying directories when executing a > query. > https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L162 > Structure: > {code} > /user/hive/warehouse/table1/year=2015/month=01/ > /user/hive/warehouse/table1/year=2015/month=02/ > /user/hive/warehouse/table1/year=2015/month=03/ > ... > /user/hive/warehouse/table1/year=2014/month=01/ > /user/hive/warehouse/table1/year=2014/month=02/ > {code} > If the registered partitions only contain year=2015 when you run "show > partitions table1", this code path checks for all directories under the > table's root directory. This incurs a significant performance penalty if > there are a lot of partition directories.
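The performance difference described above can be sketched in plain Scala. This is a standalone illustration (not Spark's actual TableReader code): the partition names are made up, and the point is only that a "/pathToTable/*/*" glob visits every on-disk directory, while checking registered partitions directly visits only those.

```scala
// Registered partitions, as "show partitions table1" would report them.
val registered = Set("year=2015/month=01", "year=2015/month=02")

// All partition directories that exist on the filesystem under the table root.
val onDisk = Seq(
  "year=2015/month=01", "year=2015/month=02", "year=2015/month=03",
  "year=2014/month=01", "year=2014/month=02")

// Glob approach ("/pathToTable/*/*"): every on-disk directory matches the
// pattern, so all five are examined, including the three never registered.
val visitedByGlob = onDisk

// Direct approach: only check that each registered partition's path exists.
val visitedDirectly = onDisk.filter(registered.contains)
```

With many unregistered or nested directories, the gap between the two approaches grows with the size of the filesystem tree rather than with the number of registered partitions.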
[jira] [Commented] (SPARK-11553) row.getInt(i) if row[i]=null returns 0
[ https://issues.apache.org/jira/browse/SPARK-11553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005292#comment-15005292 ] Bartlomiej Alberski commented on SPARK-11553: - Thanks - good to know > row.getInt(i) if row[i]=null returns 0 > -- > > Key: SPARK-11553 > URL: https://issues.apache.org/jira/browse/SPARK-11553 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Tofigh >Priority: Minor > > row.getInt|Float|Double in SPARK RDD return 0 if row[index] is null. (Even > according to the document they should throw nullException error)
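The behavior reported here can be reproduced without Spark at all, because it falls out of how Scala unboxes a null through a primitive-typed cast. A minimal standalone sketch (plain Scala, not Spark's Row API):

```scala
// A row-like array holding a null where an Int is expected.
val values: Array[Any] = Array(null, 30, 19)

// Mirrors a primitive getter like row.getInt(i): a null unboxes to 0
// silently instead of throwing.
def getIntUnsafe(i: Int): Int = values(i).asInstanceOf[Int]

// A null-aware alternative: surface the missing value as None, forcing the
// caller to decide what a null means.
def getIntOption(i: Int): Option[Int] =
  Option(values(i)).map(_.asInstanceOf[Int])
```

Callers who need to distinguish null from a genuine 0 should check for null (in Spark, `row.isNullAt(i)`) before calling the primitive getter.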
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005355#comment-15005355 ] mustafa elbehery commented on SPARK-5226: - Hello, I would like to use DBSCAN on Spark. [~alitouka] I have tried to use your implementation on 500 MB of data. However, I think the *population of partition index* step is too expensive. Is this implementation going to be available soon? Regards. > Add DBSCAN Clustering Algorithm to MLlib > > > Key: SPARK-5226 > URL: https://issues.apache.org/jira/browse/SPARK-5226 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Muhammad-Ali A'rabi >Priority: Minor > Labels: DBSCAN, clustering > > MLlib is all k-means now, and I think we should add some new clustering > algorithms to it. The first candidate, I think, is DBSCAN.
[jira] [Commented] (SPARK-11337) Make example code in user guide testable
[ https://issues.apache.org/jira/browse/SPARK-11337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005338#comment-15005338 ] Xusen Yin commented on SPARK-11337: --- [~mengxr] So far, all docs of the ML and MLlib packages have been converted to include_example except for the DataTypes and BasicStatistics pages. As we discussed before, these two files depend on SPARK-11399. After we finish all the doc replacements, I think we need to sweep the example code again, since there are some trivial issues in some of the snippets. > Make example code in user guide testable > > > Key: SPARK-11337 > URL: https://issues.apache.org/jira/browse/SPARK-11337 > Project: Spark > Issue Type: Umbrella > Components: Documentation >Reporter: Xiangrui Meng >Assignee: Xusen Yin >Priority: Critical > > The example code in the user guide is embedded in the markdown and hence it > is not easy to test. It would be nice to automatically test them. This JIRA > is to discuss options to automate example code testing and see what we can do > in Spark 1.6. > One option I propose is to move actual example code to spark/examples and > test compilation in Jenkins builds. Then in the markdown, we can reference > part of the code to show in the user guide. This requires adding a Jekyll tag > that is similar to > https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, > e.g., called include_example. > {code} > {% include_example scala ml.KMeansExample guide %} > {code} > Jekyll will find > `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` > and pick code blocks marked "example" and put them under `{% highlight %}` in > the markdown. We can discuss the syntax for marker comments. > Sub-tasks are created to move example code from user guide to `examples/`.
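The proposal above hinges on marker comments inside the example files. A hypothetical sketch of what such a file could look like — the `$example on$` / `$example off$` marker syntax shown here is an assumption, since the issue says the syntax is still open for discussion:

```scala
// File (from the issue text):
// examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala
def kMeansExampleBody(): String = {
  // $example on$
  // Only the code between the markers is extracted by the include_example
  // Jekyll tag and rendered inside a {% highlight %} block in the user guide.
  "example body shown in the guide"
  // $example off$
}
```

Everything outside the markers (imports, boilerplate, test scaffolding) stays compilable and Jenkins-testable but never appears in the rendered guide.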
[jira] [Resolved] (SPARK-11573) correct 'reflective access of structural type member method should be enabled' Scala warnings
[ https://issues.apache.org/jira/browse/SPARK-11573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-11573. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9550 [https://github.com/apache/spark/pull/9550] > correct 'reflective access of structural type member method should be > enabled' Scala warnings > - > > Key: SPARK-11573 > URL: https://issues.apache.org/jira/browse/SPARK-11573 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Gabor Liptak >Priority: Minor > Fix For: 1.6.0
[jira] [Updated] (SPARK-11573) correct 'reflective access of structural type member method should be enabled' Scala warnings
[ https://issues.apache.org/jira/browse/SPARK-11573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11573: -- Assignee: Gabor Liptak Priority: Trivial (was: Minor) Description: was: > correct 'reflective access of structural type member method should be > enabled' Scala warnings > - > > Key: SPARK-11573 > URL: https://issues.apache.org/jira/browse/SPARK-11573 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Gabor Liptak >Assignee: Gabor Liptak >Priority: Trivial > Fix For: 1.6.0
[jira] [Commented] (SPARK-11553) row.getInt(i) if row[i]=null returns 0
[ https://issues.apache.org/jira/browse/SPARK-11553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005270#comment-15005270 ] Bartlomiej Alberski commented on SPARK-11553: - Please assign me to this issue, as I have already prepared a PR > row.getInt(i) if row[i]=null returns 0 > -- > > Key: SPARK-11553 > URL: https://issues.apache.org/jira/browse/SPARK-11553 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Tofigh >Priority: Minor > > row.getInt|Float|Double in SPARK RDD return 0 if row[index] is null. (Even > according to the document they should throw nullException error)
[jira] [Resolved] (SPARK-11694) Parquet logical types are not being tested properly
[ https://issues.apache.org/jira/browse/SPARK-11694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-11694. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9660 [https://github.com/apache/spark/pull/9660] > Parquet logical types are not being tested properly > --- > > Key: SPARK-11694 > URL: https://issues.apache.org/jira/browse/SPARK-11694 > Project: Spark > Issue Type: Test > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 1.6.0 > > > All the physical types are properly tested at {{ParquetIOSuite}} but logical > type mapping is not being tested.
[jira] [Updated] (SPARK-11694) Parquet logical types are not being tested properly
[ https://issues.apache.org/jira/browse/SPARK-11694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11694: --- Assignee: Hyukjin Kwon > Parquet logical types are not being tested properly > --- > > Key: SPARK-11694 > URL: https://issues.apache.org/jira/browse/SPARK-11694 > Project: Spark > Issue Type: Test > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 1.6.0 > > > All the physical types are properly tested at {{ParquetIOSuite}} but logical > type mapping is not being tested.
[jira] [Comment Edited] (SPARK-10673) spark.sql.hive.verifyPartitionPath Attempts to Verify Unregistered Partitions
[ https://issues.apache.org/jira/browse/SPARK-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005249#comment-15005249 ] Xin Wu edited comment on SPARK-10673 at 11/14/15 8:19 AM: -- if the default is false, {code} if (!sc.conf.verifyPartitionPath) { partitionToDeserializer } {code} will not get into the code path you mentioned. The problem is that when the property is set to true, it enters the code path that potentially evaluates all partitions of the table that match the pathPatternStr. The pathPatternStr is computed as "/pathToTable/\*/\*/.." depending on the number of partition columns. Essentially, the desired partition path is validated against all existing partition paths, including nested directories, which may be numerous. To avoid this potential performance issue, I think we may be able to simplify the code in the else block of the function verifyPartitionPath(). I am working on a fix. > spark.sql.hive.verifyPartitionPath Attempts to Verify Unregistered Partitions > - > > Key: SPARK-10673 > URL: https://issues.apache.org/jira/browse/SPARK-10673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0, 1.5.0 >Reporter: Miklos Christine >Priority: Minor > > In Spark 1.4, spark.sql.hive.verifyPartitionPath was set to true by default. > In Spark 1.5, it is now set to false by default. > If a table has a lot of partitions in the underlying filesystem, the code > unnecessarily checks for all the underlying directories when executing a > query. > https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L162 > Structure: > {code} > /user/hive/warehouse/table1/year=2015/month=01/ > /user/hive/warehouse/table1/year=2015/month=02/ > /user/hive/warehouse/table1/year=2015/month=03/ > ... > /user/hive/warehouse/table1/year=2014/month=01/ > /user/hive/warehouse/table1/year=2014/month=02/ > {code} > If the registered partitions only contain year=2015 when you run "show > partitions table1", this code path checks for all directories under the > table's root directory. This incurs a significant performance penalty if > there are a lot of partition directories.
[jira] [Reopened] (SPARK-11721) The programming guide for Spark SQL in Spark 1.3.0 needs additional imports to work
[ https://issues.apache.org/jira/browse/SPARK-11721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-11721: --- > The programming guide for Spark SQL in Spark 1.3.0 needs additional imports > to work > --- > > Key: SPARK-11721 > URL: https://issues.apache.org/jira/browse/SPARK-11721 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Affects Versions: 1.3.0 >Reporter: Neelesh Srinivas Salian >Priority: Trivial > Fix For: 1.3.0 > > > The documentation in > http://spark.apache.org/docs/1.3.0/sql-programming-guide.html in the > Programmatically Specifying the Schema section needs to add a couple more > imports to get the example to run: > import statements for Row and sql.types.
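For concreteness, the two import statements the issue refers to would be (Spark 1.3 API, where the types package lives at org.apache.spark.sql.types; shown as a fragment, since compiling it requires the Spark jars on the classpath):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
```

Without these, the "Programmatically Specifying the Schema" example fails to compile because Row, StructType, and StructField are unresolved.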
[jira] [Commented] (SPARK-11553) row.getInt(i) if row[i]=null returns 0
[ https://issues.apache.org/jira/browse/SPARK-11553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005284#comment-15005284 ] Sean Owen commented on SPARK-11553: --- That's clear already. We normally assign after it's fixed. > row.getInt(i) if row[i]=null returns 0 > -- > > Key: SPARK-11553 > URL: https://issues.apache.org/jira/browse/SPARK-11553 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Tofigh >Priority: Minor > > row.getInt|Float|Double in SPARK RDD return 0 if row[index] is null. (Even > according to the document they should throw nullException error)
[jira] [Assigned] (SPARK-11728) Replace example code in ml-ensembles.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-11728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11728: Assignee: Apache Spark > Replace example code in ml-ensembles.md using include_example > - > > Key: SPARK-11728 > URL: https://issues.apache.org/jira/browse/SPARK-11728 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Apache Spark > Labels: starter
[jira] [Commented] (SPARK-11672) Flaky test: ml.JavaDefaultReadWriteSuite
[ https://issues.apache.org/jira/browse/SPARK-11672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005782#comment-15005782 ] Apache Spark commented on SPARK-11672: -- User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/9719 > Flaky test: ml.JavaDefaultReadWriteSuite > > > Key: SPARK-11672 > URL: https://issues.apache.org/jira/browse/SPARK-11672 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > Fix For: 1.6.0 > > > Saw several failures on Jenkins, e.g., > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2040/testReport/org.apache.spark.ml.util/JavaDefaultReadWriteSuite/testDefaultReadWrite/
[jira] [Updated] (SPARK-11669) Python interface to SparkR GLM module
[ https://issues.apache.org/jira/browse/SPARK-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11669: -- Target Version/s: (was: 1.5.0, 1.5.1) [~shubhanshumis...@gmail.com] it doesn't make sense to target released versions. If someone can explain this to me, feel free to reopen, but it sounds like you're requesting Python APIs to R. > Python interface to SparkR GLM module > - > > Key: SPARK-11669 > URL: https://issues.apache.org/jira/browse/SPARK-11669 > Project: Spark > Issue Type: Improvement > Components: PySpark, SparkR >Affects Versions: 1.5.0, 1.5.1 > Environment: LINUX > MAC > WINDOWS >Reporter: Shubhanshu Mishra >Priority: Minor > Labels: GLM, pyspark, sparkR, statistics > > There should be a python interface to the sparkR GLM module. Currently the > only Python library that produces R-style GLM results is statsmodels. > Inspiration for the API can be taken from the following page. > http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/formulas.html
[jira] [Updated] (SPARK-7799) Move "StreamingContext.actorStream" to a separate project and deprecate it in StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-7799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7799: - Target Version/s: (was: 1.6.0) > Move "StreamingContext.actorStream" to a separate project and deprecate it in > StreamingContext > -- > > Key: SPARK-7799 > URL: https://issues.apache.org/jira/browse/SPARK-7799 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Shixiong Zhu > > Move {{StreamingContext.actorStream}} to a separate project and deprecate it > in {{StreamingContext}}
[jira] [Updated] (SPARK-7441) Implement microbatch functionality so that Spark Streaming can process a large backlog of existing files discovered in batch in smaller batches
[ https://issues.apache.org/jira/browse/SPARK-7441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7441: - Target Version/s: (was: 1.6.0) > Implement microbatch functionality so that Spark Streaming can process a > large backlog of existing files discovered in batch in smaller batches > --- > > Key: SPARK-7441 > URL: https://issues.apache.org/jira/browse/SPARK-7441 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Emre Sevinç > Labels: performance > > Implement microbatch functionality so that Spark Streaming can process a huge > backlog of existing files discovered in batch in smaller batches. > Spark Streaming can process already existing files in a directory, and > depending on the value of "{{spark.streaming.minRememberDuration}}" (60 > seconds by default, see SPARK-3276 for more details), this might mean that a > Spark Streaming application receives thousands, or hundreds of thousands, of > files within the first batch interval. This, in turn, leads to a kind of > 'flooding' effect for the streaming application, which has to deal with a > huge number of existing files in a single batch interval. > We propose a very simple change to > {{org.apache.spark.streaming.dstream.FileInputDStream}} so that, based on a > configuration property such as "{{spark.streaming.microbatch.size}}", it will > either keep its default behavior when {{spark.streaming.microbatch.size}} > has the default value of {{0}} (meaning process as many files as have been > discovered as new in the current batch interval), or process new files in > groups of {{spark.streaming.microbatch.size}} (e.g. in groups of 100). > We have tested this patch at one of our customers, and it has been running > successfully for weeks (e.g. there were cases where our Spark Streaming > application was stopped, and in the meantime tens of thousands of files were > created in a directory, and our Spark Streaming application had to process > those existing files after it was started).
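As a configuration sketch, enabling the proposed behavior could look like the fragment below. Note that {{spark.streaming.microbatch.size}} is the property name proposed in this issue, not a released Spark setting:

```
# spark-defaults.conf fragment (hypothetical, per the proposal above):
# 0 = default behavior (process all newly discovered files at once);
# any positive value = process the backlog in groups of that size.
spark.streaming.microbatch.size  100
```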
[jira] [Updated] (SPARK-6227) PCA and SVD for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6227: - Target Version/s: (was: 1.6.0) > PCA and SVD for PySpark > --- > > Key: SPARK-6227 > URL: https://issues.apache.org/jira/browse/SPARK-6227 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Affects Versions: 1.2.1 >Reporter: Julien Amelot >Assignee: Manoj Kumar > > The Dimensionality Reduction techniques are not available via Python (Scala + > Java only). > * Principal component analysis (PCA) > * Singular value decomposition (SVD) > Doc: > http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html
[jira] [Commented] (SPARK-6280) Remove Akka systemName from Spark
[ https://issues.apache.org/jira/browse/SPARK-6280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005396#comment-15005396 ] Sean Owen commented on SPARK-6280: -- Are this and the other Akka-related items targeted for 1.6 actually going in? The parent targets 2+. [~zsxwing] > Remove Akka systemName from Spark > - > > Key: SPARK-6280 > URL: https://issues.apache.org/jira/browse/SPARK-6280 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Shixiong Zhu > > `systemName` is an Akka concept. An RPC implementation does not need to support > it. > We can hard code the system name in Spark and hide it in the internal Akka > RPC implementation.
[jira] [Commented] (SPARK-11725) Let UDF to handle null value
[ https://issues.apache.org/jira/browse/SPARK-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005399#comment-15005399 ] Jeff Zhang commented on SPARK-11725: I am on master > Let UDF to handle null value > > > Key: SPARK-11725 > URL: https://issues.apache.org/jira/browse/SPARK-11725 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jeff Zhang > > I notice that currently spark will take the long field as -1 if it is null. > Here's the sample code. > {code} > sqlContext.udf.register("f", (x:Int)=>x+1) > df.withColumn("age2", expr("f(age)")).show() > Output /// > +----+-------+----+ > | age|   name|age2| > +----+-------+----+ > |null|Michael|   0| > |  30|   Andy|  31| > |  19| Justin|  20| > +----+-------+----+ > {code} > I think for the null value we have 3 options > * Use a special value to represent it (what spark does now) > * Always return null if the udf input has a null value argument > * Let the udf itself handle null > I would prefer the third option
[jira] [Commented] (SPARK-11725) Let UDF to handle null value
[ https://issues.apache.org/jira/browse/SPARK-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005433#comment-15005433 ] Herman van Hovell commented on SPARK-11725: --- I can reproduce the {{-1}} default values on master. This is not the expected behavior. [~marmbrus]/[~rxin] Any idea what causes this? > Let UDF to handle null value > > > Key: SPARK-11725 > URL: https://issues.apache.org/jira/browse/SPARK-11725 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jeff Zhang > > I notice that currently spark will take the long field as -1 if it is null. > Here's the sample code. > {code} > sqlContext.udf.register("f", (x:Int)=>x+1) > df.withColumn("age2", expr("f(age)")).show() > Output /// > +----+-------+----+ > | age|   name|age2| > +----+-------+----+ > |null|Michael|   0| > |  30|   Andy|  31| > |  19| Justin|  20| > +----+-------+----+ > {code} > I think for the null value we have 3 options > * Use a special value to represent it (what spark does now) > * Always return null if the udf input has a null value argument > * Let the udf itself handle null > I would prefer the third option
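The third option discussed above (let the UDF itself handle null) can be sketched in plain Scala by writing the function against the boxed type instead of the primitive. This is a standalone illustration, not Spark's udf API; in Spark the same function body would be the argument to sqlContext.udf.register:

```scala
// Taking java.lang.Integer instead of Int means a null input is visible to
// the function, which can then decide to propagate null rather than have the
// engine substitute a default primitive value.
val f: java.lang.Integer => java.lang.Integer =
  x => if (x == null) null else java.lang.Integer.valueOf(x + 1)
```

With the primitive signature `(x: Int) => x + 1`, the null never reaches the function at all, which is why a default value appears in the output column.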
[jira] [Updated] (SPARK-11720) Return Double.NaN instead of null for Mean and Average when count = 0
[ https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11720: -- Component/s: SQL > Return Double.NaN instead of null for Mean and Average when count = 0 > - > > Key: SPARK-11720 > URL: https://issues.apache.org/jira/browse/SPARK-11720 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Jihong MA >Priority: Minor > > Change the default behavior of mean in case of count = 0 from null to > Double.NaN, to make it in line with all other univariate stats functions.
[jira] [Resolved] (SPARK-11669) Python interface to SparkR GLM module
[ https://issues.apache.org/jira/browse/SPARK-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-11669. --- Resolution: Not A Problem > Python interface to SparkR GLM module > - > > Key: SPARK-11669 > URL: https://issues.apache.org/jira/browse/SPARK-11669 > Project: Spark > Issue Type: Improvement > Components: PySpark, SparkR >Affects Versions: 1.5.0, 1.5.1 > Environment: LINUX > MAC > WINDOWS >Reporter: Shubhanshu Mishra >Priority: Minor > Labels: GLM, pyspark, sparkR, statistics > > There should be a python interface to the sparkR GLM module. Currently the > only python library which creates R style GLM module results in statsmodels. > Inspiration for the API can be taken from the following page. > http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/formulas.html
[jira] [Updated] (SPARK-11702) Guava ClassLoading Issue When Using Different Hive Metastore Version
[ https://issues.apache.org/jira/browse/SPARK-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11702: -- Component/s: Spark Core Got it, makes more sense now. > Guava ClassLoading Issue When Using Different Hive Metastore Version > > > Key: SPARK-11702 > URL: https://issues.apache.org/jira/browse/SPARK-11702 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Joey Paskhay > > A Guava classloading error can occur when using a different version of the > Hive metastore. > Running the latest version of Spark at this time (1.5.1) and patched versions > of Hadoop 2.2.0 and Hive 1.0.0. We set "spark.sql.hive.metastore.version" to > "1.0.0" and "spark.sql.hive.metastore.jars" to > "/lib/*:". When trying to > launch the spark-shell, the sqlContext would fail to initialize with: > {code} > java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: > com/google/common/base/Predicate when creating Hive client using classpath: > > Please make sure that jars for your version of hive and hadoop are included > in the paths passed to SQLConfEntry(key = spark.sql.hive.metastore.jars, > defaultValue=builtin, doc=... > {code} > We verified the Guava libraries are in the huge list of the included jars, > but we saw that in the > org.apache.spark.sql.hive.client.IsolatedClientLoader.isSharedClass method it > seems to assume that *all* "com.google" (excluding "com.google.cloud") > classes should be loaded from the base class loader. The Spark libraries seem > to have *some* "com.google.common.base" classes shaded in but not all. > See > [https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCAB51Vx4ipV34e=eishlg7bzldm0uefd_mpyqfe4dodbnbv9...@mail.gmail.com%3E] > and its replies. > The work-around is to add the guava JAR to the "spark.driver.extraClassPath" > property. 
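The work-around described above could be expressed as a spark-defaults.conf fragment like the one below. The jar and library paths are hypothetical placeholders for this sketch (the original report truncated them) and must be adapted to your environment:

```
# Point the Hive client at the external metastore version and jars,
# and put Guava on the driver classpath so IsolatedClientLoader can
# resolve com.google.common.base.Predicate (paths are examples only).
spark.sql.hive.metastore.version  1.0.0
spark.sql.hive.metastore.jars     /opt/hive/lib/*:/opt/hadoop/lib/*
spark.driver.extraClassPath       /opt/jars/guava-14.0.1.jar
```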
[jira] [Updated] (SPARK-10530) Kill other task attempts when one task attempt belonging to the same task has succeeded in speculation
[ https://issues.apache.org/jira/browse/SPARK-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10530: -- Target Version/s: (was: 1.6.0) Priority: Minor (was: Major) > Kill other task attempts when one task attempt belonging to the same task > has succeeded in speculation > - > > Key: SPARK-10530 > URL: https://issues.apache.org/jira/browse/SPARK-10530 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core >Reporter: Jeff Zhang >Priority: Minor > > Currently when speculation is enabled, other task attempts are not killed > once one task attempt in the same task has succeeded. This is not resource > efficient; it would be better to kill the remaining task attempts as soon as > one attempt for the same task succeeds.
[jira] [Resolved] (SPARK-10081) Skip re-computing getMissingParentStages in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-10081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10081. --- Resolution: Won't Fix Target Version/s: (was: 1.6.0) > Skip re-computing getMissingParentStages in DAGScheduler > > > Key: SPARK-10081 > URL: https://issues.apache.org/jira/browse/SPARK-10081 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Liang-Chi Hsieh > > In DAGScheduler, we can skip re-computing getMissingParentStages when calling > submitStage in handleJobSubmitted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10526) Display cores/memory on ExecutorsTab
[ https://issues.apache.org/jira/browse/SPARK-10526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10526. --- Resolution: Won't Fix Target Version/s: (was: 1.6.0) > Display cores/memory on ExecutorsTab > > > Key: SPARK-10526 > URL: https://issues.apache.org/jira/browse/SPARK-10526 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Jeff Zhang >Priority: Minor > > It would be nice to display the resource of each executor on web ui. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11725) Let UDF to handle null value
[ https://issues.apache.org/jira/browse/SPARK-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005391#comment-15005391 ] Herman van Hovell commented on SPARK-11725: --- I'd rather add a warning than prevent this from happening. I cannot reproduce the {{-1}} default values on Spark 1.5.2. For example: {noformat} val id = udf((x: Int) => { x }) val q = sqlContext .range(1 << 10) .select($"id", when(($"id" mod 2) === 1, $"id").as("val1")) .select($"id", $"val1", id($"val1").as("val2")) q.show // Result: id: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(,IntegerType,List(IntegerType)) q: org.apache.spark.sql.DataFrame = [id: bigint, val1: bigint, val2: int] +---+++ | id|val1|val2| +---+++ | 0|null| 0| | 1| 1| 1| | 2|null| 0| | 3| 3| 3| | 4|null| 0| | 5| 5| 5| | 6|null| 0| | 7| 7| 7| | 8|null| 0| | 9| 9| 9| | 10|null| 0| | 11| 11| 11| | 12|null| 0| | 13| 13| 13| | 14|null| 0| | 15| 15| 15| | 16|null| 0| | 17| 17| 17| | 18|null| 0| | 19| 19| 19| +---+++ only showing top 20 rows {noformat} What version of Spark are you using? > Let UDF to handle null value > > > Key: SPARK-11725 > URL: https://issues.apache.org/jira/browse/SPARK-11725 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jeff Zhang > > I notice that currently spark will take the long field as -1 if it is null. > Here's the sample code. 
> {code} > sqlContext.udf.register("f", (x:Int)=>x+1) > df.withColumn("age2", expr("f(age)")).show() > Output /// > +----+-------+----+ > | age| name|age2| > +----+-------+----+ > |null|Michael| 0| > | 30| Andy| 31| > | 19| Justin| 20| > +----+-------+----+ > {code} > I think for the null value we have 3 options > * Use a special value to represent it (what spark does now) > * Always return null if any udf input argument is null > * Let the udf itself handle null > I would prefer the third option
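The 0 for the null row above is the JVM primitive default leaking through. A minimal plain-Scala sketch of the mechanism (no Spark needed), plus one way a UDF could opt into handling null itself; the Option-based signature is illustrative, not a documented Spark contract:

```scala
// Why an (x: Int) => ... UDF can see 0 for a SQL NULL: unboxing null
// into a primitive Int yields the type's default value, not an error.
val boxed: Any = null
val unboxed = boxed.asInstanceOf[Int]
println(unboxed) // 0

// Illustrative "let the udf handle null" style: accept a boxed
// java.lang.Integer and wrap it in Option, so null maps to None
// instead of a silent default value.
val addOne: java.lang.Integer => Option[Int] = x => Option(x).map(_ + 1)
println(addOne(null)) // None
println(addOne(30))   // Some(31)
```

Registering the boxed/Option variant would make the null visible to the UDF body rather than pre-substituted with a default.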
[jira] [Updated] (SPARK-11727) split ExpressionEncoder into FlatEncoder and ProductEncoder
[ https://issues.apache.org/jira/browse/SPARK-11727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11727: -- Assignee: Wenchen Fan > split ExpressionEncoder into FlatEncoder and ProductEncoder > --- > > Key: SPARK-11727 > URL: https://issues.apache.org/jira/browse/SPARK-11727 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11732) MiMa excludes miss private classes
[ https://issues.apache.org/jira/browse/SPARK-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11732: -- Labels: (was: newbie) Fix Version/s: (was: 1.6.0) [~thunterdb] don't set Fix version unless it's fixed > MiMa excludes miss private classes > -- > > Key: SPARK-11732 > URL: https://issues.apache.org/jira/browse/SPARK-11732 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.1 >Reporter: Tim Hunter > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > The checks in GenerateMIMAIgnore only check for package private classes, not > private classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9516) Improve Thread Dump page
[ https://issues.apache.org/jira/browse/SPARK-9516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9516: - Priority: Minor (was: Major) > Improve Thread Dump page > > > Key: SPARK-9516 > URL: https://issues.apache.org/jira/browse/SPARK-9516 > Project: Spark > Issue Type: New Feature > Components: Web UI >Reporter: Nan Zhu >Assignee: Nan Zhu >Priority: Minor > > Originally proposed by [~irashid] in > https://github.com/apache/spark/pull/7808#issuecomment-126788335: > we can enhance the current thread dump page with at least the following two > new features: > 1) sort threads by thread status, > 2) a filter to grep the threads -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10250) Scala PairRDDFunctions.groupByKey() should be fault-tolerant of single large groups
[ https://issues.apache.org/jira/browse/SPARK-10250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10250: -- Target Version/s: (was: 1.6.0) > Scala PairRDDFunctions.groupByKey() should be fault-tolerant of single large > groups > --- > > Key: SPARK-10250 > URL: https://issues.apache.org/jira/browse/SPARK-10250 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: Matt Cheah >Priority: Minor > > PairRDDFunctions.groupByKey() is less robust than Python's equivalent, as > PySpark's groupByKey can spill single large groups to disk. We should bring > the Scala implementation up to parity.
[jira] [Updated] (SPARK-9516) Improve Thread Dump page
[ https://issues.apache.org/jira/browse/SPARK-9516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9516: - Target Version/s: (was: 1.6.0) > Improve Thread Dump page > > > Key: SPARK-9516 > URL: https://issues.apache.org/jira/browse/SPARK-9516 > Project: Spark > Issue Type: New Feature > Components: Web UI >Reporter: Nan Zhu >Assignee: Nan Zhu > > Originally proposed by [~irashid] in > https://github.com/apache/spark/pull/7808#issuecomment-126788335: > we can enhance the current thread dump page with at least the following two > new features: > 1) sort threads by thread status, > 2) a filter to grep the threads -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10062) Use tut for typechecking and running code in user guides
[ https://issues.apache.org/jira/browse/SPARK-10062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10062: -- Target Version/s: (was: 1.6.0) > Use tut for typechecking and running code in user guides > > > Key: SPARK-10062 > URL: https://issues.apache.org/jira/browse/SPARK-10062 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Feynman Liang > > The current process for contributing to the user guide requires > authors/reviewers to manually run any added example code. > We can automate this process by integrating > [tut|https://github.com/tpolecat/tut] into user guide documentation > generation. Tut runs code enclosed inside "```tut ... ```" blocks, providing > typechecking, ensuring that the example code we provide runs, and displaying > the output. > An example project using tut is > [cats|http://non.github.io/cats//typeclasses.html].
[jira] [Commented] (SPARK-9844) File appender race condition during SparkWorker shutdown
[ https://issues.apache.org/jira/browse/SPARK-9844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005504#comment-15005504 ] Jason Huang commented on SPARK-9844: Got the same error log in workers and my workers keep being disassociated. 15/11/15 01:25:26 INFO worker.Worker: Asked to kill executor app-20151115012248-0081/2 15/11/15 01:25:26 INFO worker.ExecutorRunner: Runner thread for executor app-20151115012248-0081/2 interrupted 15/11/15 01:25:26 INFO worker.ExecutorRunner: Killing process! 15/11/15 01:25:26 ERROR logging.FileAppender: Error writing stream to file /usr/local/spark-1.5.1-bin-hadoop2.6/work/app-20151115012248-0081/2/stderr java.io.IOException: Stream closed at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162) at java.io.BufferedInputStream.read1(BufferedInputStream.java:272) at java.io.BufferedInputStream.read(BufferedInputStream.java:334) at java.io.FilterInputStream.read(FilterInputStream.java:107) at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) 15/11/15 01:25:26 INFO worker.Worker: Executor app-20151115012248-0081/2 finished with state KILLED exitStatus 143 15/11/15 01:25:26 INFO worker.Worker: Cleaning up local directories for application app-20151115012248-0081 15/11/15 01:25:26 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.1.2.1:46780] has failed, address is now gated for [5000] ms. 
Reason: [Disassociated] 15/11/15 01:25:26 INFO shuffle.ExternalShuffleBlockResolver: Application app-20151115012248-0081 removed, cleanupLocalDirs = true We use python3 to run our Spark jobs #!/usr/bin/python3 import os import sys SPARK_HOME = "/usr/local/spark" os.environ["SPARK_HOME"] = SPARK_HOME os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-7-oracle" os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3" sys.path.append(os.path.join(SPARK_HOME, 'python')) sys.path.append(os.path.join(SPARK_HOME, 'python/lib/py4j-0.8.2.1-src.zip')) from pyspark import SparkContext, SparkConf conf = (SparkConf().setMaster("spark://10.1.2.1:7077") .setAppName("Generate") .setAll(( ("spark.cores.max", "1"), ("spark.driver.memory", "1g"), ("spark.executor.memory", "1g"), ("spark.python.worker.memory", "1g" > File appender race condition during SparkWorker shutdown > > > Key: SPARK-9844 > URL: https://issues.apache.org/jira/browse/SPARK-9844 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0, 1.4.0 >Reporter: Alex Liu > > We find this issue still exists in 1.3.1 > {code} > ERROR [Thread-6] 2015-07-28 22:49:57,653 SparkWorker-0 ExternalLogger.java:96 > - Error writing stream to file > /var/lib/spark/worker/worker-0/app-20150728224954-0003/0/stderr > ERROR [Thread-6] 2015-07-28 22:49:57,653 SparkWorker-0 ExternalLogger.java:96 > - java.io.IOException: Stream closed > ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 > - at > java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) > ~[na:1.8.0_40] > ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 > - at java.io.BufferedInputStream.read1(BufferedInputStream.java:283) > ~[na:1.8.0_40] > ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 > - at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > ~[na:1.8.0_40] > ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 > - at 
java.io.FilterInputStream.read(FilterInputStream.java:107) > ~[na:1.8.0_40] > ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) > ~[spark-core_2.10-1.3.1.1.jar:1.3.1.1] > ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) > [spark-core_2.10-1.3.1.1.jar:1.3.1.1] > ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 > - at >
[jira] [Commented] (SPARK-10759) Missing Python code example in ML Programming guide
[ https://issues.apache.org/jira/browse/SPARK-10759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005522#comment-15005522 ] Nathan Davis commented on SPARK-10759: -- [~lmoos], is this in progress? I can take it > Missing Python code example in ML Programming guide > --- > > Key: SPARK-10759 > URL: https://issues.apache.org/jira/browse/SPARK-10759 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.5.0 >Reporter: Raela Wang >Assignee: Lauren Moos >Priority: Minor > Labels: starter > > http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation > http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-train-validation-split -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9928) LogicalLocalTable in ExistingRDD.scala is not referenced by any other code
[ https://issues.apache.org/jira/browse/SPARK-9928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9928: --- Assignee: Apache Spark > LogicalLocalTable in ExistingRDD.scala is not referenced by any other code > - > > Key: SPARK-9928 > URL: https://issues.apache.org/jira/browse/SPARK-9928 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1 >Reporter: Gen TANG >Assignee: Apache Spark >Priority: Trivial > Labels: sparksql > > The case class > [LogicalLocalTable|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala#L118] > in > [ExistingRDD.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala] > is not referenced anywhere else in the source code. It might be dead > code
[jira] [Commented] (SPARK-9928) LogicalLocalTable in ExistingRDD.scala is not referenced by any other code
[ https://issues.apache.org/jira/browse/SPARK-9928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005525#comment-15005525 ] Apache Spark commented on SPARK-9928: - User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/9717 > LogicalLocalTable in ExistingRDD.scala is not referenced by any other code > - > > Key: SPARK-9928 > URL: https://issues.apache.org/jira/browse/SPARK-9928 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1 >Reporter: Gen TANG >Priority: Trivial > Labels: sparksql > > The case class > [LogicalLocalTable|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala#L118] > in > [ExistingRDD.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala] > is not referenced anywhere else in the source code. It might be dead > code
[jira] [Assigned] (SPARK-9928) LogicalLocalTable in ExistingRDD.scala is not referenced by any other code
[ https://issues.apache.org/jira/browse/SPARK-9928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9928: --- Assignee: (was: Apache Spark) > LogicalLocalTable in ExistingRDD.scala is not referenced by any other code > - > > Key: SPARK-9928 > URL: https://issues.apache.org/jira/browse/SPARK-9928 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1 >Reporter: Gen TANG >Priority: Trivial > Labels: sparksql > > The case class > [LogicalLocalTable|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala#L118] > in > [ExistingRDD.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala] > is not referenced anywhere else in the source code. It might be dead > code
[jira] [Commented] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns
[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005542#comment-15005542 ] Mark Hamstra commented on SPARK-11153: -- Thanks. > Turns off Parquet filter push-down for string and binary columns > > > Key: SPARK-11153 > URL: https://issues.apache.org/jira/browse/SPARK-11153 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Blocker > Fix For: 1.5.2, 1.6.0 > > > Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be > written with corrupted statistics information. This information is used by > filter push-down optimization. Since Spark 1.5 turns on Parquet filter > push-down by default, we may end up with wrong query results. PARQUET-251 has > been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0. > Note that this kind of corrupted Parquet files could be produced by any > Parquet data models. > This affects all Spark SQL data types that can be mapped to Parquet > {{BINARY}}, namely: > - {{StringType}} > - {{BinaryType}} > - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} > columns for now.) > To avoid wrong query results, we should disable filter push-down for columns > of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
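On the user side, the same mitigation can be applied per session while still on parquet-mr 1.7.0. This is a config fragment, not a standalone program; it assumes a live SQLContext in a Spark 1.5.x application, and the key below is the 1.5.x configuration name:

```scala
// Session-level stop-gap sketch: disable Parquet filter push-down
// entirely until a parquet-mr release >= 1.8.1 (with the PARQUET-251
// fix) is in use. Assumes an existing Spark 1.5.x sqlContext.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
```

The fix tracked here is narrower: it keeps push-down on but skips it for {{StringType}} and {{BinaryType}} columns, which are the ones mapped to Parquet {{BINARY}}.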
[jira] [Comment Edited] (SPARK-9844) File appender race condition during SparkWorker shutdown
[ https://issues.apache.org/jira/browse/SPARK-9844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005504#comment-15005504 ] Jason Huang edited comment on SPARK-9844 at 11/14/15 5:38 PM: -- Got the same error log in workers and my workers keep being disassociated. {code:java} 15/11/15 01:25:26 INFO worker.Worker: Asked to kill executor app-20151115012248-0081/2 15/11/15 01:25:26 INFO worker.ExecutorRunner: Runner thread for executor app-20151115012248-0081/2 interrupted 15/11/15 01:25:26 INFO worker.ExecutorRunner: Killing process! 15/11/15 01:25:26 ERROR logging.FileAppender: Error writing stream to file /usr/local/spark-1.5.1-bin-hadoop2.6/work/app-20151115012248-0081/2/stderr java.io.IOException: Stream closed at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162) at java.io.BufferedInputStream.read1(BufferedInputStream.java:272) at java.io.BufferedInputStream.read(BufferedInputStream.java:334) at java.io.FilterInputStream.read(FilterInputStream.java:107) at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) 15/11/15 01:25:26 INFO worker.Worker: Executor app-20151115012248-0081/2 finished with state KILLED exitStatus 143 15/11/15 01:25:26 INFO worker.Worker: Cleaning up local directories for application app-20151115012248-0081 15/11/15 01:25:26 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.1.2.1:46780] has failed, address is now gated for [5000] ms. 
Reason: [Disassociated] 15/11/15 01:25:26 INFO shuffle.ExternalShuffleBlockResolver: Application app-20151115012248-0081 removed, cleanupLocalDirs = true {code} We use python3 to run our Spark jobs {code:java} #!/usr/bin/python3 import os import sys SPARK_HOME = "/usr/local/spark" os.environ["SPARK_HOME"] = SPARK_HOME os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-7-oracle" os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3" sys.path.append(os.path.join(SPARK_HOME, 'python')) sys.path.append(os.path.join(SPARK_HOME, 'python/lib/py4j-0.8.2.1-src.zip')) from pyspark import SparkContext, SparkConf conf = (SparkConf().setMaster("spark://10.1.2.1:7077") .setAppName("Generate") .setAll(( ("spark.cores.max", "1"), ("spark.driver.memory", "1g"), ("spark.executor.memory", "1g"), ("spark.python.worker.memory", "1g" {code} was (Author: jasson15): Got the same error log in workers and my workers keep being disassociated. 15/11/15 01:25:26 INFO worker.Worker: Asked to kill executor app-20151115012248-0081/2 15/11/15 01:25:26 INFO worker.ExecutorRunner: Runner thread for executor app-20151115012248-0081/2 interrupted 15/11/15 01:25:26 INFO worker.ExecutorRunner: Killing process! 
15/11/15 01:25:26 ERROR logging.FileAppender: Error writing stream to file /usr/local/spark-1.5.1-bin-hadoop2.6/work/app-20151115012248-0081/2/stderr java.io.IOException: Stream closed at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162) at java.io.BufferedInputStream.read1(BufferedInputStream.java:272) at java.io.BufferedInputStream.read(BufferedInputStream.java:334) at java.io.FilterInputStream.read(FilterInputStream.java:107) at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) 15/11/15 01:25:26 INFO worker.Worker: Executor app-20151115012248-0081/2 finished with state KILLED exitStatus 143 15/11/15 01:25:26 INFO worker.Worker: Cleaning up local directories for application app-20151115012248-0081 15/11/15 01:25:26 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.1.2.1:46780] has failed, address is now gated for [5000] ms. Reason: [Disassociated] 15/11/15 01:25:26 INFO shuffle.ExternalShuffleBlockResolver: Application app-20151115012248-0081
[jira] [Commented] (SPARK-11725) Let UDF to handle null value
[ https://issues.apache.org/jira/browse/SPARK-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005509#comment-15005509 ] Reynold Xin commented on SPARK-11725: - This is the problem of default value in codegen I suspect. https://github.com/apache/spark/blob/22e96b87fb0a0eb4f2f1a8fc29a742ceabff952a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L229 > Let UDF to handle null value > > > Key: SPARK-11725 > URL: https://issues.apache.org/jira/browse/SPARK-11725 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jeff Zhang > > I notice that currently spark will take the long field as -1 if it is null. > Here's the sample code. > {code} > sqlContext.udf.register("f", (x:Int)=>x+1) > df.withColumn("age2", expr("f(age)")).show() > Output /// > ++---++ > | age| name|age2| > ++---++ > |null|Michael| 0| > | 30| Andy| 31| > | 19| Justin| 20| > ++---++ > {code} > I think for the null value we have 3 options > * Use a special value to represent it (what spark does now) > * Always return null if the udf input has null value argument > * Let udf itself to handle null > I would prefer the third option -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11744) bin/pyspark --version doesn't return version and exit
[ https://issues.apache.org/jira/browse/SPARK-11744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005572#comment-15005572 ] Nicholas Chammas commented on SPARK-11744: -- Not sure who would be the best person to comment on this. Perhaps [~vanzin], since this is part of the launcher? > bin/pyspark --version doesn't return version and exit > - > > Key: SPARK-11744 > URL: https://issues.apache.org/jira/browse/SPARK-11744 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Nicholas Chammas >Priority: Minor > > {{bin/pyspark \-\-help}} offers a {{\-\-version}} option: > {code} > $ ./spark/bin/pyspark --help > Usage: ./bin/pyspark [options] > Options: > ... > --version, Print the version of current Spark > ... > {code} > However, trying to get the version in this way doesn't yield the expected > results. > Instead of printing the version and exiting, we get the version, a stack > trace, and then get dropped into a plain Python shell ({{sc}} is not defined). > {code} > $ ./spark/bin/pyspark --version > Python 2.7.10 (default, Aug 11 2015, 23:39:10) > [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.5.2 > /_/ > > Type --help for more information. 
> Traceback (most recent call last): > File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in > sc = SparkContext(pyFiles=add_files) > File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__ > SparkContext._ensure_initialized(self, gateway=gateway) > File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in > _ensure_initialized > SparkContext._gateway = gateway or launch_gateway() > File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in > launch_gateway > raise Exception("Java gateway process exited before sending the driver > its port number") > Exception: Java gateway process exited before sending the driver its port > number > >>> > >>> sc > Traceback (most recent call last): > File "", line 1, in > NameError: name 'sc' is not defined > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11744) bin/pyspark --version doesn't return version and exit
[ https://issues.apache.org/jira/browse/SPARK-11744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-11744: - Description: {{bin/pyspark \-\-help}} offers a {{\-\-version}} option: {code} $ ./spark/bin/pyspark --help Usage: ./bin/pyspark [options] Options: ... --version, Print the version of current Spark ... {code} However, trying to get the version in this way doesn't yield the expected results. Instead of printing the version and exiting, we get the version, a stack trace, and then get dropped into a plain Python shell ({{sc}} is not defined). {code} $ ./spark/bin/pyspark --version Python 2.7.10 (default, Aug 11 2015, 23:39:10) [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2 Type "help", "copyright", "credits" or "license" for more information. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.5.2 /_/ Type --help for more information. Traceback (most recent call last): File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in sc = SparkContext(pyFiles=add_files) File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__ SparkContext._ensure_initialized(self, gateway=gateway) File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in _ensure_initialized SparkContext._gateway = gateway or launch_gateway() File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway raise Exception("Java gateway process exited before sending the driver its port number") Exception: Java gateway process exited before sending the driver its port number >>> >>> sc Traceback (most recent call last): File "", line 1, in NameError: name 'sc' is not defined {code} was: {{bin/pyspark --help}} offers a {{--version}} option: {code} $ ./spark/bin/pyspark --help Usage: ./bin/pyspark [options] Options: ... --version, Print the version of current Spark ... {code} However, trying to get the version in this way doesn't yield the expected results. 
Instead of printing the version and exiting, we get the version, a stack trace, and then get dropped into a plain Python shell ({{sc}} is not defined). {code} $ ./spark/bin/pyspark --version Python 2.7.10 (default, Aug 11 2015, 23:39:10) [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2 Type "help", "copyright", "credits" or "license" for more information. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.5.2 /_/ Type --help for more information. Traceback (most recent call last): File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in sc = SparkContext(pyFiles=add_files) File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__ SparkContext._ensure_initialized(self, gateway=gateway) File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in _ensure_initialized SparkContext._gateway = gateway or launch_gateway() File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway raise Exception("Java gateway process exited before sending the driver its port number") Exception: Java gateway process exited before sending the driver its port number >>> >>> sc Traceback (most recent call last): File "", line 1, in NameError: name 'sc' is not defined {code} > bin/pyspark --version doesn't return version and exit > - > > Key: SPARK-11744 > URL: https://issues.apache.org/jira/browse/SPARK-11744 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Nicholas Chammas >Priority: Minor > > {{bin/pyspark \-\-help}} offers a {{\-\-version}} option: > {code} > $ ./spark/bin/pyspark --help > Usage: ./bin/pyspark [options] > Options: > ... > --version, Print the version of current Spark > ... > {code} > However, trying to get the version in this way doesn't yield the expected > results. > Instead of printing the version and exiting, we get the version, a stack > trace, and then get dropped into a plain Python shell ({{sc}} is not defined). 
> {code} > $ ./spark/bin/pyspark --version > Python 2.7.10 (default, Aug 11 2015, 23:39:10) > [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.5.2 > /_/ > > Type --help for more information. > Traceback (most recent call last): > File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in > sc = SparkContext(pyFiles=add_files) > File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__ > SparkContext._ensure_initialized(self, gateway=gateway) > File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in _ensure_initialized > SparkContext._gateway = gateway or launch_gateway() > File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway > raise Exception("Java gateway process exited before sending the driver its port number") > Exception: Java gateway process exited before sending the driver its port number > >>> > >>> sc > Traceback (most recent call last): > File "", line 1, in > NameError: name 'sc' is not defined > {code}
[jira] [Created] (SPARK-11744) bin/pyspark --version doesn't return version and exit
Nicholas Chammas created SPARK-11744: Summary: bin/pyspark --version doesn't return version and exit Key: SPARK-11744 URL: https://issues.apache.org/jira/browse/SPARK-11744 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.5.2 Reporter: Nicholas Chammas Priority: Minor {{bin/pyspark --help}} offers a {{--version}} option: {code} $ ./spark/bin/pyspark --help Usage: ./bin/pyspark [options] Options: ... --version, Print the version of current Spark ... {code} However, trying to get the version in this way doesn't yield the expected results. Instead of printing the version and exiting, we get the version, a stack trace, and then get dropped into a plain Python shell ({{sc}} is not defined). {code} $ ./spark/bin/pyspark --version Python 2.7.10 (default, Aug 11 2015, 23:39:10) [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2 Type "help", "copyright", "credits" or "license" for more information. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.5.2 /_/ Type --help for more information. 
Traceback (most recent call last): File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in sc = SparkContext(pyFiles=add_files) File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__ SparkContext._ensure_initialized(self, gateway=gateway) File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in _ensure_initialized SparkContext._gateway = gateway or launch_gateway() File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway raise Exception("Java gateway process exited before sending the driver its port number") Exception: Java gateway process exited before sending the driver its port number >>> >>> sc Traceback (most recent call last): File "", line 1, in NameError: name 'sc' is not defined {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11744) bin/pyspark --version doesn't return version and exit
[ https://issues.apache.org/jira/browse/SPARK-11744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-11744: - Description: {{bin/pyspark \-\-help}} offers a {{\-\-version}} option: {code} $ ./spark/bin/pyspark --help Usage: ./bin/pyspark [options] Options: ... --version, Print the version of current Spark ... {code} However, trying to get the version in this way doesn't yield the expected results. Instead of printing the version and exiting, we get the version, a stack trace, and then get dropped into a broken PySpark shell. {code} $ ./spark/bin/pyspark --version Python 2.7.10 (default, Aug 11 2015, 23:39:10) [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2 Type "help", "copyright", "credits" or "license" for more information. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.5.2 /_/ Type --help for more information. Traceback (most recent call last): File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in sc = SparkContext(pyFiles=add_files) File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__ SparkContext._ensure_initialized(self, gateway=gateway) File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in _ensure_initialized SparkContext._gateway = gateway or launch_gateway() File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway raise Exception("Java gateway process exited before sending the driver its port number") Exception: Java gateway process exited before sending the driver its port number >>> >>> sc Traceback (most recent call last): File "", line 1, in NameError: name 'sc' is not defined {code} was: {{bin/pyspark \-\-help}} offers a {{\-\-version}} option: {code} $ ./spark/bin/pyspark --help Usage: ./bin/pyspark [options] Options: ... --version, Print the version of current Spark ... {code} However, trying to get the version in this way doesn't yield the expected results. 
Instead of printing the version and exiting, we get the version, a stack trace, and then get dropped into a plain Python shell ({{sc}} is not defined). {code} $ ./spark/bin/pyspark --version Python 2.7.10 (default, Aug 11 2015, 23:39:10) [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2 Type "help", "copyright", "credits" or "license" for more information. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.5.2 /_/ Type --help for more information. Traceback (most recent call last): File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in sc = SparkContext(pyFiles=add_files) File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__ SparkContext._ensure_initialized(self, gateway=gateway) File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in _ensure_initialized SparkContext._gateway = gateway or launch_gateway() File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway raise Exception("Java gateway process exited before sending the driver its port number") Exception: Java gateway process exited before sending the driver its port number >>> >>> sc Traceback (most recent call last): File "", line 1, in NameError: name 'sc' is not defined {code} > bin/pyspark --version doesn't return version and exit > - > > Key: SPARK-11744 > URL: https://issues.apache.org/jira/browse/SPARK-11744 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Nicholas Chammas >Priority: Minor > > {{bin/pyspark \-\-help}} offers a {{\-\-version}} option: > {code} > $ ./spark/bin/pyspark --help > Usage: ./bin/pyspark [options] > Options: > ... > --version, Print the version of current Spark > ... > {code} > However, trying to get the version in this way doesn't yield the expected > results. > Instead of printing the version and exiting, we get the version, a stack > trace, and then get dropped into a broken PySpark shell. 
> {code} > $ ./spark/bin/pyspark --version > Python 2.7.10 (default, Aug 11 2015, 23:39:10) > [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.5.2 > /_/ > > Type --help for more information. > Traceback (most recent call last): > File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in > sc = SparkContext(pyFiles=add_files) > File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__ > SparkContext._ensure_initialized(self, gateway=gateway) > File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in _ensure_initialized > SparkContext._gateway = gateway or launch_gateway() > File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway > raise Exception("Java gateway process exited before sending the driver its port number") > Exception: Java gateway process exited before sending the driver its port number > >>> > >>> sc > Traceback (most recent call last): > File "", line 1, in > NameError: name 'sc' is not defined > {code}
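The traceback above shows the shell bootstrap running even though the JVM side already printed the version and exited, leaving the Python REPL with no gateway. A hypothetical sketch (not Spark's actual fix; all names here are illustrative) of how a launcher could short-circuit `--version` before any shell startup:

```python
# Illustrative launcher sketch: handle --version up front and exit, instead
# of falling through to REPL startup against an already-exited JVM gateway.
SPARK_VERSION = "1.5.2"  # placeholder value for illustration


def run_shell(argv):
    # Stand-in for the real bootstrap (shell.py launching the Java gateway
    # and constructing a SparkContext); returns an exit code.
    return 0


def launch(argv):
    """Return an exit code; print the version and stop if --version is given."""
    if "--version" in argv:
        print("Spark version %s" % SPARK_VERSION)
        return 0  # exit cleanly; never start the interactive shell
    return run_shell(argv)
```

With this shape, `launch(["--version"])` prints one line and returns without ever touching the gateway, which is the behavior the reporter expected.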
[jira] [Commented] (SPARK-10673) spark.sql.hive.verifyPartitionPath Attempts to Verify Unregistered Partitions
[ https://issues.apache.org/jira/browse/SPARK-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005577#comment-15005577 ] Xin Wu commented on SPARK-10673: The fix is being tested; a PR will follow shortly. > spark.sql.hive.verifyPartitionPath Attempts to Verify Unregistered Partitions > - > > Key: SPARK-10673 > URL: https://issues.apache.org/jira/browse/SPARK-10673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0, 1.5.0 >Reporter: Miklos Christine >Priority: Minor > > In Spark 1.4, spark.sql.hive.verifyPartitionPath was set to true by default. > In Spark 1.5, it is now set to false by default. > If a table has a lot of partitions in the underlying filesystem, the code > unnecessarily checks for all the underlying directories when executing a > query. > https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L162 > Structure: > {code} > /user/hive/warehouse/table1/year=2015/month=01/ > /user/hive/warehouse/table1/year=2015/month=02/ > /user/hive/warehouse/table1/year=2015/month=03/ > ... > /user/hive/warehouse/table1/year=2014/month=01/ > /user/hive/warehouse/table1/year=2014/month=02/ > {code} > If the registered partitions only contain year=2015 when you run "show > partitions table1", this code path checks for all directories under the > table's root directory. This incurs a significant performance penalty if > there are a lot of partition directories.
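The discussion above contrasts globbing the whole table root (`/pathToTable/*/*/...`) against checking only registered partitions. A hypothetical plain-Python sketch of that idea (not the actual Spark patch; function names and the injected `path_exists` callback are illustrative): build the exact directory path for each registered partition from its column values, and test only those paths.

```python
# Sketch: verify only registered partitions by constructing their exact
# paths, instead of globbing and stat-ing every directory under the table.
import os


def partition_path(table_root, partition_spec, column_order):
    # partition_spec: e.g. {"year": "2015", "month": "01"}
    # column_order fixes the directory nesting, e.g. ["year", "month"]
    parts = ["%s=%s" % (col, partition_spec[col]) for col in column_order]
    return os.path.join(table_root, *parts)


def existing_registered_partitions(table_root, specs, column_order, path_exists):
    # path_exists is injected (e.g. a HDFS fs.exists) so the sketch stays
    # self-contained; returns only the registered paths that exist on disk.
    return [p
            for p in (partition_path(table_root, s, column_order) for s in specs)
            if path_exists(p)]
```

This touches the filesystem once per registered partition rather than once per directory under the table root, which is the performance concern raised in the issue.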
[jira] [Updated] (SPARK-11738) Make ArrayType orderable
[ https://issues.apache.org/jira/browse/SPARK-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-11738: - Summary: Make ArrayType orderable (was: Make array orderable) > Make ArrayType orderable > > > Key: SPARK-11738 > URL: https://issues.apache.org/jira/browse/SPARK-11738 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Priority: Blocker >
[jira] [Commented] (SPARK-11704) Optimize the Cartesian Join
[ https://issues.apache.org/jira/browse/SPARK-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005682#comment-15005682 ] Zhan Zhang commented on SPARK-11704: [~maropu] You are right. I mean that fetching over the network is a big overhead. Feel free to work on it. > Optimize the Cartesian Join > --- > > Key: SPARK-11704 > URL: https://issues.apache.org/jira/browse/SPARK-11704 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Zhan Zhang > > Currently CartesianProduct relies on RDD.cartesian, in which the computation > is realized as follows > override def compute(split: Partition, context: TaskContext): Iterator[(T, > U)] = { > val currSplit = split.asInstanceOf[CartesianPartition] > for (x <- rdd1.iterator(currSplit.s1, context); > y <- rdd2.iterator(currSplit.s2, context)) yield (x, y) > } > From the above loop, if rdd1.count is n, rdd2 needs to be recomputed n times, > which is really heavy and may never finish if n is large, especially when > rdd2 is coming from ShuffleRDD. > We should have some optimization on CartesianProduct by caching rightResults. > The problem is that we don't have a cleanup hook to unpersist rightResults > AFAIK. I think we should have some cleanup hook after query execution. > With the hook available, we can easily optimize such Cartesian joins. I > believe such a cleanup hook may also benefit other query optimizations.
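A minimal plain-Python sketch (no Spark; all names are illustrative) of the cost the issue describes: the nested loop in `compute` re-evaluates the right-side iterator once per left element, while materializing it once, as a cache-style optimization would, pays the cost a single time.

```python
# Counts how often the "right side" is computed, standing in for
# rdd2.iterator(...) in RDD.cartesian (e.g. an expensive shuffle read).
compute_count = {"right": 0}


def right_iterator():
    compute_count["right"] += 1
    return iter([10, 20])


def cartesian_uncached(left):
    # Mirrors the quoted compute(): the right side is recomputed per element.
    for x in left:
        for y in right_iterator():
            yield (x, y)


def cartesian_cached(left):
    # Materialize the right side once, as caching rightResults would.
    cached = list(right_iterator())
    for x in left:
        for y in cached:
            yield (x, y)
```

Running both over a 3-element left side produces the same 6 pairs, but the uncached version invokes the right-side computation 3 times versus once, which is exactly the n-fold recomputation the issue calls out.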
[jira] [Assigned] (SPARK-11738) Make ArrayType orderable
[ https://issues.apache.org/jira/browse/SPARK-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11738: Assignee: Apache Spark > Make ArrayType orderable > > > Key: SPARK-11738 > URL: https://issues.apache.org/jira/browse/SPARK-11738 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Blocker >
[jira] [Assigned] (SPARK-11738) Make ArrayType orderable
[ https://issues.apache.org/jira/browse/SPARK-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11738: Assignee: (was: Apache Spark) > Make ArrayType orderable > > > Key: SPARK-11738 > URL: https://issues.apache.org/jira/browse/SPARK-11738 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Priority: Blocker >
[jira] [Commented] (SPARK-11738) Make ArrayType orderable
[ https://issues.apache.org/jira/browse/SPARK-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005689#comment-15005689 ] Apache Spark commented on SPARK-11738: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/9718 > Make ArrayType orderable > > > Key: SPARK-11738 > URL: https://issues.apache.org/jira/browse/SPARK-11738 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Priority: Blocker >