[GitHub] spark pull request: Unify the logic for column pruning, projection...
GitHub user marmbrus opened a pull request: https://github.com/apache/spark/pull/213 Unify the logic for column pruning, projection, and filtering of table scans. This removes a bunch of duplicated logic, dead code, and casting when planning parquet table scans and hive table scans. Other changes:
- Fix tests now that we are doing a better job of column pruning (i.e., since pruning predicates are applied before we even start scanning tuples, columns required by these predicates do not need to be included unless they are also included in the final output of this logical plan fragment).
- Add a rule to simplify trivial filters. This was required to prevent `WHERE false` from getting pushed into table scans, since `HiveTableScan` (reasonably) refuses to apply partition pruning predicates to non-partitioned tables.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/marmbrus/spark strategyCleanup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/213.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #213 commit 0ae86cfcba56b700d8e7bd869379f0c663b21c1e Author: Michael Armbrust mich...@databricks.com Date: 2014-03-24T04:57:42Z Unify the logic for column pruning, projection, and filtering of table scans for both Hive and Parquet relations. Fix tests now that we are doing a better job of column pruning. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
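The trivial-filter rule described above can be sketched as a tiny, stand-alone plan rewrite. All types here (Plan, Scan, Empty, Filter, simplify) are hypothetical illustrations, not Catalyst's actual API: a filter whose predicate is the literal false collapses to an empty relation before any scan is planned, and a literal true filter is dropped entirely.

```java
// Hypothetical sketch (not Catalyst's API) of the "simplify trivial filters" idea:
// Filter(false, child) -> Empty, Filter(true, child) -> child.
public class TrivialFilterDemo {
    static abstract class Plan {}
    static class Scan extends Plan { final String table; Scan(String t) { table = t; } }
    static class Empty extends Plan {}
    static class Filter extends Plan {
        final Boolean literal; // TRUE/FALSE for a constant predicate, null otherwise
        final Plan child;
        Filter(Boolean literal, Plan child) { this.literal = literal; this.child = child; }
    }

    static Plan simplify(Plan p) {
        if (p instanceof Filter) {
            Filter f = (Filter) p;
            Plan child = simplify(f.child);
            if (Boolean.FALSE.equals(f.literal)) return new Empty(); // WHERE false: no rows
            if (Boolean.TRUE.equals(f.literal)) return child;        // WHERE true: no-op
            return new Filter(f.literal, child);
        }
        return p;
    }

    public static void main(String[] args) {
        Plan p = simplify(new Filter(false, new Scan("partitionless_table")));
        System.out.println(p instanceof Empty); // true: the table scan is never reached
    }
}
```

Because the rule fires before planning, a `WHERE false` query never asks `HiveTableScan` to prune a non-partitioned table.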
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10873319

--- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileInputFormat.java ---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.input;
+
+import java.io.IOException;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.JobContext;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+/**
+ * The specific InputFormat reads files in HDFS or local disk into pair (filename, content) format.
+ * It will be called by HadoopRDD to generate new WholeTextFileRecordReader.
--- End diff --

Sorry I forgot this is Java. Then you should use `@link` instead of `[[]]`.
---
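mengxr's point — Java sources are documented with Javadoc, so cross-references take the `@link` tag rather than Scaladoc's `[[...]]` — can be seen in a minimal, hypothetical file (this class is illustrative, not the PR's actual code):

```java
/**
 * Reads whole files into (filename, content) pairs. Javadoc cross-references use
 * the form {@link java.io.IOException}; Scaladoc's double-bracket syntax is not
 * understood by the javadoc tool and would render as literal brackets.
 */
public class LinkTagDemo {
    public static void main(String[] args) {
        // The doc comment above is the point of this example; the code just compiles.
        System.out.println("ok");
    }
}
```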
[GitHub] spark pull request: SPARK-1094 Support MiMa for reporting binary c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/207#issuecomment-38414578 Build finished. ---
[GitHub] spark pull request: SPARK-1094 Support MiMa for reporting binary c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/207#issuecomment-38414579 One or more automated tests failed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13381/ ---
[GitHub] spark pull request: Unify the logic for column pruning, projection...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/213#issuecomment-38414612 Merged build triggered. ---
[GitHub] spark pull request: java.nio.charset.MalformedInputException
Github user scwf closed the pull request at: https://github.com/apache/spark/pull/212 ---
[GitHub] spark pull request: [SPARK-1141] [WIP] Parallelize Task Serializat...
GitHub user qqsun8819 opened a pull request: https://github.com/apache/spark/pull/214 [SPARK-1141] [WIP] Parallelize Task Serialization https://spark-project.atlassian.net/browse/SPARK-1141 @kayousterhout copied from JIRA (the design doc in JIRA is old; I'll update it later): TaskSetManager.resourceOffer will return a TaskDescWithoutSerializeTask object. This object will be a partial copy of TaskDescription, minus the _serializedTask ByteBuffer; instead it will contain a Task object, and the serialization work inside TaskSetManager.resourceOffer will be moved to a TaskSchedulerImpl Runnable worker thread placed inside a thread pool. DriverSuite failed in my own env; working on fixing it. You can merge this pull request into a Git repository by running: $ git pull https://github.com/qqsun8819/spark task-serialize Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/214.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #214 commit 53795965dd16c54a4981ef4ee754f326663f9795 Author: Ouyang Jin jin@alibaba-inc.com Date: 2014-03-16T15:57:43Z Initial version of Parallelize Task Serialization in dev code, but this version has a chance to hang in multi-task execution and needs debug commit 0bb37447d403c63b21b06cf15a612eb363c701da Author: OuYang Jin jin@alibaba-inc.com Date: 2014-03-23T14:47:56Z Merge remote-tracking branch 'upstream/master' into task-serialize commit 177195d20ddef34d339f6385d50382944c9c149d Author: OuYang Jin jin@alibaba-inc.com Date: 2014-03-24T06:16:27Z Modify asychroniazed sleep wait to pass job running case ---
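The scheme in SPARK-1141 can be sketched with plain JDK primitives. All names here (ParallelSerializeDemo, serialize, the string payloads) are illustrative, not Spark's actual classes: the scheduler hands each still-unserialized task to a worker pool, and the serialization happens concurrently, off the scheduler's critical path.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hedged sketch of parallelized task serialization: resourceOffer would return the
// task description *without* the serialized bytes, and a worker pool fills them in.
public class ParallelSerializeDemo {
    static byte[] serialize(Serializable task) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(task);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<byte[]>> futures = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            final String payload = "task-" + i; // stand-in for an unserialized Task
            // Serialization is submitted to the pool instead of running inline here.
            futures.add(pool.submit(() -> serialize(payload)));
        }
        for (Future<byte[]> f : futures) {
            System.out.println(f.get().length > 0); // each task serialized off-thread
        }
        pool.shutdown();
    }
}
```

The trade-off the PR has to manage is exactly what a sketch like this hides: futures reorder completion, so the scheduler must not block on serialization results on its own thread.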
[GitHub] spark pull request: Fixed coding style issues in Spark SQL
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/208#issuecomment-38415798
```scala
package org.apache.spark.sql
package catalyst
```
vs
```scala
package org.apache.spark.sql.catalyst
```
There are three reasons I think it'd be better to just have one package statement in a file.
1. It is easier to understand by most programmers, especially the ones that come from Java land (I was chatting with another committer just now and he got confused by the meaning of having two or three package statements in a Scala file).
2. It requires explicit imports to open up the parent package scope and avoids polluting the namespace (there is no difference in terms of line count here, since you add one import but remove one package).
3. It is more consistent with the rest of the Spark code base.
Now this is a highly subjective topic, so we should get others to chime in.
---
[GitHub] spark pull request: Fixed coding style issues in Spark SQL
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/208#issuecomment-38416577 @pwendell When I started on this PR, I was also puzzled about which option to use at first, until I saw the same usage in the Scala standard library. But I should confess that I wasn't 100% sure about what the first option exactly means until I played around with both of them a bit. I think the most significant advantage of the first option is that it opens parent packages implicitly. But since we forbid relative package imports, this is not really an advantage any more. So I vote for the second option now. ---
[GitHub] spark pull request: [SPARK-1141] [WIP] Parallelize Task Serializat...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/214#issuecomment-38416907 Can one of the admins verify this patch? ---
[GitHub] spark pull request: SPARK-1057: Upgrade fastutil to 6.5.11
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/215#issuecomment-38416906 Merged build started. ---
[GitHub] spark pull request: SPARK-1144 Added license and RAT to check lice...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/125#issuecomment-38416924 Merged build triggered. ---
[GitHub] spark pull request: SPARK-1294 Fix resolution of uppercase field n...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/202#issuecomment-38416911 Merged build triggered. ---
[GitHub] spark pull request: SPARK-1144 Added license and RAT to check lice...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/125#issuecomment-38416925 Merged build started. ---
[GitHub] spark pull request: SPARK-1294 Fix resolution of uppercase field n...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/202#issuecomment-38416912 Merged build started. ---
[GitHub] spark pull request: SPARK-1144 Added license and RAT to check lice...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/125#issuecomment-38417050 One or more automated tests failed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13386/ ---
[GitHub] spark pull request: SPARK-1057: Upgrade fastutil to 6.5.11
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/215#issuecomment-38417092 LGTM - not sure about merging into 0.9.1 though. ---
[GitHub] spark pull request: Fixed coding style issues in Spark SQL
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/208#issuecomment-38417357 I think a key point here is whether we should assume all contributors use IDEs like IntelliJ and their automated features. At least I find that, in most scenarios, IntelliJ's default behaviours match the Spark coding convention well. Exceptions include indentation and false-positive suggestions about adding/removing parentheses to/from Java getter/setter methods. Maybe we can suggest that developers (including ourselves) rely on the IDE more, and even provide a default IntelliJ configuration that matches the Spark coding convention better? ---
[GitHub] spark pull request: Fix SPARK-1280: Stage.name return apply at Op...
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/180#issuecomment-38417874 Who can merge the improvement for web UI? @aarondav ---
[GitHub] spark pull request: Fix SPARK-1280: Stage.name return apply at Op...
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/180#issuecomment-38418878 Thanks for this patch. Would you mind providing an example stack trace where this helped? I want to get a better sense of the issue to see if this is specific to Option or part of a more sinister problem. ---
[GitHub] spark pull request: [SPARK-1186] : Enrich the Spark Shell to suppo...
Github user berngp commented on the pull request: https://github.com/apache/spark/pull/116#issuecomment-38418948 @aarondav provided the squashed commit with the agreed changes. Below is the help message that we could have if we want to support the args used by #86. Please share your thoughts.
```
${txtbld}Usage${txtrst}: spark-shell [OPTIONS]

${txtbld}OPTIONS${txtrst}:
    -h --help : Print this help information.
    --executor-memory : The memory used by each executor of the Spark Shell, the number is followed by m for megabytes or g for gigabytes, e.g. 1g.
    --driver-memory : The memory used by the Spark Shell, the number is followed by m for megabytes or g for gigabytes, e.g. 1g, defaults to 512Mb.
    --master : A full string that describes the Spark Master, defaults to local, e.g. spark://localhost:7077.
    --log-conf : Enables logging of the supplied SparkConf as INFO at start of the Spark Context.

${txtbld}Spark standalone with cluster deploy mode only${txtrst}:
    --driver-cores : Cores for driver.
    --supervise : Whether to restart the driver on failure.

${txtbld}Spark standalone and Mesos only${txtrst}:
    --total-executor-cores CORES : Total cores for all executors.

${txtbld}YARN-only${txtrst}:
    --executor-cores : Number of cores per executor (Default: 1).
    --executor-memory : Memory per executor (e.g. 1000M, 2G) (Default: 1G).
    --queue QUEUE : The YARN queue to submit the application to (Default: 'default').
    --num-executors NUM : Number of executors to start (Default: 2).
    --files FILES : Comma separated list of files to be placed next to all executors.
    --archives ARCHIVES : Comma separated list of archives to be extracted next to all executors.

e.g. spark-shell -m spark://localhost:7077 -c 4 -dm 512m -em 2g
```
---
[GitHub] spark pull request: SPARK-1057: Upgrade fastutil to 6.5.11
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/215#issuecomment-38419532 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13384/ ---
[GitHub] spark pull request: SPARK-1057: Upgrade fastutil to 6.5.11
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/215#issuecomment-38419531 Merged build finished. ---
[GitHub] spark pull request: SPARK-1294 Fix resolution of uppercase field n...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/202#issuecomment-38419522 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13385/ ---
[GitHub] spark pull request: SPARK-1094 Support MiMa for reporting binary c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/207#issuecomment-38419549 Build started. ---
[GitHub] spark pull request: SPARK-1094 Support MiMa for reporting binary c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/207#issuecomment-38419548 Build triggered. ---
[GitHub] spark pull request: SPARK-1096, a space after comment start style ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/124#issuecomment-38419558 Merged build triggered. ---
[GitHub] spark pull request: SPARK-1144 Added license and RAT to check lice...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/125#issuecomment-38419554 Merged build triggered. ---
[GitHub] spark pull request: [SPARK-1186] : Enrich the Spark Shell to suppo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/116#issuecomment-38419564 Merged build triggered. ---
[GitHub] spark pull request: [SPARK-1186] : Enrich the Spark Shell to suppo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/116#issuecomment-38419565 Merged build started. ---
[GitHub] spark pull request: SPARK-1094 Support MiMa for reporting binary c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/207#issuecomment-38420087 One or more automated tests failed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13387/ ---
[GitHub] spark pull request: SPARK-1094 Support MiMa for reporting binary c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/207#issuecomment-38420085 Build finished. ---
[GitHub] spark pull request: Fix SPARK-1280: Stage.name return apply at Op...
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/180#issuecomment-38422713
val pairs = sc.parallelize(Array((1, 1), (1, 2), (1, 3), (2, 1)))
pairs.take(1)
http://host:4040/stages/ Completed Stages table, Description column = apply at Option.scala:120
Option.scala:120: @inline final def getOrElse[B >: A](default: => B): B = if (isEmpty) default else this.get
---
[GitHub] spark pull request: [SPARK-1186] : Enrich the Spark Shell to suppo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/116#issuecomment-38423316 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13390/ ---
[GitHub] spark pull request: SPARK-1144 Added license and RAT to check lice...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/125#issuecomment-38423311 Merged build finished. ---
[GitHub] spark pull request: SPARK-1096, a space after comment start style ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/124#issuecomment-38423314 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13389/ ---
[GitHub] spark pull request: SPARK-1144 Added license and RAT to check lice...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/125#issuecomment-38423315 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13388/ ---
[GitHub] spark pull request: Adding an option to persist Spark RDD blocks ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/158#issuecomment-38423375 Merged build started. ---
[GitHub] spark pull request: Adding an option to persist Spark RDD blocks ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/158#issuecomment-38423374 Merged build triggered. ---
[GitHub] spark pull request: Unify the logic for column pruning, projection...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/213#issuecomment-38423970 Two unused imports are left in `HiveStrategies.scala`. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10877199 --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java --- @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.input; + +import java.io.IOException; + +import com.google.common.io.Closeables; +import org.apache.commons.io.IOUtils; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +/** + * An codeorg.apache.hadoop.mapreduce.RecordReader/code for reading whole text file out in + * (filename, content) format. Each element in split is an record of a unique, whole file. File name + * is the full path name for easy deduplicate. 
+ */
+public class WholeTextFileRecordReader extends RecordReader<String, Text> {
+  private Path path;
+
+  private String key = null;
+  private Text value = null;
+
+  private boolean processed = false;
+
+  private FileSystem fs;
+
+  public WholeTextFileRecordReader(
+      CombineFileSplit split,
+      TaskAttemptContext context,
+      Integer index)
+    throws IOException {
+    path = split.getPath(index);
+    fs = path.getFileSystem(context.getConfiguration());
+  }
+
+  @Override
+  public void initialize(InputSplit arg0, TaskAttemptContext arg1)
+    throws IOException, InterruptedException {
+  }
+
+  @Override
+  public void close() throws IOException {
+  }
+
+  @Override
+  public float getProgress() throws IOException {
+    return processed ? 1.0f : 0.0f;
+  }
+
+  @Override
+  public String getCurrentKey() throws IOException, InterruptedException {
+    return key;
+  }
+
+  @Override
+  public Text getCurrentValue() throws IOException, InterruptedException {
+    return value;
+  }
+
+  @Override
+  public boolean nextKeyValue() throws IOException {
+    if (!processed) {
+      if (key == null) {
+        key = path.toString();
+      }
+      if (value == null) {
+        value = new Text();
+      }
+
+      FSDataInputStream fileIn = null;
+      try {
+        fileIn = fs.open(path);
+        byte[] innerBuffer = IOUtils.toByteArray(fileIn);
--- End diff --
@mengxr @yinxusen PS I am happy to take on removal of the use of commons-io in favor of equivalents in Guava. There is a bit more than this usage, but it's easy stuff. Commons IO is fine, but it's not necessary to use it here, and it is one of those dependencies that could collide with other versions in Hadoop. If anyone nods I'll open a separate PR.
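For reference, both `IOUtils.toByteArray` and its Guava counterpart `ByteStreams.toByteArray` amount to draining a stream into a growing buffer. A minimal plain-JDK sketch of that behavior (class and method names here are illustrative, not from the PR):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadWholeStream {
    // Sketch of what IOUtils.toByteArray / Guava's ByteStreams.toByteArray do:
    // read the stream in chunks until EOF and return the accumulated bytes.
    static byte[] toByteArray(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = toByteArray(new ByteArrayInputStream("whole file".getBytes()));
        System.out.println(new String(data));
    }
}
```

If the commons-io call were swapped for Guava as suggested, the call site would presumably become `com.google.common.io.ByteStreams.toByteArray(fileIn)` with no behavior change.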
[GitHub] spark pull request: Unify the logic for column pruning, projection...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/213#issuecomment-38428081 LGTM, much cleaner :)
[GitHub] spark pull request: [SPARK-1210] Prevent ContextClassLoader of Act...
Github user ueshin commented on the pull request: https://github.com/apache/spark/pull/15#issuecomment-38432140 Added a test case.
[GitHub] spark pull request: [SPARK-1210] Prevent ContextClassLoader of Act...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/15#issuecomment-38432384 Build triggered.
[GitHub] spark pull request: [SPARK-1210] Prevent ContextClassLoader of Act...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/15#issuecomment-38432385 Build started.
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38436606 Build started.
[GitHub] spark pull request: [SPARK-1210] Prevent ContextClassLoader of Act...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/15#issuecomment-38436529 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13392/
[GitHub] spark pull request: SPARK-1235: kill the application when DAGSched...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/186#issuecomment-38436603 Merged build started.
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38436605 Build triggered.
[GitHub] spark pull request: [SPARK-1210] Prevent ContextClassLoader of Act...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/15#issuecomment-38436528 Build finished.
[GitHub] spark pull request: Fixed coding style issues in Spark SQL
Github user alig commented on the pull request: https://github.com/apache/spark/pull/208#issuecomment-38439045 Also +1 for ```package org.apache.spark.sql.catalyst```, just because it's simpler to understand for the majority of the programmers in the world ;)
[GitHub] spark pull request: [SPARK-1198] Allow pipes tasks to run in diffe...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/128#issuecomment-38440744 Yes, I can add something, although I don't really have a way to test it. Note that my original question was how we want to go about adding support for Windows/Linux-specific shell commands. Borrowing from Hadoop, we could create some generic classes like UnixShellScriptBuilder and WindowsShellScriptBuilder and then, based on the OS type, instantiate the correct one. Or since this is only one place, I could just conditionalize it there, and if we add more shell commands it can be made generic later.
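The Hadoop-style split described above can be sketched as follows. This is only the shape of the idea, not Hadoop's or Spark's actual code; the class names mirror the ones mentioned in the comment, and the concrete commands are illustrative:

```java
public class ShellCommandBuilder {
    // Hypothetical interface: each platform knows how to wrap a script
    // into an executable command line.
    interface ShellScriptBuilder {
        String[] command(String script);
    }

    static class UnixShellScriptBuilder implements ShellScriptBuilder {
        public String[] command(String script) {
            return new String[] {"bash", "-c", script};
        }
    }

    static class WindowsShellScriptBuilder implements ShellScriptBuilder {
        public String[] command(String script) {
            return new String[] {"cmd", "/c", script};
        }
    }

    // Pick the builder once, based on the OS the JVM reports.
    static ShellScriptBuilder forCurrentOs() {
        boolean isWindows =
            System.getProperty("os.name").toLowerCase().startsWith("windows");
        return isWindows ? new WindowsShellScriptBuilder() : new UnixShellScriptBuilder();
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", forCurrentOs().command("echo hi")));
    }
}
```

The alternative mentioned in the comment is simply inlining the `os.name` check at the single call site and refactoring to builders later if more shell commands appear.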
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38441593 Build finished.
[GitHub] spark pull request: SPARK-1235: kill the application when DAGSched...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/186#issuecomment-38441595 One or more automated tests failed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13393/
[GitHub] spark pull request: SPARK-1235: kill the application when DAGSched...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/186#issuecomment-38441592 Merged build finished.
[GitHub] spark pull request: SPARK-1235: kill the application when DAGSched...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/186#issuecomment-38441655 Merged build triggered.
[GitHub] spark pull request: SPARK-1235: kill the application when DAGSched...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/186#issuecomment-38441656 Merged build started.
[GitHub] spark pull request: SPARK-1235: kill the application when DAGSched...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/186#issuecomment-38441791 Merged build finished.
[GitHub] spark pull request: SPARK-1235: kill the application when DAGSched...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/186#issuecomment-38441793 One or more automated tests failed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13395/
[GitHub] spark pull request: SPARK-1235: kill the application when DAGSched...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/186#issuecomment-38441957 One or more automated tests failed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13396/
[GitHub] spark pull request: SPARK-1235: kill the application when DAGSched...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/186#issuecomment-38441955 Merged build finished.
[GitHub] spark pull request: SPARK-1235: kill the application when DAGSched...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/186#issuecomment-38442090 Merged build triggered.
[GitHub] spark pull request: SPARK-1235: kill the application when DAGSched...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/186#issuecomment-38442093 Merged build started.
[GitHub] spark pull request: [SPARK-1141] [WIP] Parallelize Task Serializat...
Github user CodingCat commented on a diff in the pull request: https://github.com/apache/spark/pull/214#discussion_r10884042

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -93,6 +96,10 @@ private[spark] class TaskSchedulerImpl(
   val mapOutputTracker = SparkEnv.get.mapOutputTracker
   var schedulableBuilder: SchedulableBuilder = null
+
+  private val serializeWorkerPool = new ThreadPoolExecutor(20, 60, 60, TimeUnit.SECONDS,
--- End diff --
also KeepAliveTime
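For context, the `ThreadPoolExecutor` constructor arguments in the diff map to core pool size, maximum pool size, keep-alive time, and its time unit. A standalone sketch (not the PR's code; the queue choice here is an assumption) showing what those values mean:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolConfigDemo {
    public static void main(String[] args) {
        // Mirrors the constructor in the diff: 20 core threads, up to 60,
        // with idle non-core threads reclaimed after 60 seconds.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            20, 60, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        System.out.println(pool.getCorePoolSize() + " "
            + pool.getMaximumPoolSize() + " "
            + pool.getKeepAliveTime(TimeUnit.SECONDS));
        pool.shutdown();
    }
}
```

One caveat worth noting in review: with an unbounded work queue like this, the pool never grows past the core size, so the maximum size and keep-alive settings only take effect with a bounded queue (or if core thread timeout is enabled).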
[GitHub] spark pull request: [SPARK-1141] [WIP] Parallelize Task Serializat...
Github user CodingCat commented on a diff in the pull request: https://github.com/apache/spark/pull/214#discussion_r10884203

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -243,12 +275,18 @@ private[spark] class TaskSchedulerImpl(
       } while (launchedTask)
     }
+    do {
+      Thread.sleep(1)
+    } while (!serializingTask.isEmpty)
+
     if (tasks.size > 0) {
       hasLaunchedTask = true
     }
     return tasks
   }
+
+
--- End diff --
extra line
[GitHub] spark pull request: [SPARK-1141] [WIP] Parallelize Task Serializat...
Github user CodingCat commented on a diff in the pull request: https://github.com/apache/spark/pull/214#discussion_r10884189

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -243,12 +275,18 @@ private[spark] class TaskSchedulerImpl(
       } while (launchedTask)
     }
+    do {
+      Thread.sleep(1)
+    } while (!serializingTask.isEmpty)
+
     if (tasks.size > 0) {
       hasLaunchedTask = true
     }
     return tasks
   }
+
--- End diff --
extra line
[GitHub] spark pull request: [SPARK-1141] [WIP] Parallelize Task Serializat...
Github user CodingCat commented on a diff in the pull request: https://github.com/apache/spark/pull/214#discussion_r10884300

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -219,18 +226,43 @@ private[spark] class TaskSchedulerImpl(
         taskSet.parent.name, taskSet.name, taskSet.runningTasks))
     }
+    val ser = SparkEnv.get.closureSerializer.newInstance()
     // Take each TaskSet in our scheduling order, and then offer it each node in increasing order
     // of locality levels so that it gets a chance to launch local tasks on all of them.
     var launchedTask = false
+    val serializingTask = new HashSet[Long]
--- End diff --
Since you are not fetching the serialized task from this HashSet, but just using taskDesc at L250, can we replace this with an integer? At L280, when the integer is not zero, keep sleeping.
[GitHub] spark pull request: [SPARK-1141] [WIP] Parallelize Task Serializat...
Github user CodingCat commented on a diff in the pull request: https://github.com/apache/spark/pull/214#discussion_r10884378

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -219,18 +226,43 @@ private[spark] class TaskSchedulerImpl(
         taskSet.parent.name, taskSet.name, taskSet.runningTasks))
     }
+    val ser = SparkEnv.get.closureSerializer.newInstance()
     // Take each TaskSet in our scheduling order, and then offer it each node in increasing order
     // of locality levels so that it gets a chance to launch local tasks on all of them.
     var launchedTask = false
+    val serializingTask = new HashSet[Long]
--- End diff --
oh, some concurrency issue here
[GitHub] spark pull request: [SPARK-1141] [WIP] Parallelize Task Serializat...
Github user CodingCat commented on a diff in the pull request: https://github.com/apache/spark/pull/214#discussion_r10884645

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -30,6 +30,9 @@ import scala.util.Random
 import org.apache.spark._
 import org.apache.spark.TaskState.TaskState
 import org.apache.spark.scheduler.SchedulingMode.SchedulingMode
+import org.apache.spark.util.Utils
+import scala.collection.mutable
--- End diff --
Do you mind adjusting the import statements order? See the Contributing to Spark wiki page.
[GitHub] spark pull request: [SPARK-1141] [WIP] Parallelize Task Serializat...
Github user CodingCat commented on a diff in the pull request: https://github.com/apache/spark/pull/214#discussion_r10884928

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -219,18 +226,43 @@ private[spark] class TaskSchedulerImpl(
         taskSet.parent.name, taskSet.name, taskSet.runningTasks))
     }
+    val ser = SparkEnv.get.closureSerializer.newInstance()
     // Take each TaskSet in our scheduling order, and then offer it each node in increasing order
     // of locality levels so that it gets a chance to launch local tasks on all of them.
     var launchedTask = false
+    val serializingTask = new HashSet[Long]
--- End diff --
just take care of the concurrency issue here
[GitHub] spark pull request: SPARK-1235: kill the application when DAGSched...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/186#issuecomment-38448296 Merged build finished.
[GitHub] spark pull request: SPARK-1235: kill the application when DAGSched...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/186#issuecomment-38448297 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13397/
[GitHub] spark pull request: [SPARK-1141] [WIP] Parallelize Task Serializat...
Github user qqsun8819 commented on a diff in the pull request: https://github.com/apache/spark/pull/214#discussion_r10886619

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -219,18 +226,43 @@ private[spark] class TaskSchedulerImpl(
         taskSet.parent.name, taskSet.name, taskSet.runningTasks))
     }
+    val ser = SparkEnv.get.closureSerializer.newInstance()
     // Take each TaskSet in our scheduling order, and then offer it each node in increasing order
     // of locality levels so that it gets a chance to launch local tasks on all of them.
     var launchedTask = false
+    val serializingTask = new HashSet[Long]
--- End diff --
Good point here, I'll try to replace it with AtomicLong
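The AtomicLong approach discussed in this thread can be sketched as follows (illustrative Java rather than the PR's Scala; names and sizes are hypothetical): the counter is incremented before each serialization job is submitted, decremented when it completes, and the caller spins until it drains to zero, mirroring the `Thread.sleep(1)` loop in the diff.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class InFlightCounter {
    public static void main(String[] args) throws InterruptedException {
        // Counter of tasks still being serialized; replaces the shared HashSet,
        // which needed external synchronization just to answer "any left?".
        AtomicLong inFlight = new AtomicLong(0);
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 8; i++) {
            inFlight.incrementAndGet();   // before submit, so the count never under-reports
            pool.submit(() -> {
                // stand-in for ser.serialize(task) producing a TaskDescription
                inFlight.decrementAndGet();
            });
        }
        // Wait for all serialization work to finish, as in the patch under review.
        while (inFlight.get() != 0) {
            Thread.sleep(1);
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("remaining=" + inFlight.get());
    }
}
```

Incrementing before submission (not inside the task) is what makes the spin loop safe: the counter can only reach zero after every submitted task has actually run its decrement.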
[GitHub] spark pull request: SPARK-1094 Support MiMa for reporting binary c...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/207#issuecomment-38462860 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-1141] [WIP] Parallelize Task Serializat...
Github user qqsun8819 commented on the pull request: https://github.com/apache/spark/pull/214#issuecomment-38463531 Fixed the DriverSuite test failure: the thread pool is now created inside resourceOffers and shut down before it returns, plus some other fixes according to @CodingCat's review.
[GitHub] spark pull request: Fixed coding style issues in Spark SQL
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/208#issuecomment-38466998 Thanks for all your votes! I'll fix this in a separate PR ASAP.
[GitHub] spark pull request: Spark parquet improvements
Github user AndreSchumacher commented on a diff in the pull request: https://github.com/apache/spark/pull/195#discussion_r10892494

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala ---
@@ -72,16 +73,43 @@ case class ParquetRelation(val tableName: String, val path: String) extends Base
   /** Output **/
   override val output = attributes

+  /** Name (dummy value) */
+  // TODO: rethink whether ParquetRelation should inherit from BaseRelation
+  // (currently required to re-use HiveStrategies but should be removed)
+  override def tableName = "parquet"
+
   // Parquet files have no concepts of keys, therefore no Partitioner
   // Note: we could allow Block level access; needs to be thought through
   override def isPartitioned = false
 }

 object ParquetRelation {
+  // change this to enable/disable Parquet logging
+  var DEBUG: Boolean = false
+
+  // TODO: consider redirecting Parquet's log output to log4j logger and
+  // using config file for log settings
+  def setParquetLogLevel() {
+    val level: Level = if (DEBUG) Level.FINEST else Level.WARNING
--- End diff --
Now I'm actually reading this here: "j.u.l. to SLF4J translation can seriously increase the cost of disabled logging statements (60 fold or 6000%)". Apparently there is a way to avoid this by using logback (a fork of log4j?). Parquet does fairly low-level logging and relies on these statements being compiled out, as I understand. Any opinions? I could try whether it would work via logback, or see how this would degrade performance.
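The `setParquetLogLevel` snippet in the diff uses the `java.util.logging` (j.u.l.) levels directly. A self-contained sketch of the same DEBUG-flag idiom in plain Java (the logger name here is illustrative, not necessarily the one Parquet registers):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class JulLevelDemo {
    // change this to enable/disable verbose logging, as in the Scala snippet
    static boolean DEBUG = false;

    public static void main(String[] args) {
        // FINEST when debugging, WARNING otherwise
        Level level = DEBUG ? Level.FINEST : Level.WARNING;
        Logger logger = Logger.getLogger("parquet");
        logger.setLevel(level);
        System.out.println(logger.getLevel());
    }
}
```

Setting the level on the named logger affects all of that logger's children, which is why a single call like this can silence (or enable) a whole library's j.u.l. output.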
[GitHub] spark pull request: [SPARK-1303] [MLLIB] Added discretization capa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/216#issuecomment-38471741 Can one of the admins verify this patch?
[GitHub] spark pull request: SPARK-1094 Support MiMa for reporting binary c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/207#issuecomment-38471725 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13398/
[GitHub] spark pull request: SPARK-1094 Support MiMa for reporting binary c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/207#issuecomment-38471724 Build finished.
[GitHub] spark pull request: [SPARK-1303] [MLLIB] Added discretization capa...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/216#issuecomment-38472118 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-1303] [MLLIB] Added discretization capa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/216#issuecomment-38472286 Merged build started.
[GitHub] spark pull request: [SPARK-1303] [MLLIB] Added discretization capa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/216#issuecomment-38472412 Merged build finished.
[GitHub] spark pull request: [SPARK-1303] [MLLIB] Added discretization capa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/216#issuecomment-38472415 One or more automated tests failed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13400/
[GitHub] spark pull request: Adding an option to persist Spark RDD blocks ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/158#issuecomment-38473429 Merged build finished.
[GitHub] spark pull request: SPARK-1235: kill the application when DAGSched...
Github user CodingCat commented on the pull request: https://github.com/apache/spark/pull/186#issuecomment-38476549 I adjusted the code to capture the exception inside the processEvent function, so that we can easily test the function in DAGSchedulerSuite.
[GitHub] spark pull request: Adding an option to persist Spark RDD blocks ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/158#issuecomment-38476606 @RongGu I just created a JIRA for this: https://spark-project.atlassian.net/browse/SPARK-1305 Do you mind updating the title here to start with "SPARK-1305:"? Also, can you create an account on the Spark JIRA so I can assign you as the contributor of this feature? Thanks!
[GitHub] spark pull request: SPARK-1094 Support MiMa for reporting binary c...
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/207#discussion_r10896563 --- Diff: tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala --- @@ -0,0 +1,131 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.tools + +import java.io.File +import java.util.jar.JarFile + +import scala.collection.mutable +import scala.collection.JavaConversions._ +import scala.reflect.runtime.universe.runtimeMirror + +/** + * A tool for generating classes to be excluded during binary checking with MIMA. It is expected + * that this tool is run with ./spark-class. + * + * MIMA itself only supports JVM-level visibility and doesn't account for package-private classes. + * This tool looks at all currently package-private classes and generates exclusions for them. Note + * that this approach is not sound. It can lead to false positives if we move or rename a previously + * package-private class. It can lead to false negatives if someone explicitly makes a class + * package-private that wasn't before. This exists only to help catch certain classes of changes + * which might be difficult to catch during review. 
+ */ +object GenerateMIMAIgnore { + private val classLoader = Thread.currentThread().getContextClassLoader + private val mirror = runtimeMirror(classLoader) + + private def classesPrivateWithin(packageName: String): Set[String] = { + +val classes = getClasses(packageName, classLoader) +val privateClasses = mutable.HashSet[String]() + +def isPackagePrivate(className: String) = { + try { +/* Couldn't figure out if it's possible to determine a-priori whether a given symbol + is a module or class. */ + +val privateAsClass = mirror + .staticClass(className) + .privateWithin + .fullName + .startsWith(packageName) + +val privateAsModule = mirror + .staticModule(className) + .privateWithin + .fullName + .startsWith(packageName) + +privateAsClass || privateAsModule + } catch { +case _: Throwable => { + println("Error determining visibility: " + className) + false +} + } +} + +for (className <- classes) { + val directlyPrivateSpark = isPackagePrivate(className) + + /* Inner classes defined within a private[spark] class or object are effectively + invisible, so we account for them as package private. */ + val indirectlyPrivateSpark = { +val maybeOuter = className.toString.takeWhile(_ != '$') +if (maybeOuter != className) { + isPackagePrivate(maybeOuter) +} else { + false +} + } + if (directlyPrivateSpark || indirectlyPrivateSpark) privateClasses += className +} +privateClasses.flatMap(c => Seq(c, c.replace("$", "#"))).toSet + } + + def main(args: Array[String]) { +scala.tools.nsc.io.File(".mima-excludes"). + writeAll(classesPrivateWithin("org.apache.spark").mkString("\n")) +println("Created : .mima-excludes in current directory.") + } + + + private def shouldExclude(name: String) = { +// Heuristic to remove JVM classes that do not correspond to user-facing classes in Scala +name.contains("anon") || --- End diff -- I keep trying to come up with a valid class name that contains "anon". In the dictionary, there seem to be Canon, Lebanon, and, of course, anonymous. We're probably safe? 
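The `shouldExclude` heuristic above works because the Scala compiler mangles anonymous classes into JVM names containing `anon` (e.g. `Outer$$anon$1`), which never correspond to user-facing API. A minimal standalone sketch of that fact (the `AnonNameCheck` object and `anonClassName` helper are ours, not part of the PR):

```scala
object AnonNameCheck {
  // An anonymous class instance; the compiler names its class something
  // like "AnonNameCheck$$anon$1", which contains the substring "anon".
  def anonClassName(): String = {
    val r = new Runnable { def run(): Unit = () }
    r.getClass.getName
  }

  // Same heuristic as in the diff: generated anonymous classes are excluded.
  def shouldExclude(name: String): Boolean = name.contains("anon")
}
```

As the reviewer jokes, a hand-written class legitimately named like `Canon` would be a false positive, but such names are rare enough in practice.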
[GitHub] spark pull request: SPARK-1305: Support persisting RDD's directly ...
Github user RongGu commented on the pull request: https://github.com/apache/spark/pull/158#issuecomment-38478522 @pwendell, I've updated the title of this PR with the prefix 'SPARK-1305:'. Also, I've created my account on the Spark JIRA. It's named RongGu.
[GitHub] spark pull request: SPARK-1128: set hadoop task properties when co...
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/101#issuecomment-38478514 Looks good to me. Merged into master - thanks!
[GitHub] spark pull request: SPARK-1128: set hadoop task properties when co...
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/101#issuecomment-38478659 Oops, spoke too soon; some changes have apparently made this PR not cleanly mergeable. Mind doing a rebase?
[GitHub] spark pull request: [SPARK-1303] [MLLIB] Added discretization capa...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/216#discussion_r10897487 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/discretization/EMDDiscretizer.scala --- @@ -0,0 +1,402 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +* http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.mllib.discretization + +import scala.collection.mutable.Stack +import org.apache.spark.SparkContext._ +import org.apache.spark.mllib.util.InfoTheory +import org.apache.spark.rdd.RDD +import org.apache.spark.storage.StorageLevel +import org.apache.spark.mllib.regression.LabeledPoint +import scala.collection.mutable + + +/** + * This class contains methods to discretize continuous values with the method proposed in + * [Fayyad and Irani, "Multi-Interval Discretization of Continuous-Valued Attributes", 1993] + */ +class EMDDiscretizer private ( --- End diff -- The name `EMD` is not a common acronym for the algorithm. `MDLP` was used in the paper and `MDL` was used in derived work. But I do think `MDL` is more confusing. Shall we call it `EntropyMinimizationDiscretizer`?
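The suggested name `EntropyMinimizationDiscretizer` reflects what the Fayyad & Irani method actually does: choose cut points that minimize the class-label entropy of the resulting intervals. As a hedged, standalone sketch of the core quantity involved (the `EntropySketch` object and `entropy` signature are illustrative, not taken from the PR):

```scala
object EntropySketch {
  // Shannon entropy (base 2) of a class-frequency histogram -- the quantity
  // the Fayyad & Irani criterion minimizes when evaluating candidate cut points.
  def entropy(freqs: Seq[Long]): Double = {
    val total = freqs.sum.toDouble
    freqs.filter(_ > 0).map { f =>
      val p = f / total
      -p * (math.log(p) / math.log(2))
    }.sum
  }
}
```

An interval whose points all share one label has entropy 0 (perfectly "pure"), while a 50/50 split of two labels has entropy 1 bit; the discretizer recursively splits where the weighted entropy drop passes the MDL stopping criterion.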
[GitHub] spark pull request: SPARK-1305: Support persisting RDD's directly ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/158#issuecomment-38479671 One or more automated tests failed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13399/
[GitHub] spark pull request: SPARK-1094 Support MiMa for reporting binary c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/207#issuecomment-38479779 Build triggered.
[GitHub] spark pull request: SPARK-1305: Support persisting RDD's directly ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/158#issuecomment-38479800 Merged build started.
[GitHub] spark pull request: SPARK-1094 Support MiMa for reporting binary c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/207#issuecomment-38479780 Build started.
[GitHub] spark pull request: SPARK-1305: Support persisting RDD's directly ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/158#issuecomment-38479965 Merged build finished.
[GitHub] spark pull request: SPARK-1094 Support MiMa for reporting binary c...
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/207#issuecomment-38480096 Looks good to me!
[GitHub] spark pull request: [SPARK-1303] [MLLIB] Added discretization capa...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/216#discussion_r10897825 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/discretization/MapAccumulator.scala --- @@ -0,0 +1,53 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +* http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.mllib.discretization + +import org.apache.spark.AccumulatorParam + +object MapAccumulator extends AccumulatorParam[Map[String, Int]] { --- End diff -- Mark it package private if this is not intended to be used by users.
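An `AccumulatorParam[Map[String, Int]]` like the one in this diff presumably merges two count maps by summing the values of shared keys, with the empty map as the zero element. A self-contained sketch of that merge semantics, written without a Spark dependency (the `MapMergeSketch` object and `mergeCounts` name are ours, not from the PR):

```scala
object MapMergeSketch {
  // Merge two count maps by summing values for shared keys -- the behavior
  // an AccumulatorParam[Map[String, Int]] would implement in addInPlace.
  def mergeCounts(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0) + v) }

  // The identity element for the merge, corresponding to the accumulator's zero.
  val zero: Map[String, Int] = Map.empty
}
```

Because the merge is commutative and associative with `zero` as identity, partial counts accumulated on different tasks can be combined in any order.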
[GitHub] spark pull request: SPARK-1305: Support persisting RDD's directly ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/158#issuecomment-38480623 Merged build triggered.