[GitHub] incubator-spark pull request: SPARK-1124: Fix infinite retries of ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/641#issuecomment-35864121 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. To do so, please top-post your response. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-spark pull request: SPARK-1124: Fix infinite retries of ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/641#issuecomment-35864120 Merged build triggered.
[GitHub] incubator-spark pull request: SPARK-1124: Fix infinite retries of ...
GitHub user mateiz opened a pull request: https://github.com/apache/incubator-spark/pull/641 SPARK-1124: Fix infinite retries of reduce stage when a map stage failed In the previous code, if you had a failing map stage and then tried to run reduce stages on it repeatedly, the first reduce stage would fail correctly, but the later ones would mistakenly believe that all map outputs are available and start failing infinitely with fetch failures from "null". See https://spark-project.atlassian.net/browse/SPARK-1124 for an example. This PR also cleans up code style slightly where there was a variable named "s" and some weird map manipulation. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mateiz/incubator-spark spark-1124-master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-spark/pull/641.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #641 commit cd32d5e4dee1291e4509e5965322b7ffe620b1f3 Author: Matei Zaharia Date: 2014-02-24T07:45:48Z SPARK-1124: Fix infinite retries of reduce stage when a map stage failed In the previous code, if you had a failing map stage and then tried to run reduce stages on it repeatedly, the first reduce stage would fail correctly, but the later ones would mistakenly believe that all map outputs are available and start failing infinitely with fetch failures from "null".
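The failure mode described in the PR can be modeled in a few lines. This is a hedged toy model, not the actual Scala DAGScheduler code: the names `MapStage`, `register_output`, and `remove_outputs_on_host` are illustrative stand-ins. The bug class is stale bookkeeping: if "all map outputs available" is never reset when outputs are lost, every later reduce stage believes the shuffle data exists and keeps retrying fetches forever.

```python
# Toy model of map-output tracking. A map stage records which host holds
# each partition's output; the stage only counts as available when every
# partition has a live location.

class MapStage:
    def __init__(self, num_tasks):
        self.num_tasks = num_tasks
        self.output_locs = [None] * num_tasks  # host holding each partition

    def register_output(self, task, host):
        self.output_locs[task] = host

    def remove_outputs_on_host(self, host):
        # The essential correction: forget outputs when a host is lost, so
        # the stage stops looking complete and gets resubmitted instead of
        # letting reduce stages fetch from a null location forever.
        self.output_locs = [loc if loc != host else None
                            for loc in self.output_locs]

    @property
    def is_available(self):
        return all(loc is not None for loc in self.output_locs)


stage = MapStage(num_tasks=2)
stage.register_output(0, "hostA")
stage.register_output(1, "hostB")
assert stage.is_available

# hostA dies. Without the invalidation step, is_available would stay True
# and every subsequent reduce stage would fail fetching from "null".
stage.remove_outputs_on_host("hostA")
assert not stage.is_available  # map stage is correctly re-run first
```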
[GitHub] incubator-spark pull request: add threadPool shutdown hook when ki...
Github user wchswchs commented on the pull request: https://github.com/apache/incubator-spark/pull/628#issuecomment-35863734 OK, I have closed it!
[GitHub] incubator-spark pull request: add threadPool shutdown hook when ki...
Github user wchswchs closed the pull request at: https://github.com/apache/incubator-spark/pull/628
[GitHub] incubator-spark pull request: add threadPool shutdown hook when ki...
Github user mateiz commented on the pull request: https://github.com/apache/incubator-spark/pull/628#issuecomment-35863650 Given this, can you close the pull request? Or do you plan to try interrupt? That may also not fix the issue.
[GitHub] incubator-spark pull request: [SPARK-1102] Create a saveAsNewAPIHa...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/incubator-spark/pull/636#discussion_r9983025 --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala --- @@ -686,6 +649,47 @@ class PairRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) } /** + * Output the RDD to any Hadoop-supported storage system with new Hadoop API, using a Hadoop + * Job object for that storage system. The Job should set an OutputFormat and any output paths + * required (e.g. a table name to write to) in the same way as it would be configured for a Hadoop + * MapReduce job. + */ + def saveAsNewAPIHadoopDataset(job: NewAPIHadoopJob) { --- End diff -- In the new Hadoop API, does this really require a Job or just a Configuration? In the old API we only needed a configuration.
[GitHub] incubator-spark pull request: [SPARK-1102] Create a saveAsNewAPIHa...
Github user mateiz commented on the pull request: https://github.com/apache/incubator-spark/pull/636#issuecomment-35863485 Jenkins, this is OK to test
[GitHub] incubator-spark pull request: SPARK-1004: PySpark on YARN
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/640#issuecomment-35863413 Merged build triggered.
[GitHub] incubator-spark pull request: SPARK-1004: PySpark on YARN
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/640#issuecomment-35863414 Merged build started.
[GitHub] incubator-spark pull request: SPARK-1004: PySpark on YARN
GitHub user sryza opened a pull request: https://github.com/apache/incubator-spark/pull/640 SPARK-1004: PySpark on YARN Make pyspark work in yarn-client mode. This builds on Josh's work. I verified that it works on a 5-node cluster. You can merge this pull request into a Git repository by running: $ git pull https://github.com/sryza/incubator-spark sandy-spark-1004 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-spark/pull/640.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #640 commit e752a6a1c8a9d7cbc31d7b911800e22db6fcb2b0 Author: Josh Rosen Date: 2014-01-24T18:19:58Z Automatically set Yarn env vars in PySpark (SPARK-1030). commit 0adcaa971086853b254baf32748811561bb6e209 Author: Josh Rosen Date: 2014-01-25T23:28:56Z WIP towards PySpark on YARN: - Remove reliance on SPARK_HOME on the workers. Only the driver should know about SPARK_HOME. On the workers, we ensure that the PySpark Python libraries are added to the PYTHONPATH. - Add a Makefile for generating a "fat zip" that contains PySpark's Python dependencies. This is a bit of a hack and I'd be open to better packaging tools, but this doesn't require any extra Python libraries. This use case doesn't seem to be well-addressed by the existing Python packaging tools: there are plenty of tools to package complete Python environments (such as pyinstaller and virtualenv) or to bundle *individual* libraries (e.g. distutils), but few to generate portable fat zips or eggs. This hasn't been tested with YARN and may not actually compile. commit d4a71d0495d072d5b5364601e7cd0dc9a7c9c9b9 Author: Josh Rosen Date: 2014-02-19T06:27:21Z Add missing setup.py file for PySpark. 
commit dcda63863a41414ba5e410092dc4fbab2e353543 Author: Sandy Ryza Date: 2014-02-24T07:06:42Z Improvements commit 38546d4f282727f3ae112f1e564df72443b726f5 Author: Sandy Ryza Date: 2014-02-24T07:26:01Z Don't set SPARK_JAR
[GitHub] incubator-spark pull request: [SPARK-1089] fix the regression prob...
Github user ScrapCodes commented on the pull request: https://github.com/apache/incubator-spark/pull/614#issuecomment-35862636 Also, following on from this, I went ahead and tried fixing the problem in Scala itself, and it worked: https://github.com/ScrapCodes/scala/tree/si-6502-fix
[GitHub] incubator-spark pull request: [SPARK-1089] fix the regression prob...
Github user ScrapCodes commented on the pull request: https://github.com/apache/incubator-spark/pull/614#issuecomment-35859704 Nice catch! And thanks for taking the time to dig into this. I am okay with this way of doing it; however, if you and others prefer, we can move this code into createInterpreter before creating SparkILoopInterpreter. Even if we don't, I think it's fine to merge.
[GitHub] incubator-spark pull request: For outputformats that are Configura...
Github user xoltar commented on the pull request: https://github.com/apache/incubator-spark/pull/638#issuecomment-35857946 Thanks, last change should address all code review comments. Also cleaned up some imports in PairRDDFunctionsSuite that weren't needed.
[GitHub] incubator-spark pull request: fix building with maven on Mac OS X
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/639#issuecomment-35856579 Can one of the admins verify this patch?
[GitHub] incubator-spark pull request: For outputformats that are Configura...
Github user pwendell commented on the pull request: https://github.com/apache/incubator-spark/pull/638#issuecomment-35856428 We should put this fix in 0.9 as well once it's ready to merge.
[GitHub] incubator-spark pull request: For outputformats that are Configura...
Github user pwendell commented on the pull request: https://github.com/apache/incubator-spark/pull/638#issuecomment-35856419 Thanks a lot for tracking this down, fixing it, and adding tests! I added some minor style comments; modulo those, LGTM.
[GitHub] incubator-spark pull request: For outputformats that are Configura...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/incubator-spark/pull/638#discussion_r9980747 --- Diff: core/src/test/scala/org/apache/spark/rdd/PairRDDFunctionsSuite.scala --- @@ -330,4 +335,74 @@ class PairRDDFunctionsSuite extends FunSuite with SharedSparkContext { (1, ArrayBuffer(1)), (2, ArrayBuffer(1 } + + test("saveNewAPIHadoopFile should call setConf if format is configurable") { +val pairs = sc.parallelize(Array((new Integer(1), new Integer(1 +val conf = new Configuration() + +//No error, non-configurable formats still work --- End diff -- Mind adding spaces after these? `// No error, non-configurable formats`... Also it would be nice (but up to you) to use `/* ... */` for multi-line comments.
[GitHub] incubator-spark pull request: For outputformats that are Configura...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/incubator-spark/pull/638#discussion_r9980751 --- Diff: core/src/test/scala/org/apache/spark/rdd/PairRDDFunctionsSuite.scala --- @@ -330,4 +335,74 @@ class PairRDDFunctionsSuite extends FunSuite with SharedSparkContext { (1, ArrayBuffer(1)), (2, ArrayBuffer(1 } + + test("saveNewAPIHadoopFile should call setConf if format is configurable") { +val pairs = sc.parallelize(Array((new Integer(1), new Integer(1 +val conf = new Configuration() + +//No error, non-configurable formats still work +pairs.saveAsNewAPIHadoopFile[FakeFormat]("ignored") + +//Configurable intercepts get configured +//ConfigTestFormat throws an exception if we try to write to it +//when setConf hasn't been thrown first. +//Assertion is in ConfigTestFormat.getRecordWriter +pairs.saveAsNewAPIHadoopFile[ConfigTestFormat]("ignored") + } +} + +// These classes are fakes for testing +// "saveNewAPIHadoopFile should call setConf if format is configurable". +// Unfortunately, they have to be top level classes, and not defined in +// the test method, because otherwise Scala won't generate no-args constructors +// and the test will therefore throw InstantiationException when saveAsNewAPIHadoopFile +// tries to instantiate them with Class.newInstance. +class FakeWriter extends RecordWriter[Integer,Integer] { --- End diff -- `Integer, Integer`
[GitHub] incubator-spark pull request: fix building with maven on Mac OS X
GitHub user witgo opened a pull request: https://github.com/apache/incubator-spark/pull/639 fix building with maven on Mac OS X The build throws: "Failure to find org.eclipse.paho:mqtt-client:jar:0.4.0 in https://repository.apache.org/content/repositories/releases was cached in the local repository, resolution will not be reattempted until the update interval of apache-repo has elapsed or updates are forced" You can merge this pull request into a Git repository by running: $ git pull https://github.com/witgo/incubator-spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-spark/pull/639.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #639 commit 27c612fb0dbbf27ba5a20d870a5cbb5cf33f4d9f Author: liguoqiang Date: 2014-02-24T04:00:36Z fix building with maven on Mac OS X throw Failure to find org.eclipse.paho:mqtt-client:jar:0.4.0 in https://repository.apache.org/content/repositories/releases was cached in the local repository, resolution will not be reattempted until the update interval of apache-repo has elapsed or updates are forced
[GitHub] incubator-spark pull request: For outputformats that are Configura...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/incubator-spark/pull/638#discussion_r9980734 --- Diff: core/src/test/scala/org/apache/spark/rdd/PairRDDFunctionsSuite.scala --- @@ -26,6 +26,11 @@ import com.google.common.io.Files import org.apache.spark.SparkContext._ import org.apache.spark.{Partitioner, SharedSparkContext} +import org.apache.hadoop.mapreduce._ --- End diff -- Mind making your new imports fit the normal style? https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports
[GitHub] incubator-spark pull request: For outputformats that are Configura...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/incubator-spark/pull/638#discussion_r9980719 --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala --- @@ -617,6 +617,10 @@ class PairRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) attemptNumber) val hadoopContext = newTaskAttemptContext(wrappedConf.value, attemptId) val format = outputFormatClass.newInstance + format match { +case c:Configurable => c.setConf(wrappedConf.value) --- End diff -- I don't think this is specific to HBase; I think this is something we should really have been doing all along, but it was only noticed because HBase relies on this configuration.
[GitHub] incubator-spark pull request: For outputformats that are Configura...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/incubator-spark/pull/638#discussion_r9980723 --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala --- @@ -617,6 +617,10 @@ class PairRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) attemptNumber) val hadoopContext = newTaskAttemptContext(wrappedConf.value, attemptId) val format = outputFormatClass.newInstance + format match { +case c:Configurable => c.setConf(wrappedConf.value) --- End diff -- Add a space after the colon: `case c: Configurable => `
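The pattern under review in this diff is small but easy to see in a self-contained model. The sketch below is Python, not the actual Scala in PairRDDFunctions, and `Configurable`, `PlainFormat`, and `ConfigurableFormat` are simplified stand-ins for the Hadoop interfaces: after instantiating the output format reflectively, check whether it is Configurable and, if so, hand it the job configuration before asking for a record writer.

```python
class Configurable:
    """Stand-in for org.apache.hadoop.conf.Configurable."""
    def set_conf(self, conf):
        self.conf = conf

class PlainFormat:
    """A format that does not need configuration."""
    def get_record_writer(self):
        return "writer"

class ConfigurableFormat(Configurable):
    """Mirrors the PR's ConfigTestFormat: fails if setConf was skipped."""
    def get_record_writer(self):
        if getattr(self, "conf", None) is None:
            raise RuntimeError("setConf was never called")
        return "configured writer"

def save(format_cls, conf):
    fmt = format_cls()                 # like outputFormatClass.newInstance
    if isinstance(fmt, Configurable):  # like: case c: Configurable =>
        fmt.set_conf(conf)             #         c.setConf(wrappedConf.value)
    return fmt.get_record_writer()

save(PlainFormat, {"key": "value"})         # non-configurable formats still work
save(ConfigurableFormat, {"key": "value"})  # configurable formats get the conf
```

Without the `isinstance` branch, `save(ConfigurableFormat, ...)` raises, which is exactly the failure mode the PR's test asserts against.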
[GitHub] incubator-spark pull request: For outputformats that are Configura...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/638#issuecomment-35854652 Merged build finished.
[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/572#issuecomment-35854653 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12824/
[GitHub] incubator-spark pull request: For outputformats that are Configura...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/638#issuecomment-35854654 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12823/
[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/572#issuecomment-35854651 Merged build finished.
Re: [DISCUSS] Extending public API
My sense on all this is that it should be done on a case-by-case basis. To add a new API, it needs to be general enough that a lot of users will want to use it. If adding that API confuses users, that’s a problem. However, on the flip side, if it’s not a super-popular function but it’s just 10-20 lines of code, it may still be worth having. The maintenance burden on that is not too high, and users are used to fairly extensive collection libraries. For the joins in particular, we added them because it’s quite easy to mess up writing joins by hand, even once you have cogroup(). One thing we do want to do is start implementing more specialized functionality, like statistics functions, in separate libraries. Right now there are some functions in the RDD API (e.g. sums, means, histograms, etc) that are fairly specific to this domain. Matei On Feb 23, 2014, at 10:18 AM, Amandeep Khurana wrote: > This makes sense. Thanks for clarifying, Mridul. > > As Sean pointed out - a contrib module quickly turns into a legacy code > base that becomes hard to maintain. From that perspective, I think the idea > of a separate sparkbank github that is maintained by Spark contributors > (along with users who wish to contribute add-ons like you've described) and > adhere to the code quality and reviews like the main project seems > appealing. And then not just sparkbank but other things that people might > want to have as a part of the project but doesn't belong to the core > codebase can go there? I don't know if things like this have come up in the > past pull requests. > > -Amandeep > > PS: I'm not a spark committer/contributor so take my opinion fwiw. :) > > > On Sun, Feb 23, 2014 at 1:40 AM, Mridul Muralidharan wrote: > >> Good point, and I was purposefully vague on that since that is something >> which our community should evolve imo : this was just an initial proposal >> :-) >> >> For example: there are multiple ways to do cartesian - and each has its own >> trade offs. 
>> >> Another candidate could be, as I mentioned, new methods which can be >> expressed as sequences of existing methods but would be slightly more >> performant if done in one shot - like the self cartesian pr, various types >> of join (which can become a contrib of its own btw!), experiments using >> key indexes, ordering, etc. >> >> Addition into sparkbank or contrib (or something better named!) does not >> preclude future migration into core ... just an initial staging area for us >> to evolve the api and get user feedback; without necessarily making spark >> core api unstable. >> >> Obviously, it is not a dumping ground for broken code/ideas ... and must >> follow same level of scrutiny and rigour before committing. >> Regards >> Mridul >> On Feb 23, 2014 11:53 AM, "Amandeep Khurana" wrote: >> >>> Mridul, >>> >>> Can you give examples of APIs that people have contributed (or wanted >>> to contribute) but you categorize as something that would go into >>> piggybank-like (sparkbank)? Curious to know how you'd decide what >>> should go where. >>> >>> Amandeep >>> On Feb 22, 2014, at 10:06 PM, Mridul Muralidharan >>> wrote: Hi, Over the past few months, I have seen a bunch of pull requests which >>> have extended spark api ... most commonly RDD itself. Most of them are either relatively niche cases of specialization (which might not be useful for most cases) or idioms which can be expressed (sometimes with minor perf penalty) using existing api. While all of them have non-zero value (hence the effort to contribute, >>> and gladly welcomed!) they are extending the api in nontrivial ways and >>> have a maintenance cost ... and we already have a pending effort to clean up >> our interfaces prior to 1.0 I believe there is a need to keep exposed api succinct, expressive and functional in spark; while at the same time, encouraging extensions and specialization within spark codebase so that other users can benefit >> from the shared contributions. 
One approach could be to start something akin to piggybank in pig to contribute user-generated specializations, helper utils, etc.: bundled >> as part of spark, but not part of core itself. Thoughts, comments? Regards, Mridul >>> >>
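Matei's point above that joins are "quite easy to mess up writing by hand, even once you have cogroup()" can be illustrated with a short sketch. This is plain Python modeling pair-RDD semantics on lists, not Spark's Scala implementation; the function names are illustrative. The cogroup-then-flatten step is only a few lines, but the cross product per key and the dropping of unmatched keys are exactly the parts people get subtly wrong.

```python
from collections import defaultdict

def cogroup(left, right):
    # For each key, collect the values from both sides:
    # key -> ([left values], [right values])
    groups = defaultdict(lambda: ([], []))
    for k, v in left:
        groups[k][0].append(v)
    for k, w in right:
        groups[k][1].append(w)
    return groups

def join(left, right):
    # Inner join on top of cogroup: only keys present on BOTH sides,
    # with the cross product of their values. Forgetting either property
    # is the classic hand-rolled-join bug.
    return [(k, (v, w))
            for k, (vs, ws) in cogroup(left, right).items()
            for v in vs
            for w in ws]

pairs = join([("a", 1), ("b", 2)], [("a", "x"), ("a", "y")])
# "b" has no match on the right, so it is dropped; "a" matches twice.
assert sorted(pairs) == [("a", (1, "x")), ("a", (1, "y"))]
```

Packaging this once in the core API, rather than asking every user to rewrite it, is the trade-off Matei describes: small maintenance cost against a common source of bugs.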
Re: standard way of running a compiled jar
Yes, it is a supported option. I’m just wondering whether we want to create a script for it specifically. Maybe the same script could also allow submitting to the cluster or something. Matei On Feb 23, 2014, at 1:55 PM, Sandy Ryza wrote: > Is the client=driver mode still a supported option (outside of the REPLs), > at least for the medium term? My impression from reading the docs is that > it's the most common, if not recommended, way to submit jobs. If that's > the case, I still think it's important, or at least helpful, to have > something for this mode that addresses the issues below. > > > On Sat, Feb 22, 2014 at 10:48 PM, Matei Zaharia > wrote: > >> Hey Sandy, >> >> In the long run, the ability to submit driver programs to run in the >> cluster (added in 0.9 as org.apache.spark.deploy.Client) might solve this. >> This is a feature currently available in the standalone mode that runs the >> driver on a worker node, but it is also how YARN works by default, and it >> wouldn't be too bad to do in Mesos. With this, the user could compile a JAR >> that excludes Spark and still get Spark on the classpath. >> >> This was added in 0.9 as a slightly harder to invoke feature mainly to be >> used for Spark Streaming (since the cluster can also automatically restart >> your driver), but we can create a script around it for submissions. I'd >> like to see a design for such a script that takes into account all the >> deploy modes though, because it would be confusing to use it one way on >> YARN and another way on standalone for instance. Already the YARN submit >> client kind of does what you're looking for. >> >> Matei >> >> On Feb 22, 2014, at 2:08 PM, Sandy Ryza wrote: >> >>> Hey All, >>> >>> I've encountered some confusion about how to run a Spark app from a >>> compiled jar and wanted to bring up the recommended way. >>> >>> It seems like the current standard options are: >>> * Build an uber jar that contains the user jar and all of Spark. 
>>> * Explicitly include the locations of the Spark jars on the client >>> machine in the classpath. >>> >>> Both of these options have a couple issues. >>> >>> For the uber jar, this means unnecessarily sending all of Spark (and its >>> dependencies) to every executor, as well as including Spark twice in the >>> executor classpaths. This also requires recompiling binaries against the >>> latest version whenever the cluster version is upgraded, lest executor >>> classpaths include two different versions of Spark at the same time. >>> >>> Explicitly including the Spark jars in the classpath is a huge pain >> because >>> their locations can vary significantly between different installations >> and >>> platforms, and makes the invocation more verbose. >>> >>> What seems ideal to me is a script that takes a user jar, sets up the >> Spark >>> classpath, and runs it. This means only the user jar gets shipped across >>> the cluster, but the user doesn't need to figure out how to get the Spark >>> jars onto the client classpath. This is similar to the "hadoop jar" >>> command commonly used for running MapReduce jobs. >>> >>> The spark-class script seems to do almost exactly this, but I've been >> told >>> it's meant only for internal Spark use (with the possible exception of >>> yarn-standalone mode). It doesn't take a user jar as an argument, but one >>> can be added by setting the SPARK_CLASSPATH variable. This script could >> be >>> stabilized for user use. >>> >>> Another option would be to have a "spark-app" script that does what >>> spark-class does, but also masks the decision of whether to run the >> driver >>> in the client process or on the cluster (both standalone and YARN have >>> modes for both of these). >>> >>> Does this all make sense? >>> -Sandy >> >>
[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/572#issuecomment-35852739 Merged build started.
[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/572#issuecomment-35852737 Merged build triggered.
[GitHub] incubator-spark pull request: For outputformats that are Configura...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/638#issuecomment-35852729 Merged build triggered.
[GitHub] incubator-spark pull request: Spark-615: make mapPartitionsWithInd...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/606#issuecomment-35852724 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12822/
[GitHub] incubator-spark pull request: For outputformats that are Configura...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/638#issuecomment-35852730 Merged build started.
[GitHub] incubator-spark pull request: Spark-615: make mapPartitionsWithInd...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/606#issuecomment-35852723 Build finished.
[GitHub] incubator-spark pull request: [SPARK-1100] prevent Spark from over...
Github user CodingCat commented on the pull request: https://github.com/apache/incubator-spark/pull/626#issuecomment-35851996 But why not just prevent users from overwriting the directory, regardless of whether it contains part-* files?
[GitHub] incubator-spark pull request: [SPARK-1100] prevent Spark from over...
Github user CodingCat commented on the pull request: https://github.com/apache/incubator-spark/pull/626#issuecomment-35851751 I just went through the Spark Streaming documentation; it seems safe to follow your suggestion, @pwendell.
[GitHub] incubator-spark pull request: For outputformats that are Configura...
Github user CodingCat commented on a diff in the pull request: https://github.com/apache/incubator-spark/pull/638#discussion_r9979453

--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -617,6 +617,10 @@ class PairRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)])
           attemptNumber)
         val hadoopContext = newTaskAttemptContext(wrappedConf.value, attemptId)
         val format = outputFormatClass.newInstance
+        format match {
+          case c: Configurable => c.setConf(wrappedConf.value)
--- End diff --

do we need some comments here to indicate that this line is to support a special case in HBase?
[GitHub] incubator-spark pull request: For outputformats that are Configura...
Github user pwendell commented on the pull request: https://github.com/apache/incubator-spark/pull/638#issuecomment-35851045 Jenkins, test this please.
[GitHub] incubator-spark pull request: For outputformats that are Configura...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/638#issuecomment-35850754 Can one of the admins verify this patch?
[GitHub] incubator-spark pull request: Spark-615: make mapPartitionsWithInd...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/606#issuecomment-35850760 Build started.
[GitHub] incubator-spark pull request: Spark-615: make mapPartitionsWithInd...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/606#issuecomment-35850759 Build triggered.
[GitHub] incubator-spark pull request: SPARK-1084 (part 1). Fix most build ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/637#issuecomment-35850735 Merged build finished.
[GitHub] incubator-spark pull request: SPARK-1084 (part 1). Fix most build ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/637#issuecomment-35850736 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12821/
[GitHub] incubator-spark pull request: For outputformats that are Configura...
GitHub user xoltar opened a pull request: https://github.com/apache/incubator-spark/pull/638

For output formats that are Configurable, call setConf before sending data to them. This allows us to use, e.g., HBase's TableOutputFormat with PairRDDFunctions.saveAsNewAPIHadoopFile, which would otherwise throw a NullPointerException because the output table name hasn't been configured.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xoltar/incubator-spark SPARK-1108

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-spark/pull/638.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #638

commit 7cbcaa10bbf01cf04bba7f2883d1fb9564cd3660
Author: Bryn Keller
Date: 2014-02-20T06:00:44Z
For outputformats that are Configurable, call setConf before sending data to them. This allows us to use, e.g. HBase TableOutputFormat, which otherwise would throw NullPointerException because the output table name hasn't been configured
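The fix boils down to one pattern match before the output format is used. Below is a minimal, self-contained sketch of that idea; note that the `Configurable` trait and `TableOutputFormatMock` class here are stand-ins I have defined for illustration, not the real Hadoop or HBase API:

```scala
// Sketch of the core idea: an output format that is Configurable must receive
// the job configuration via setConf() before use, or it sees a null config.
trait Configurable {
  def setConf(conf: Map[String, String]): Unit
}

// Mock of an HBase-style output format: fails exactly the way the PR
// describes (NullPointerException) if setConf was never called.
class TableOutputFormatMock extends Configurable {
  var conf: Map[String, String] = null
  def setConf(c: Map[String, String]): Unit = { conf = c }
  def outputTable: String = conf("output.table") // NPE if conf is still null
}

// The one-line fix from the PR: configure the format if (and only if) it
// implements Configurable; other formats pass through untouched.
def prepareFormat(format: AnyRef, conf: Map[String, String]): Unit =
  format match {
    case c: Configurable => c.setConf(conf)
    case _               =>
  }

val format = new TableOutputFormatMock
prepareFormat(format, Map("output.table" -> "my_table"))
println(format.outputTable) // prints my_table instead of throwing
```

In the real patch the same match runs in `PairRDDFunctions` right after `outputFormatClass.newInstance`, passing the wrapped Hadoop `Configuration` rather than a `Map`.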
[GitHub] incubator-spark pull request: SPARK-1122: allCollect functions for...
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/635#issuecomment-35849312 @markhamstra @pwendell For the use cases, this allCollect operation may be useful in the grid search for a good set of training parameters for machine learning problems. For example, if the dataset is only 500MB but training takes half an hour to finish and we have to try 100 different combinations of training parameters (e.g., rank, regularization constants, and termination tolerance), the wall-clock time can be reduced by distributing the dataset to multiple nodes and training in parallel. Another use case is the replicated join, though locality issues need to be addressed. I agree with you that the implementation is not efficient, which puts heavy load on the driver. @coderxiang, could you try to improve the implementation?
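The grid-search use case above can be sketched without Spark at all; here the dataset is replicated locally and the parameter combinations are what would be parallelized (the `train` function is a dummy stand-in loss, and all grid values are made up for illustration):

```scala
// Hypothetical training parameters, as in the example: rank + regularization.
case class Params(rank: Int, reg: Double)

// A small dataset (stand-in for the ~500MB one) that is cheap to replicate
// to every node.
val data: Seq[Double] = Seq(1.0, 2.0, 3.0)

// The parameter grid is what gets distributed; in Spark this would be
// roughly sc.parallelize(grid).map(p => (p, train(replicatedData, p)))
// after the dataset has been replicated to the workers.
val grid: Seq[Params] =
  for (rank <- Seq(10, 20); reg <- Seq(0.01, 0.1)) yield Params(rank, reg)

// Dummy training routine returning a fake "loss", just so the sketch runs.
def train(data: Seq[Double], p: Params): Double =
  data.sum * p.reg / p.rank

val results = grid.map(p => (p, train(data, p)))
val best = results.minBy(_._2)._1 // parameter combination with lowest loss
```

Each combination trains independently against its local copy of the data, which is why replicating a small dataset can cut the wall-clock time of trying 100 combinations.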
[GitHub] incubator-spark pull request: SPARK-1084 (part 1). Fix most build ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/637#issuecomment-35849130 Merged build started.
[GitHub] incubator-spark pull request: SPARK-1084 (part 1). Fix most build ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/637#issuecomment-35849129 Merged build triggered.
[GitHub] incubator-spark pull request: SPARK-1078: Replace lift-json with j...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/582#issuecomment-35849112 Merged build finished.
[GitHub] incubator-spark pull request: SPARK-1078: Replace lift-json with j...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/582#issuecomment-35849113 One or more automated tests failed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12820/
[GitHub] incubator-spark pull request: SPARK-1122: allCollect functions for...
Github user pwendell commented on the pull request: https://github.com/apache/incubator-spark/pull/635#issuecomment-35848534 @coderxiang btw - it might be something where we make it a private API so it can be used inside of Spark if other packages need this to do broadcast joins. It would be good to understand a bit the intended use case though.
[GitHub] incubator-spark pull request: SPARK-1084 (part 1). Fix most build ...
GitHub user srowen opened a pull request: https://github.com/apache/incubator-spark/pull/637

SPARK-1084 (part 1). Fix most build warnings.

This is a redo of https://github.com/apache/incubator-spark/pull/586 This contains all the same changes, minus dependency changes. It also rebases and squashes some commits that could be combined. After this is in I'll propose part 2, which concerns dependencies.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/incubator-spark SPARK-1084.1

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-spark/pull/637.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #637

commit 2e52f136474abf911472af2bb639d704605cd171
Author: Sean Owen
Date: 2014-02-11T14:37:03Z
Replace deprecated Ant with

commit a82b841df207128aec23ae9eb3a297e41d1bcc49
Author: Sean Owen
Date: 2014-02-11T14:38:23Z
Remove dead scaladoc links

commit 3b7b2ad9c9a2536da51a1b6af7ebf2aff77fef32
Author: Sean Owen
Date: 2014-02-11T14:39:48Z
Fix scaladoc invocation warning, and enable javac warnings properly, with plugin config updates

commit b5ccbc9c6360437afabcfc14e81321e5b7b38e4c
Author: Sean Owen
Date: 2014-02-12T13:45:04Z
Fix one new style error introduced in scaladoc warning commit

commit 79f1c7acdb9634128d417d704a234058d2993bea
Author: Sean Owen
Date: 2014-02-23T21:27:02Z
Fix two misc javadoc problems

commit ee1c1150d482243c190c71931852f2797ec79120
Author: Sean Owen
Date: 2014-02-23T23:27:21Z
Suppress warnings about legitimate unchecked array creations, or change code to avoid it
[GitHub] incubator-spark pull request: SPARK-1078: Replace lift-json with j...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/582#issuecomment-35847834 Merged build triggered.
[GitHub] incubator-spark pull request: SPARK-1078: Replace lift-json with j...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/582#issuecomment-35847835 Merged build started.
[GitHub] incubator-spark pull request: SPARK-1078: Replace lift-json with j...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/582#issuecomment-35847506 Merged build finished.
[GitHub] incubator-spark pull request: SPARK-1078: Replace lift-json with j...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/582#issuecomment-35847507 One or more automated tests failed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12819/
[GitHub] incubator-spark pull request: SPARK-1078: Replace lift-json with j...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/582#issuecomment-35845761 Merged build started.
[GitHub] incubator-spark pull request: SPARK-1078: Replace lift-json with j...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/582#issuecomment-35845760 Merged build triggered.
Re: standard way of running a compiled jar
Is the client=driver mode still a supported option (outside of the REPLs), at least for the medium term? My impression from reading the docs is that it's the most common, if not recommended, way to submit jobs. If that's the case, I still think it's important, or at least helpful, to have something for this mode that addresses the issues below. On Sat, Feb 22, 2014 at 10:48 PM, Matei Zaharia wrote: > Hey Sandy, > > In the long run, the ability to submit driver programs to run in the > cluster (added in 0.9 as org.apache.spark.deploy.Client) might solve this. > This is a feature currently available in the standalone mode that runs the > driver on a worker node, but it is also how YARN works by default, and it > wouldn't be too bad to do in Mesos. With this, the user could compile a JAR > that excludes Spark and still get Spark on the classpath. > > This was added in 0.9 as a slightly harder to invoke feature mainly to be > used for Spark Streaming (since the cluster can also automatically restart > your driver), but we can create a script around it for submissions. I'd > like to see a design for such a script that takes into account all the > deploy modes though, because it would be confusing to use it one way on > YARN and another way on standalone for instance. Already the YARN submit > client kind of does what you're looking for. > > Matei > > On Feb 22, 2014, at 2:08 PM, Sandy Ryza wrote: > > > Hey All, > > > > I've encountered some confusion about how to run a Spark app from a > > compiled jar and wanted to bring up the recommended way. > > > > It seems like the current standard options are: > > * Build an uber jar that contains the user jar and all of Spark. > > * Explicitly include the locations of the Spark jars on the client > > machine in the classpath. > > > > Both of these options have a couple issues. 
> > > > For the uber jar, this means unnecessarily sending all of Spark (and its > > dependencies) to every executor, as well as including Spark twice in the > > executor classpaths. This also requires recompiling binaries against the > > latest version whenever the cluster version is upgraded, lest executor > > classpaths include two different versions of Spark at the same time. > > > > Explicitly including the Spark jars in the classpath is a huge pain > because > > their locations can vary significantly between different installations > and > > platforms, and makes the invocation more verbose. > > > > What seems ideal to me is a script that takes a user jar, sets up the > Spark > > classpath, and runs it. This means only the user jar gets shipped across > > the cluster, but the user doesn't need to figure out how to get the Spark > > jars onto the client classpath. This is similar to the "hadoop jar" > > command commonly used for running MapReduce jobs. > > > > The spark-class script seems to do almost exactly this, but I've been > told > > it's meant only for internal Spark use (with the possible exception of > > yarn-standalone mode). It doesn't take a user jar as an argument, but one > > can be added by setting the SPARK_CLASSPATH variable. This script could > be > > stabilized for user use. > > > > Another option would be to have a "spark-app" script that does what > > spark-class does, but also masks the decision of whether to run the > driver > > in the client process or on the cluster (both standalone and YARN have > > modes for both of these). > > > > Does this all make sense? > > -Sandy > >
[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35844934 Actually, if somebody creates a ticket for me on https://github.com/fommil/jniloader that's the best way to ensure that I'll actually update the license and release it. I would prefer to use Mozilla if you are happy with that, so please do let me know what you discover. See http://www.apache.org/legal/resolved.html#category-b
[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35844613 @srowen hehe, oh, I know. Actually I'm more interested in knowing exactly *why* they don't like LGPL. There have been so many discussions in the past between FSF and ASF that they don't quite appreciate that the rest of us don't understand either side's goals or have the memory of those previous discussions. I am at least confident that the thread has dusted off a lot of misconceptions about the LGPL and ASF's licensing goals. Re: Mozilla license, it's definitely listed under category B in that list. Don't worry, `JNILoader` can be made with AL2 if it needs to be... it's only a file or two anyway.
[GitHub] incubator-spark pull request: SPARK-1084. Fix most build warnings
Github user srowen closed the pull request at: https://github.com/apache/incubator-spark/pull/586
[GitHub] incubator-spark pull request: SPARK-1084. Fix most build warnings
Github user srowen commented on the pull request: https://github.com/apache/incubator-spark/pull/586#issuecomment-35843240 OK I'm going to come back with two PRs. One will have the squashed final output of this PR, and the other will have the parts related to dependencies (which are now quite trivial I think).
[GitHub] incubator-spark pull request: SPARK-1071: Tidy logging strategy an...
Github user asfgit closed the pull request at: https://github.com/apache/incubator-spark/pull/570
[GitHub] incubator-spark pull request: [SPARK-1100] prevent Spark from over...
Github user CodingCat commented on the pull request: https://github.com/apache/incubator-spark/pull/626#issuecomment-35842285 @pwendell the second situation can be avoided, sorry, my mistake. The only remaining issue is whether any component relies on the fact that Spark previously allowed overwriting the directory~
[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...
Github user dlwh commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35842233 @srowen @fommil Breeze is flexible enough that we can swap out different back ends quickly (and let users decide at runtime). So if need be, I can do the work to make both jblas and netlib-java work.
[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...
Github user srowen commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35842122 @fommil ASF is silent on the MPL: http://www.apache.org/legal/resolved.html#category-a But Mozilla says it's compatible with AL2: http://www.mozilla.org/MPL/license-policy.html Given the nature of the MPL, I suspect there is no issue. But IANAL. Sam, you see what happens when you poke the hornet's nest! I can tell you have pointed opinions about licensing, and I encourage you to argue the case as long as you care to. The squabble is unlikely to conclude with ASF beards saying "LGPL is cool". I suggest filing a calm second JIRA to ask if there is any official stance on MPL, as that may solve the issue. (Want me to do it?) If not, I think Spark should just go with a different library.
[GitHub] incubator-spark pull request: SPARK-1122: allCollect functions for...
Github user pwendell commented on the pull request: https://github.com/apache/incubator-spark/pull/635#issuecomment-35841766 Hey @coderxiang - this is interesting functionality but I'm -1 on including it in the standard API. The main reason is that this will perform poorly on most large datasets and make it easy for people to shoot themselves in the foot. A second reason is that the use case isn't totally clear - as per some of @markhamstra's comments.
[GitHub] incubator-spark pull request: [SPARK-1100] prevent Spark from over...
Github user CodingCat commented on the pull request: https://github.com/apache/incubator-spark/pull/626#issuecomment-35841703 @pwendell Thanks for the comments. I also considered what you mentioned, but will that prevent other components like Spark Streaming from doing their job correctly? (I'm not familiar with streaming, but it seems that it will overwrite the existing directory...) Also, how do we prevent the situation where the user accidentally runs the job over the same directory twice, but with different partition counts (the second run having the smaller value)? The directory would end up containing results from both runs.
[GitHub] incubator-spark pull request: [SPARK-1100] prevent Spark from over...
Github user pwendell commented on the pull request: https://github.com/apache/incubator-spark/pull/626#issuecomment-35841445 Hey @CodingCat, this approach has a few drawbacks. First, it will mean a pretty bad regression for some users. For instance, say a user is calling saveAsHadoopFile(/my-dir) and that directory has some other random stuff in it as well. Previously Spark would have written its files alongside the other stuff, but with this patch it will silently delete the other data and recreate the directory. Second, this changes the APIs all over the place, which we are trying not to do. Third, it's a little scary to have code in Spark that's deleting HDFS directories - I'd rather make the user do it explicitly. What if we did the following: we look in the output directory and see if there are any part-XX files in there already, and if so we throw an exception and say that the directory already has output data in it.
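The guard pwendell proposes could be sketched roughly as follows. This is a hypothetical, standalone illustration against the local filesystem; real Spark code would go through org.apache.hadoop.fs.FileSystem, and the function name here is made up:

```scala
import java.io.File

// Hypothetical sketch: fail fast if the output directory already holds
// part files, instead of deleting anything on the user's behalf.
def assertNoExistingOutput(dir: String): Unit = {
  val partFiles = Option(new File(dir).listFiles()).getOrElse(Array.empty[File])
    .map(_.getName)
    .filter(_.startsWith("part-"))
  if (partFiles.nonEmpty) {
    throw new IllegalStateException(
      s"Output directory $dir already contains output data: " + partFiles.mkString(", "))
  }
}
```

The design point is that Spark would fail fast rather than delete user data; clearing the directory stays the user's explicit responsibility.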
[GitHub] incubator-spark pull request: SPARK-1071: Tidy logging strategy an...
Github user pwendell commented on the pull request: https://github.com/apache/incubator-spark/pull/570#issuecomment-35840761 @srowen thanks for this clean-up. I'm going to merge this into master.
[GitHub] incubator-spark pull request: SPARK-1071: Tidy logging strategy an...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/570#issuecomment-35839844 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12818/
[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35839024 @mengxr looking through all the Apache authorised licenses, it would appear that the Mozilla license is a better fit with my goals since it would require distributors to make source code available if they make any modifications to `JNILoader`. Does that fit well with your project's goals? I'd rather have this than the Apache License.
[GitHub] incubator-spark pull request: [SPARK-1100] prevent Spark from over...
Github user CodingCat commented on the pull request: https://github.com/apache/incubator-spark/pull/626#issuecomment-35838665 OK, fixed some bugs and squashed the commits, I think it's ready for further review
Re: [DISCUSS] Extending public API
This makes sense. Thanks for clarifying, Mridul. As Sean pointed out - a contrib module quickly turns into a legacy code base that becomes hard to maintain. From that perspective, I think the idea of a separate sparkbank GitHub repository that is maintained by Spark contributors (along with users who wish to contribute add-ons like you've described) and adheres to the same code quality and review standards as the main project seems appealing. And then not just sparkbank but other things that people might want to have as part of the project but that don't belong in the core codebase could go there? I don't know if things like this have come up in past pull requests. -Amandeep PS: I'm not a Spark committer/contributor so take my opinion fwiw. :) On Sun, Feb 23, 2014 at 1:40 AM, Mridul Muralidharan wrote: > Good point, and I was purposefully vague on that since that is something > which our community should evolve imo : this was just an initial proposal > :-) > > For example: there are multiple ways to do cartesian - and each has its own > trade offs. > > Another candidate could be, as I mentioned, new methods which can be > expressed as sequences of existing methods but would be slightly more > performant if done in one shot - like the self cartesian pr, various types > of join (which can become a contrib of its own btw !), experiments using > key indexes, ordering, etc. > > Addition into sparkbank or contrib (or something better named !) does not > preclude future migration into core ... just an initial staging area for us > to evolve the api and get user feedback; without necessarily making spark > core api unstable. > > Obviously, it is not a dumping ground for broken code/ideas ... and must > follow same level of scrutiny and rigour before committing. 
> Regards > Mridul > On Feb 23, 2014 11:53 AM, "Amandeep Khurana" wrote: > > > Mridul, > > > > Can you give examples of APIs that people have contributed (or wanted > > to contribute) but you categorize as something that would go into > > piggybank-like (sparkbank)? Curious to know how you'd decide what > > should go where. > > > > Amandeep > > > > > On Feb 22, 2014, at 10:06 PM, Mridul Muralidharan > > wrote: > > > > > > Hi, > > > > > > Over the past few months, I have seen a bunch of pull requests which > > have > > > extended spark api ... most commonly RDD itself. > > > > > > Most of them are either relatively niche cases of specialization (which > > > might not be useful for most cases) or idioms which can be expressed > > > (sometimes with minor perf penalty) using existing api. > > > > > > While all of them have non-zero value (hence the effort to contribute, > > and > > > gladly welcomed !) they are extending the api in nontrivial ways and > > have a > > > maintenance cost ... and we already have a pending effort to clean up > our > > > interfaces prior to 1.0 > > > > > > I believe there is a need to keep exposed api succinct, expressive and > > > functional in spark; while at the same time, encouraging extensions and > > > specialization within spark codebase so that other users can benefit > from > > > the shared contributions. > > > > > > One approach could be to start something akin to piggybank in pig to > > > contribute user generated specializations, helper utils, etc : bundled > as > > > part of spark, but not part of core itself. > > > > > > Thoughts, comments ? > > > > > > Regards, > > > Mridul > > >
[GitHub] incubator-spark pull request: SPARK-1084. Fix most build warnings
Github user aarondav commented on the pull request: https://github.com/apache/incubator-spark/pull/586#issuecomment-35838441 Ah, great, that'll make it simple. We can only merge at the granularity of PRs, so it'd be great if you could split the dependency stuff into its own PR.
[GitHub] incubator-spark pull request: SPARK-1071: Tidy logging strategy an...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/570#issuecomment-35838298 Merged build started.
[GitHub] incubator-spark pull request: SPARK-1071: Tidy logging strategy an...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/570#issuecomment-35838296 Merged build triggered.
Re: Anyone wants to look at SPARK-1123?
OK, I know where I was wrong Best, -- Nan Zhu Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Sunday, February 23, 2014 at 12:50 PM, Nan Zhu wrote: > String, it should be get the following helper function > > private[spark] def getKeyClass() = implicitly[ClassTag[K]].runtimeClass > > private[spark] def getValueClass() = implicitly[ClassTag[V]].runtimeClass > > and this is what I run > > scala> val a = sc.textFile("/Users/nanzhu/code/incubator-spark/LICENSE", > 2).map(line => ("a", "b")) > > scala> a.saveAsNewAPIHadoopFile("/Users/nanzhu/code/output_rdd") > java.lang.InstantiationException > at > sun.reflect.InstantiationExceptionConstructorAccessorImpl.newInstance(InstantiationExceptionConstructorAccessorImpl.java:48) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at java.lang.Class.newInstance(Class.java:374) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:632) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:590) > at $iwC$$iwC$$iwC$$iwC.(:15) > at $iwC$$iwC$$iwC.(:20) > at $iwC$$iwC.(:22) > at $iwC.(:24) > at (:26) > at .(:30) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:774) > at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1042) > at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:611) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:642) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:606) > at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:790) > at > 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:835) > at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:747) > at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:595) > at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:602) > at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:605) > at > org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:928) > at > org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:878) > at > org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:878) > at > scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) > at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:878) > at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:970) > at org.apache.spark.repl.Main$.main(Main.scala:31) > at org.apache.spark.repl.Main.main(Main.scala) > > > > > > > -- > Nan Zhu > > > On Sunday, February 23, 2014 at 11:06 AM, Nick Pentreath wrote: > > > Hi > > > > What KeyClass and ValueClass are you trying to save as the keys/values of > > your dataset? > > > > > > > > On Sun, Feb 23, 2014 at 10:48 AM, Nan Zhu > (mailto:zhunanmcg...@gmail.com)> wrote: > > > > > Hi, all > > > > > > I found the weird thing on saveAsNewAPIHadoopFile in > > > PairRDDFunctions.scala when working on the other issue, > > > > > > saveAsNewAPIHadoopFile throws java.lang.InstantiationException all the > > > time > > > > > > I checked the commit history of the file, it seems that the API exists for > > > a long time, no one else found this? (that's the reason I'm confusing) > > > > > > Best, > > > > > > -- > > > Nan Zhu > > > > > > > > > > > > >
Re: Anyone wants to look at SPARK-1123?
String; the key and value classes should be obtained via the following helper functions: private[spark] def getKeyClass() = implicitly[ClassTag[K]].runtimeClass private[spark] def getValueClass() = implicitly[ClassTag[V]].runtimeClass and this is what I ran: scala> val a = sc.textFile("/Users/nanzhu/code/incubator-spark/LICENSE", 2).map(line => ("a", "b")) scala> a.saveAsNewAPIHadoopFile("/Users/nanzhu/code/output_rdd") java.lang.InstantiationException at sun.reflect.InstantiationExceptionConstructorAccessorImpl.newInstance(InstantiationExceptionConstructorAccessorImpl.java:48) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at java.lang.Class.newInstance(Class.java:374) at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:632) at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:590) at $iwC$$iwC$$iwC$$iwC.(:15) at $iwC$$iwC$$iwC.(:20) at $iwC$$iwC.(:22) at $iwC.(:24) at (:26) at .(:30) at .() at .(:7) at .() at $print() at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:774) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1042) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:611) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:642) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:606) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:790) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:835) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:747) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:595) at 
org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:602) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:605) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:928) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:878) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:878) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:878) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:970) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) -- Nan Zhu On Sunday, February 23, 2014 at 11:06 AM, Nick Pentreath wrote: > Hi > > What KeyClass and ValueClass are you trying to save as the keys/values of > your dataset? > > > > On Sun, Feb 23, 2014 at 10:48 AM, Nan Zhu (mailto:zhunanmcg...@gmail.com)> wrote: > > > Hi, all > > > > I found the weird thing on saveAsNewAPIHadoopFile in > > PairRDDFunctions.scala when working on the other issue, > > > > saveAsNewAPIHadoopFile throws java.lang.InstantiationException all the time > > > > I checked the commit history of the file, it seems that the API exists for > > a long time, no one else found this? (that's the reason I'm confusing) > > > > Best, > > > > -- > > Nan Zhu > > > > >
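For background on the helpers Nan quotes: they recover the runtime key/value classes from implicit ClassTag evidence. A minimal self-contained sketch of that mechanism (the class name PairClassInfo is hypothetical, standing in for PairRDDFunctions):

```scala
import scala.reflect.ClassTag

// Minimal stand-in for how PairRDDFunctions-style code recovers the runtime
// key/value classes from implicit ClassTag evidence supplied by the compiler.
class PairClassInfo[K: ClassTag, V: ClassTag] {
  def getKeyClass(): Class[_] = implicitly[ClassTag[K]].runtimeClass
  def getValueClass(): Class[_] = implicitly[ClassTag[V]].runtimeClass
}
```

In the failing example both key and value are String, so the classes resolve fine here; that suggests the InstantiationException arises elsewhere in the save path (e.g. when instantiating the output format class), not from these helpers.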
[GitHub] incubator-spark pull request: SPARK-1071: Tidy logging strategy an...
Github user srowen commented on the pull request: https://github.com/apache/incubator-spark/pull/570#issuecomment-35837259 @pwendell I addressed the last point about pulling slf4j-log4j12 up into core (non-test), and the indentation issue. Tests look good.
[GitHub] incubator-spark pull request: SPARK-1071: Tidy logging strategy an...
Github user srowen commented on a diff in the pull request: https://github.com/apache/incubator-spark/pull/570#discussion_r9976665 --- Diff: project/SparkBuild.scala --- @@ -236,13 +236,15 @@ object SparkBuild extends Build { publishLocalBoth <<= Seq(publishLocal in MavenCompile, publishLocal).dependOn ) ++ net.virtualvoid.sbt.graph.Plugin.graphSettings ++ ScalaStyleSettings - val slf4jVersion = "1.7.2" + val slf4jVersion = "1.7.5" val excludeCglib = ExclusionRule(organization = "org.sonatype.sisu.inject") val excludeJackson = ExclusionRule(organization = "org.codehaus.jackson") val excludeNetty = ExclusionRule(organization = "org.jboss.netty") val excludeAsm = ExclusionRule(organization = "asm") val excludeSnappy = ExclusionRule(organization = "org.xerial.snappy") + val excludeCommonsLogging = ExclusionRule(organization = "commons-logging") + val excludeSLF4J = ExclusionRule(organization = "org.slf4j") --- End diff -- @pwendell What I see left are dependencies from third-party libraries on slf4j-api, which is fine. Most depend on 1.7.5 (so it's good that the version in Spark is bumped to 1.7.5), and a few use 1.6.x, which should be entirely compatible. It's also OK for dependencies to have slf4j-log4j12. So AFAICT it's fine in this regard.
[GitHub] incubator-spark pull request: SPARK-1078: Replace lift-json with j...
Github user willb commented on the pull request: https://github.com/apache/incubator-spark/pull/582#issuecomment-35837078 Yes, I'll make the changes today. Thanks, Aaron!
[GitHub] incubator-spark pull request: MLLIB-25: Implicit ALS runs out of m...
Github user MLnick commented on the pull request: https://github.com/apache/incubator-spark/pull/629#issuecomment-35835626 @srowen good catch, thanks Sean. Didn't really think about this when I wrote it. Shows that testing on larger scale input data / params is always required!
ask for receiving spark user mailing list
Hi, I would like to subscribe to the Spark user mailing list. -- Thanks, 王联辉 (Lianhui Wang) Blog: http://blog.csdn.net/lance_123 Interests: databases, distributed systems, data mining, programming languages, Internet technologies, etc.
Re: Anyone wants to look at SPARK-1123?
Hi What KeyClass and ValueClass are you trying to save as the keys/values of your dataset? On Sun, Feb 23, 2014 at 10:48 AM, Nan Zhu wrote: > Hi, all > > I found the weird thing on saveAsNewAPIHadoopFile in > PairRDDFunctions.scala when working on the other issue, > > saveAsNewAPIHadoopFile throws java.lang.InstantiationException all the time > > I checked the commit history of the file, it seems that the API exists for > a long time, no one else found this? (that's the reason I'm confusing) > > Best, > > -- > Nan Zhu > >
[GitHub] incubator-spark pull request: Add Security to Spark - Akka, Http, ...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/incubator-spark/pull/332#discussion_r9975936 --- Diff: core/src/main/scala/org/apache/spark/network/ConnectionManager.scala --- @@ -483,10 +496,131 @@ private[spark] class ConnectionManager(port: Int, conf: SparkConf) extends Loggi /*handleMessage(connection, message)*/ } - private def handleMessage(connectionManagerId: ConnectionManagerId, message: Message) { + private def handleClientAuthNeg( + waitingConn: SendingConnection, + securityMsg: SecurityMessage, + connectionId : ConnectionId) { +if (waitingConn.isSaslComplete()) { + logDebug("Client sasl completed for id: " + waitingConn.connectionId) + connectionsAwaitingSasl -= waitingConn.connectionId + waitingConn.getAuthenticated().synchronized { +waitingConn.getAuthenticated().notifyAll(); + } + return +} else { + var replyToken : Array[Byte] = null + try { +replyToken = waitingConn.sparkSaslClient.saslResponse(securityMsg.getToken); +if (waitingConn.isSaslComplete()) { + logDebug("Client sasl completed after evaluate for id: " + waitingConn.connectionId) + connectionsAwaitingSasl -= waitingConn.connectionId + waitingConn.getAuthenticated().synchronized { +waitingConn.getAuthenticated().notifyAll() + } + return +} +var securityMsgResp = SecurityMessage.fromResponse(replyToken, securityMsg.getConnectionId) +var message = securityMsgResp.toBufferMessage +if (message == null) throw new Exception("Error creating security message") +sendSecurityMessage(waitingConn.getRemoteConnectionManagerId(), message) + } catch { +case e: Exception => { + logError("Error doing sasl client: " + e) + waitingConn.close() + throw new Exception("error evaluating sasl response: " + e) +} + } +} + } + + private def handleServerAuthNeg( + connection: Connection, + securityMsg: SecurityMessage, + connectionId: ConnectionId) { +if (!connection.isSaslComplete()) { + logDebug("saslContext not established") + var replyToken : Array[Byte] = null + try { 
+connection.synchronized { + if (connection.sparkSaslServer == null) { +logDebug("Creating sasl Server") +connection.sparkSaslServer = new SparkSaslServer(securityManager) + } +} +replyToken = connection.sparkSaslServer.response(securityMsg.getToken) +if (connection.isSaslComplete()) { + logDebug("Server sasl completed: " + connection.connectionId) +} else { + logDebug("Server sasl not completed: " + connection.connectionId) +} +if (replyToken != null) { + var securityMsgResp = SecurityMessage.fromResponse(replyToken, securityMsg.getConnectionId) + var message = securityMsgResp.toBufferMessage + if (message == null) throw new Exception("Error creating security Message") + sendSecurityMessage(connection.getRemoteConnectionManagerId(), message) +} + } catch { +case e: Exception => { + logError("Error in server auth negotiation: " + e) + // It would probably be better to send an error message telling other side auth failed + // but for now just close + connection.close() +} + } +} else { + logDebug("connection already established for this connection id: " + connection.connectionId) +} + } + + + private def handleAuthentication(conn: Connection, bufferMessage: BufferMessage): Boolean = { +if (bufferMessage.isSecurityNeg) { + logDebug("This is security neg message") + + // parse as SecurityMessage + val securityMsg = SecurityMessage.fromBufferMessage(bufferMessage) + val connectionId = new ConnectionId(securityMsg.getConnectionId) + + connectionsAwaitingSasl.get(connectionId) match { +case Some(waitingConn) => { + // Client - this must be in response to us doing Send + logDebug("Client handleAuth for id: " + waitingConn.connectionId) + handleClientAuthNeg(waitingConn, securityMsg, connectionId) +} +case None => { + // Server - someone sent us something and we haven't authenticated yet + logDebug("Server handleAuth for id: " + connectionId) + handleServerAuthNeg(conn, securityMsg, connectionId) +}
[GitHub] incubator-spark pull request: SPARK-1084. Fix most build warnings
Github user srowen commented on a diff in the pull request: https://github.com/apache/incubator-spark/pull/586#discussion_r9975107 --- Diff: project/SparkBuild.scala --- @@ -340,7 +336,8 @@ object SparkBuild extends Build { def streamingSettings = sharedSettings ++ Seq( name := "spark-streaming", libraryDependencies ++= Seq( - "commons-io" % "commons-io" % "2.4" + "commons-io" % "commons-io" % "2.4", + "org.codehaus.jackson" % "jackson-mapper-asl" % "1.9.11" --- End diff -- Also, then I don't see a particular reason to bother excluding jackson (1.8.8) dependencies from Hadoop. It could be a problem to have no Jackson at all. I can undo that.
[GitHub] incubator-spark pull request: SPARK-1084. Fix most build warnings
Github user srowen commented on a diff in the pull request: https://github.com/apache/incubator-spark/pull/586#discussion_r9975099 --- Diff: project/SparkBuild.scala --- @@ -340,7 +336,8 @@ object SparkBuild extends Build { def streamingSettings = sharedSettings ++ Seq( name := "spark-streaming", libraryDependencies ++= Seq( - "commons-io" % "commons-io" % "2.4" + "commons-io" % "commons-io" % "2.4", + "org.codehaus.jackson" % "jackson-mapper-asl" % "1.9.11" --- End diff -- This was just making the sbt build consistent with Maven. But yeah on second glance it does look like Streaming doesn't even use Jackson! This can be removed in both places. Commons IO is used. I'll wait on your comment about splitting into a PR to move forward with fixes like this in this PR.
Re: [DISCUSS] Extending public API
Good point, and I was purposefully vague on that since that is something which our community should evolve imo : this was just an initial proposal :-) For example: there are multiple ways to do cartesian - and each has its own trade offs. Another candidate could be, as I mentioned, new methods which can be expressed as sequences of existing methods but would be slightly more performant if done in one shot - like the self cartesian pr, various types of join (which can become a contrib of its own btw !), experiments using key indexes, ordering, etc. Addition into sparkbank or contrib (or something better named !) does not preclude future migration into core ... just an initial staging area for us to evolve the api and get user feedback; without necessarily making spark core api unstable. Obviously, it is not a dumping ground for broken code/ideas ... and must follow same level of scrutiny and rigour before committing. Regards Mridul On Feb 23, 2014 11:53 AM, "Amandeep Khurana" wrote: > Mridul, > > Can you give examples of APIs that people have contributed (or wanted > to contribute) but you categorize as something that would go into > piggybank-like (sparkbank)? Curious to know how you'd decide what > should go where. > > Amandeep > > > On Feb 22, 2014, at 10:06 PM, Mridul Muralidharan > wrote: > > > > Hi, > > > > Over the past few months, I have seen a bunch of pull requests which > have > > extended spark api ... most commonly RDD itself. > > > > Most of them are either relatively niche cases of specialization (which > > might not be useful for most cases) or idioms which can be expressed > > (sometimes with minor perf penalty) using existing api. > > > > While all of them have non-zero value (hence the effort to contribute, > and > > gladly welcomed !) they are extending the api in nontrivial ways and > have a > > maintenance cost ... 
and we already have a pending effort to clean up our > > interfaces prior to 1.0 > > > > I believe there is a need to keep exposed api succinct, expressive and > > functional in spark; while at the same time, encouraging extensions and > > specialization within spark codebase so that other users can benefit from > > the shared contributions. > > > > One approach could be to start something akin to piggybank in pig to > > contribute user generated specializations, helper utils, etc : bundled as > > part of spark, but not part of core itself. > > > > Thoughts, comments ? > > > > Regards, > > Mridul >
[GitHub] incubator-spark pull request: SPARK-1084. Fix most build warnings
Github user srowen commented on the pull request: https://github.com/apache/incubator-spark/pull/586#issuecomment-35827729 @aarondav Sure, it's already split into commits, and one of them has the dependency changes: https://github.com/srowen/incubator-spark/commit/6f2f67974bfedd40bafccd77abd0860dcbba4061 Move this to another PR?
[GitHub] incubator-spark pull request: SPARK-1071: Tidy logging strategy an...
Github user srowen commented on a diff in the pull request: https://github.com/apache/incubator-spark/pull/570#discussion_r9975070 --- Diff: project/SparkBuild.scala --- @@ -236,13 +236,15 @@ object SparkBuild extends Build { publishLocalBoth <<= Seq(publishLocal in MavenCompile, publishLocal).dependOn ) ++ net.virtualvoid.sbt.graph.Plugin.graphSettings ++ ScalaStyleSettings - val slf4jVersion = "1.7.2" + val slf4jVersion = "1.7.5" val excludeCglib = ExclusionRule(organization = "org.sonatype.sisu.inject") val excludeJackson = ExclusionRule(organization = "org.codehaus.jackson") val excludeNetty = ExclusionRule(organization = "org.jboss.netty") val excludeAsm = ExclusionRule(organization = "asm") val excludeSnappy = ExclusionRule(organization = "org.xerial.snappy") + val excludeCommonsLogging = ExclusionRule(organization = "commons-logging") + val excludeSLF4J = ExclusionRule(organization = "org.slf4j") --- End diff -- I thought I got all of them but let me double-check with mvn dependency:tree, and then check that the sbt build does the same.
[GitHub] incubator-spark pull request: SPARK-1071: Tidy logging strategy an...
Github user srowen commented on a diff in the pull request: https://github.com/apache/incubator-spark/pull/570#discussion_r9975071

--- Diff: project/SparkBuild.scala ---
@@ -268,9 +272,9 @@ object SparkBuild extends Build {
     "it.unimi.dsi" % "fastutil" % "6.4.4",
     "colt" % "colt" % "1.2.0",
     "org.apache.mesos" % "mesos" % "0.13.0",
-    "net.java.dev.jets3t" % "jets3t" % "0.7.1",
+    "net.java.dev.jets3t" % "jets3t" % "0.7.1" excludeAll(excludeCommonsLogging),
     "org.apache.derby" % "derby" % "10.4.2.0" % "test",
-    "org.apache.hadoop" % hadoopClient % hadoopVersion excludeAll(excludeJackson, excludeNetty, excludeAsm, excludeCglib),
+    "org.apache.hadoop" % hadoopClient % hadoopVersion excludeAll(excludeJackson, excludeNetty, excludeAsm, excludeCglib, excludeCommonsLogging, excludeSLF4J),
--- End diff --

Will do.
[GitHub] incubator-spark pull request: SPARK-1071: Tidy logging strategy an...
Github user srowen commented on a diff in the pull request: https://github.com/apache/incubator-spark/pull/570#discussion_r9975073

--- Diff: bagel/pom.xml ---
@@ -51,6 +51,11 @@
       <artifactId>scalacheck_${scala.binary.version}</artifactId>
       <scope>test</scope>
     </dependency>
+    <dependency>
+      <groupId>org.slf4j</groupId>
+      <artifactId>slf4j-log4j12</artifactId>
+      <scope>test</scope>
+    </dependency>
--- End diff --

Yeah, I think that's best; will modify it accordingly.
Re: [DISCUSS] Extending public API
Thank you for bringing this up. I think the current committers are bravely facing down a flood of PRs, and this (among other things) is a step that needs to be taken to scale up and keep this fun. I'd love to have a separate discussion about more steps, but here I offer two bits of advice from experience.

First, you most certainly can and should say 'no' to some changes. It's part of keeping the project coherent. It's always good to try to include all contributions, but appreciating contributions does not always mean accepting them. I have seen projects turned to mush by the 'anything's welcome' mentality. Push back on contributors to contribute the thing you think is right. Please keep the API succinct, yes.

Second, contrib/ modules are problematic. They become a ball of legacy code that you still have to keep maintaining to compile and run. In a world of GitHub, I think 'contrib' stuff just belongs in other repos. I know it sounds harmless to have a contrib, but I think you'd find the consensus here is that contrib is a mistake.

$0.02
--
Sean Owen | Director, Data Science | London

On Sun, Feb 23, 2014 at 6:06 AM, Mridul Muralidharan wrote:
> Hi,
>
> Over the past few months, I have seen a bunch of pull requests which have
> extended the Spark API ... most commonly RDD itself.
>
> Most of them are either relatively niche cases of specialization (which
> might not be useful for most cases) or idioms which can be expressed
> (sometimes with a minor perf penalty) using the existing API.
>
> While all of them have non-zero value (hence the effort to contribute, and
> gladly welcomed!) they are extending the API in nontrivial ways and have a
> maintenance cost ... and we already have a pending effort to clean up our
> interfaces prior to 1.0.
>
> I believe there is a need to keep the exposed API succinct, expressive and
> functional in Spark, while at the same time encouraging extensions and
> specialization within the Spark codebase so that other users can benefit
> from the shared contributions.
>
> One approach could be to start something akin to piggybank in Pig to
> contribute user-generated specializations, helper utils, etc.: bundled as
> part of Spark, but not part of core itself.
>
> Thoughts, comments?
>
> Regards,
> Mridul
Re: [DISCUSS] Extending public API
I think SPARK-1063 (PR-503) "Add .sortBy(f) method on RDD" would be a good example. Note that I'm not saying that this PR is already qualified to be accepted; just take it as an example:

JIRA issue: https://spark-project.atlassian.net/browse/SPARK-1063
GitHub PR: https://github.com/apache/incubator-spark/pull/508

On Feb 23, 2014, at 2:23 PM, Amandeep Khurana wrote:

> Mridul,
>
> Can you give examples of APIs that people have contributed (or wanted
> to contribute) but you categorize as something that would go into
> piggybank-like (sparkbank)? Curious to know how you'd decide what
> should go where.
>
> Amandeep
>
>> On Feb 22, 2014, at 10:06 PM, Mridul Muralidharan wrote:
>>
>> Hi,
>>
>> Over the past few months, I have seen a bunch of pull requests which have
>> extended the Spark API ... most commonly RDD itself.
>>
>> Most of them are either relatively niche cases of specialization (which
>> might not be useful for most cases) or idioms which can be expressed
>> (sometimes with a minor perf penalty) using the existing API.
>>
>> While all of them have non-zero value (hence the effort to contribute, and
>> gladly welcomed!) they are extending the API in nontrivial ways and have a
>> maintenance cost ... and we already have a pending effort to clean up our
>> interfaces prior to 1.0.
>>
>> I believe there is a need to keep the exposed API succinct, expressive and
>> functional in Spark, while at the same time encouraging extensions and
>> specialization within the Spark codebase so that other users can benefit
>> from the shared contributions.
>>
>> One approach could be to start something akin to piggybank in Pig to
>> contribute user-generated specializations, helper utils, etc.: bundled as
>> part of Spark, but not part of core itself.
>>
>> Thoughts, comments?
>>
>> Regards,
>> Mridul
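Part of why sortBy(f) is a judgment call is that it is expressible with the existing API as keyBy(f) followed by sortByKey and dropping the key. A sketch of that equivalence, demonstrated on a plain Scala List rather than an actual RDD (the helper below only mirrors the proposed signature; it is not the PR's implementation):

```scala
// Sketch: sortBy(f) in terms of keyBy/sortByKey-style primitives.
// On an RDD this idiom would read: rdd.keyBy(f).sortByKey().map(_._2)
object SortByDemo {
  def sortBy[T, K: Ordering](xs: List[T])(f: T => K): List[T] =
    xs.map(x => (f(x), x))   // keyBy(f): pair each element with its sort key
      .sortBy(_._1)          // sortByKey: order by the computed key
      .map(_._2)             // drop the key, keeping the original elements

  def main(args: Array[String]): Unit = {
    val words = List("spark", "rdd", "api")
    // Stable sort by length: List(rdd, api, spark)
    println(sortBy(words)(_.length))
  }
}
```

The maintenance question in the thread is exactly whether a three-line idiom like this deserves a dedicated method on the core RDD API.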
Anyone wants to look at SPARK-1123?
Hi, all

I found a weird thing with saveAsNewAPIHadoopFile in PairRDDFunctions.scala while working on another issue: saveAsNewAPIHadoopFile throws java.lang.InstantiationException every time. I checked the commit history of the file, and it seems the API has existed for a long time. Has no one else hit this? (That's why I'm confused.)

Best,
--
Nan Zhu
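As a general note (not a diagnosis of SPARK-1123): java.lang.InstantiationException is what Class.newInstance throws when the target class has no accessible no-arg constructor, a common failure mode when Hadoop OutputFormat classes are created reflectively. A self-contained sketch of the mechanism, using a hypothetical class:

```scala
// Hypothetical class whose only constructor takes an argument, so it has
// no nullary constructor; reflective Class.newInstance on it must fail
// with InstantiationException.
class NeedsArg(val x: Int)

object InstantiationDemo {
  def main(args: Array[String]): Unit = {
    val failed =
      try { classOf[NeedsArg].newInstance(); false }
      catch { case _: InstantiationException => true }
    println(s"InstantiationException thrown: $failed") // prints: true
  }
}
```

If the OutputFormat implementation (or something it wraps) is instantiated this way without a no-arg constructor, the call site would fail exactly as described.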
[GitHub] incubator-spark pull request: [java8API] SPARK-964 Investigate the...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/539#issuecomment-35826379 Build finished.
[GitHub] incubator-spark pull request: [SPARK-1102] Create a saveAsNewAPIHa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/636#issuecomment-35826394 Can one of the admins verify this patch?
[GitHub] incubator-spark pull request: [java8API] SPARK-964 Investigate the...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/539#issuecomment-35826380 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12817/