Re: Spark 1.0.0 rc3
That suggestion got lost along the way and IIRC the patch didn't have that. It's a good idea though, if nothing else to provide a simple means for backwards compatibility. I created a JIRA for this. It's very straightforward so maybe someone can pick it up quickly: https://issues.apache.org/jira/browse/SPARK-1677 On Tue, Apr 29, 2014 at 2:20 PM, Dean Wampler wrote: > Thanks. I'm fine with the logic change, although I was a bit surprised to > see Hadoop used for file I/O. > > Anyway, the jira issue and pull request discussions mention a flag to > enable overwrites. That would be very convenient for a tutorial I'm > writing, although I wouldn't recommend it for normal use, of course. > However, I can't figure out if this actually exists. I found the > spark.files.overwrite property, but that doesn't apply. Does this override > flag, method call, or method argument actually exist? > > Thanks, > Dean > > > On Tue, Apr 29, 2014 at 1:54 PM, Patrick Wendell wrote: > >> Hi Dean, >> >> We always used the Hadoop libraries here to read and write local >> files. In Spark 1.0 we started enforcing the rule that you can't >> over-write an existing directory because it can cause >> confusing/undefined behavior if multiple jobs output to the directory >> (they partially clobber each other's output). >> >> https://issues.apache.org/jira/browse/SPARK-1100 >> https://github.com/apache/spark/pull/11 >> >> In the JIRA I actually proposed slightly deviating from Hadoop >> semantics and allowing the directory to exist if it is empty, but I >> think in the end we decided to just go with the exact same semantics >> as Hadoop (i.e. empty directories are a problem). >> >> - Patrick >> >> On Tue, Apr 29, 2014 at 9:43 AM, Dean Wampler >> wrote: >> > I'm observing one anomalous behavior. With the 1.0.0 libraries, it's >> using >> > HDFS classes for file I/O, while the same script compiled and running >> with >> > 0.9.1 uses only the local-mode File IO. >> > >> > The script is a variation of the Word Count script. Here are the "guts": >> > >> > object WordCount2 { >> > def main(args: Array[String]) = { >> > >> > val sc = new SparkContext("local", "Word Count (2)") >> > >> > val input = sc.textFile(".../some/local/file").map(line => >> > line.toLowerCase) >> > input.cache >> > >> > val wc2 = input >> > .flatMap(line => line.split("""\W+""")) >> > .map(word => (word, 1)) >> > .reduceByKey((count1, count2) => count1 + count2) >> > >> > wc2.saveAsTextFile("output/some/directory") >> > >> > sc.stop() >> > >> > It works fine compiled and executed with 0.9.1. If I recompile and run >> with >> > 1.0.0-RC1, where the same output directory still exists, I get this >> > familiar Hadoop-ish exception: >> > >> > [error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException: >> > Output directory >> > >> file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc >> > already exists >> > org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory >> > >> file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc >> > already exists >> > at >> > >> org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121) >> > at >> > >> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749) >> > at >> > >> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662) >> > at >> > >> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581) >> > at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057) >> > at spark.activator.WordCount2$.main(WordCount2.scala:42) >> > at spark.activator.WordCount2.main(WordCount2.scala) >> > ... >> > >> > Thoughts? >> > >> > >> > On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell >> wrote: >> > >> >> Hey All, >> >> >> >> This is not an official vote, but I wanted to cut an RC so that people >> can >> >> test against the Maven artifacts, test building with their >> configuration, >> >> etc. We are still chasing down a few issues and updating docs, etc. >> >> >> >> If you have issues or bug reports for this release, please send an >> e-mail >> >> to the Spark dev list and/or file a JIRA. >> >> >> >> Commit: d636772 (v1.0.0-rc3) >> >> >> >> >> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221 >> >> >> >> Binaries: >> >> http://people.apache.org/~pwendell/spark-1.0.0-rc3/ >> >> >> >> Docs: >> >> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/ >> >> >> >> Repository: >> >> https://repository.apache.org/content/repositories/orgapachespark-1012/ >> >> >> >> == API Changes == >> >> If you want to test building against Spark there are some minor API >> >> changes. We'll get these written up for the final release but I'm >> noting a >> >> few here (not comprehensive): >> >> >> >> changes to ML vector specification: >> >> >> >> >> http://people.apache.org/~pwende
Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result
Yeah, the approximation of hssian in LBFGS isn't stateless, and it does depend on previous LBFGS step as Xiangrui also pointed out. It's surprising that it works without error message. I also saw the loss is fluctuating like SGD during the training. We will remove the miniBatch mode in LBFGS in Spark before we've deeper understanding of how "stochastic" LBFGS works. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Tue, Apr 29, 2014 at 9:50 PM, David Hall wrote: > Yeah, that's probably the easiest though obviously pretty hacky. > > I'm surprised that the hessian approximation isn't worse than it is. (As > in, I'd expect error messages.) It's obviously line searching much more, so > the approximation must be worse. You might be interested in this online > bfgs: > http://jmlr.org/proceedings/papers/v2/schraudolph07a/schraudolph07a.pdf > > -- David > > > On Tue, Apr 29, 2014 at 3:30 PM, DB Tsai wrote: > >> Have a quick hack to understand the behavior of SLBFGS >> (Stochastic-LBFGS) by overwriting the breeze iterations method to get the >> current LBFGS step to ensure that the objective function is the same during >> the line search step. David, the following is my code, have a better way to >> inject into it? >> >> https://github.com/dbtsai/spark/tree/dbtsai-lbfgshack >> >> Couple findings, >> >> 1) miniBatch (using rdd sample api) for each iteration is slower than >> full data training when the full data is cached. Probably because sample is >> not efficiency in Spark. >> >> 2) Since in the line search steps, we use the same sample of data (the >> same objective function), the SLBFGS actually converges well. >> >> 3) For news20 dataset, with 0.05 miniBatch size, it takes 14 SLBFGS steps >> (29 data iterations, 74.5seconds) to converge to loss < 0.10. For LBFGS >> with full data training, it takes 9 LBFGS steps (12 data iterations, 37.6 >> seconds) to converge to loss < 0.10. >> >> It seems that as long as the noisy gradient happens in different SLBFGS >> steps, it still works. >> >> (ps, I also tried in line search step, I use different sample of data, >> and it just doesn't work as we expect.) >> >> >> >> Sincerely, >> >> DB Tsai >> --- >> My Blog: https://www.dbtsai.com >> LinkedIn: https://www.linkedin.com/in/dbtsai >> >> >> On Mon, Apr 28, 2014 at 8:55 AM, David Hall wrote: >> >>> That's right. >>> >>> FWIW, caching should be automatic now, but it might be the version of >>> Breeze you're using doesn't do that yet. >>> >>> Also, In breeze.util._ there's an implicit that adds a tee method to >>> iterator, and also a last method. Both are useful for things like this. >>> >>> -- David >>> >>> >>> On Sun, Apr 27, 2014 at 11:53 PM, DB Tsai wrote: >>> I think I figure it out. Instead of calling minimize, and record the loss in the DiffFunction, I should do the following. val states = lbfgs.iterations(new CachedDiffFunction(costFun), initialWeights.toBreeze.toDenseVector) states.foreach(state => lossHistory.append(state.value)) All the losses in states should be decreasing now. Am I right? Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sun, Apr 27, 2014 at 11:31 PM, DB Tsai wrote: > Also, how many failure of rejection will terminate the optimization > process? How is it related to "numberOfImprovementFailures"? > > Thanks. > > > Sincerely, > > DB Tsai > --- > My Blog: https://www.dbtsai.com > LinkedIn: https://www.linkedin.com/in/dbtsai > > > On Sun, Apr 27, 2014 at 11:28 PM, DB Tsai wrote: > >> Hi David, >> >> I'm recording the loss history in the DiffFunction implementation, >> and that's why the rejected step is also recorded in my loss history. >> >> Is there any api in Breeze LBFGS to get the history which already >> excludes the reject step? Or should I just call "iterations" method and >> check "iteratingShouldStop" instead? >> >> Thanks. >> >> >> Sincerely, >> >> DB Tsai >> --- >> My Blog: https://www.dbtsai.com >> LinkedIn: https://www.linkedin.com/in/dbtsai >> >> >> On Fri, Apr 25, 2014 at 3:10 PM, David Hall wrote: >> >>> LBFGS will not take a step that sends the objective value up. It >>> might try a step that is "too big" and reject it, so if you're just >>> logging >>> everything that gets tried by LBFGS, you could see that. The >>> "iterations" >>> method of the minimizer should never return an increasing objective >>> value. >>> If yo
Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result
Yeah, that's probably the easiest though obviously pretty hacky. I'm surprised that the hessian approximation isn't worse than it is. (As in, I'd expect error messages.) It's obviously line searching much more, so the approximation must be worse. You might be interested in this online bfgs: http://jmlr.org/proceedings/papers/v2/schraudolph07a/schraudolph07a.pdf -- David On Tue, Apr 29, 2014 at 3:30 PM, DB Tsai wrote: > Have a quick hack to understand the behavior of SLBFGS > (Stochastic-LBFGS) by overwriting the breeze iterations method to get the > current LBFGS step to ensure that the objective function is the same during > the line search step. David, the following is my code, have a better way to > inject into it? > > https://github.com/dbtsai/spark/tree/dbtsai-lbfgshack > > Couple findings, > > 1) miniBatch (using rdd sample api) for each iteration is slower than full > data training when the full data is cached. Probably because sample is not > efficiency in Spark. > > 2) Since in the line search steps, we use the same sample of data (the > same objective function), the SLBFGS actually converges well. > > 3) For news20 dataset, with 0.05 miniBatch size, it takes 14 SLBFGS steps > (29 data iterations, 74.5seconds) to converge to loss < 0.10. For LBFGS > with full data training, it takes 9 LBFGS steps (12 data iterations, 37.6 > seconds) to converge to loss < 0.10. > > It seems that as long as the noisy gradient happens in different SLBFGS > steps, it still works. > > (ps, I also tried in line search step, I use different sample of data, and > it just doesn't work as we expect.) > > > > Sincerely, > > DB Tsai > --- > My Blog: https://www.dbtsai.com > LinkedIn: https://www.linkedin.com/in/dbtsai > > > On Mon, Apr 28, 2014 at 8:55 AM, David Hall wrote: > >> That's right. >> >> FWIW, caching should be automatic now, but it might be the version of >> Breeze you're using doesn't do that yet. >> >> Also, In breeze.util._ there's an implicit that adds a tee method to >> iterator, and also a last method. Both are useful for things like this. >> >> -- David >> >> >> On Sun, Apr 27, 2014 at 11:53 PM, DB Tsai wrote: >> >>> I think I figure it out. Instead of calling minimize, and record the >>> loss in the DiffFunction, I should do the following. >>> >>> val states = lbfgs.iterations(new CachedDiffFunction(costFun), >>> initialWeights.toBreeze.toDenseVector) >>> states.foreach(state => lossHistory.append(state.value)) >>> >>> All the losses in states should be decreasing now. Am I right? >>> >>> >>> >>> Sincerely, >>> >>> DB Tsai >>> --- >>> My Blog: https://www.dbtsai.com >>> LinkedIn: https://www.linkedin.com/in/dbtsai >>> >>> >>> On Sun, Apr 27, 2014 at 11:31 PM, DB Tsai wrote: >>> Also, how many failure of rejection will terminate the optimization process? How is it related to "numberOfImprovementFailures"? Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sun, Apr 27, 2014 at 11:28 PM, DB Tsai wrote: > Hi David, > > I'm recording the loss history in the DiffFunction implementation, and > that's why the rejected step is also recorded in my loss history. > > Is there any api in Breeze LBFGS to get the history which already > excludes the reject step? Or should I just call "iterations" method and > check "iteratingShouldStop" instead? > > Thanks. > > > Sincerely, > > DB Tsai > --- > My Blog: https://www.dbtsai.com > LinkedIn: https://www.linkedin.com/in/dbtsai > > > On Fri, Apr 25, 2014 at 3:10 PM, David Hall wrote: > >> LBFGS will not take a step that sends the objective value up. It >> might try a step that is "too big" and reject it, so if you're just >> logging >> everything that gets tried by LBFGS, you could see that. The "iterations" >> method of the minimizer should never return an increasing objective >> value. >> If you're regularizing, are you including the regularizer in the >> objective >> value computation? >> >> GD is almost never worth your time. >> >> -- David >> >> On Fri, Apr 25, 2014 at 2:57 PM, DB Tsai wrote: >> >>> Another interesting benchmark. >>> >>> *News20 dataset - 0.14M row, 1,355,191 features, 0.034% non-zero >>> elements.* >>> >>> LBFGS converges in 70 seconds, while GD seems to be not progressing. >>> >>> Dense feature vector will be too big to fit in the memory, so only >>> conduct the sparse benchmark. >>> >>> I saw the sometimes the loss bumps up, and it's weird for me. Since >>> the cost function of logistic regression is c
Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result
Have a quick hack to understand the behavior of SLBFGS (Stochastic-LBFGS) by overwriting the breeze iterations method to get the current LBFGS step to ensure that the objective function is the same during the line search step. David, the following is my code, have a better way to inject into it? https://github.com/dbtsai/spark/tree/dbtsai-lbfgshack Couple findings, 1) miniBatch (using rdd sample api) for each iteration is slower than full data training when the full data is cached. Probably because sample is not efficiency in Spark. 2) Since in the line search steps, we use the same sample of data (the same objective function), the SLBFGS actually converges well. 3) For news20 dataset, with 0.05 miniBatch size, it takes 14 SLBFGS steps (29 data iterations, 74.5seconds) to converge to loss < 0.10. For LBFGS with full data training, it takes 9 LBFGS steps (12 data iterations, 37.6 seconds) to converge to loss < 0.10. It seems that as long as the noisy gradient happens in different SLBFGS steps, it still works. (ps, I also tried in line search step, I use different sample of data, and it just doesn't work as we expect.) Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Apr 28, 2014 at 8:55 AM, David Hall wrote: > That's right. > > FWIW, caching should be automatic now, but it might be the version of > Breeze you're using doesn't do that yet. > > Also, In breeze.util._ there's an implicit that adds a tee method to > iterator, and also a last method. Both are useful for things like this. > > -- David > > > On Sun, Apr 27, 2014 at 11:53 PM, DB Tsai wrote: > >> I think I figure it out. Instead of calling minimize, and record the loss >> in the DiffFunction, I should do the following. >> >> val states = lbfgs.iterations(new CachedDiffFunction(costFun), >> initialWeights.toBreeze.toDenseVector) >> states.foreach(state => lossHistory.append(state.value)) >> >> All the losses in states should be decreasing now. Am I right? >> >> >> >> Sincerely, >> >> DB Tsai >> --- >> My Blog: https://www.dbtsai.com >> LinkedIn: https://www.linkedin.com/in/dbtsai >> >> >> On Sun, Apr 27, 2014 at 11:31 PM, DB Tsai wrote: >> >>> Also, how many failure of rejection will terminate the optimization >>> process? How is it related to "numberOfImprovementFailures"? >>> >>> Thanks. >>> >>> >>> Sincerely, >>> >>> DB Tsai >>> --- >>> My Blog: https://www.dbtsai.com >>> LinkedIn: https://www.linkedin.com/in/dbtsai >>> >>> >>> On Sun, Apr 27, 2014 at 11:28 PM, DB Tsai wrote: >>> Hi David, I'm recording the loss history in the DiffFunction implementation, and that's why the rejected step is also recorded in my loss history. Is there any api in Breeze LBFGS to get the history which already excludes the reject step? Or should I just call "iterations" method and check "iteratingShouldStop" instead? Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Apr 25, 2014 at 3:10 PM, David Hall wrote: > LBFGS will not take a step that sends the objective value up. It might > try a step that is "too big" and reject it, so if you're just logging > everything that gets tried by LBFGS, you could see that. The "iterations" > method of the minimizer should never return an increasing objective value. > If you're regularizing, are you including the regularizer in the objective > value computation? > > GD is almost never worth your time. > > -- David > > On Fri, Apr 25, 2014 at 2:57 PM, DB Tsai wrote: > >> Another interesting benchmark. >> >> *News20 dataset - 0.14M row, 1,355,191 features, 0.034% non-zero >> elements.* >> >> LBFGS converges in 70 seconds, while GD seems to be not progressing. >> >> Dense feature vector will be too big to fit in the memory, so only >> conduct the sparse benchmark. >> >> I saw the sometimes the loss bumps up, and it's weird for me. Since >> the cost function of logistic regression is convex, it should be >> monotonically decreasing. David, any suggestion? >> >> The detail figure: >> >> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/news20.pdf >> >> >> *Rcv1 dataset - 6.8M row, 677,399 features, 0.15% non-zero elements.* >> >> LBFGS converges in 25 seconds, while GD also seems to be not >> progressing. >> >> Only conduct sparse benchmark for the same reason. I also saw the >> loss bumps up for unknown reason. >> >> The detail figure: >> >> https://github.com/dbtsai
Re: Spark 1.0.0 rc3
Thanks. I'm fine with the logic change, although I was a bit surprised to see Hadoop used for file I/O. Anyway, the jira issue and pull request discussions mention a flag to enable overwrites. That would be very convenient for a tutorial I'm writing, although I wouldn't recommend it for normal use, of course. However, I can't figure out if this actually exists. I found the spark.files.overwrite property, but that doesn't apply. Does this override flag, method call, or method argument actually exist? Thanks, Dean On Tue, Apr 29, 2014 at 1:54 PM, Patrick Wendell wrote: > Hi Dean, > > We always used the Hadoop libraries here to read and write local > files. In Spark 1.0 we started enforcing the rule that you can't > over-write an existing directory because it can cause > confusing/undefined behavior if multiple jobs output to the directory > (they partially clobber each other's output). > > https://issues.apache.org/jira/browse/SPARK-1100 > https://github.com/apache/spark/pull/11 > > In the JIRA I actually proposed slightly deviating from Hadoop > semantics and allowing the directory to exist if it is empty, but I > think in the end we decided to just go with the exact same semantics > as Hadoop (i.e. empty directories are a problem). > > - Patrick > > On Tue, Apr 29, 2014 at 9:43 AM, Dean Wampler > wrote: > > I'm observing one anomalous behavior. With the 1.0.0 libraries, it's > using > > HDFS classes for file I/O, while the same script compiled and running > with > > 0.9.1 uses only the local-mode File IO. > > > > The script is a variation of the Word Count script. Here are the "guts": > > > > object WordCount2 { > > def main(args: Array[String]) = { > > > > val sc = new SparkContext("local", "Word Count (2)") > > > > val input = sc.textFile(".../some/local/file").map(line => > > line.toLowerCase) > > input.cache > > > > val wc2 = input > > .flatMap(line => line.split("""\W+""")) > > .map(word => (word, 1)) > > .reduceByKey((count1, count2) => count1 + count2) > > > > wc2.saveAsTextFile("output/some/directory") > > > > sc.stop() > > > > It works fine compiled and executed with 0.9.1. If I recompile and run > with > > 1.0.0-RC1, where the same output directory still exists, I get this > > familiar Hadoop-ish exception: > > > > [error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException: > > Output directory > > > file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc > > already exists > > org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory > > > file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc > > already exists > > at > > > org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121) > > at > > > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749) > > at > > > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662) > > at > > > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581) > > at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057) > > at spark.activator.WordCount2$.main(WordCount2.scala:42) > > at spark.activator.WordCount2.main(WordCount2.scala) > > ... > > > > Thoughts? > > > > > > On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell > wrote: > > > >> Hey All, > >> > >> This is not an official vote, but I wanted to cut an RC so that people > can > >> test against the Maven artifacts, test building with their > configuration, > >> etc. We are still chasing down a few issues and updating docs, etc. > >> > >> If you have issues or bug reports for this release, please send an > e-mail > >> to the Spark dev list and/or file a JIRA. > >> > >> Commit: d636772 (v1.0.0-rc3) > >> > >> > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221 > >> > >> Binaries: > >> http://people.apache.org/~pwendell/spark-1.0.0-rc3/ > >> > >> Docs: > >> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/ > >> > >> Repository: > >> https://repository.apache.org/content/repositories/orgapachespark-1012/ > >> > >> == API Changes == > >> If you want to test building against Spark there are some minor API > >> changes. We'll get these written up for the final release but I'm > noting a > >> few here (not comprehensive): > >> > >> changes to ML vector specification: > >> > >> > http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10 > >> > >> changes to the Java API: > >> > >> > http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark > >> > >> coGroup and related functions now return Iterable[T] instead of Seq[T] > >> ==> Call toSeq on the result to restore the old behavior > >> > >> SparkContext.jarOfClass returns Option[String] instead of Seq[String] > >> ==> Call toSeq on the result to restore old behavior > >> > >> S
Re: Spark 1.0.0 rc3
Hi Dean, We always used the Hadoop libraries here to read and write local files. In Spark 1.0 we started enforcing the rule that you can't over-write an existing directory because it can cause confusing/undefined behavior if multiple jobs output to the directory (they partially clobber each other's output). https://issues.apache.org/jira/browse/SPARK-1100 https://github.com/apache/spark/pull/11 In the JIRA I actually proposed slightly deviating from Hadoop semantics and allowing the directory to exist if it is empty, but I think in the end we decided to just go with the exact same semantics as Hadoop (i.e. empty directories are a problem). - Patrick On Tue, Apr 29, 2014 at 9:43 AM, Dean Wampler wrote: > I'm observing one anomalous behavior. With the 1.0.0 libraries, it's using > HDFS classes for file I/O, while the same script compiled and running with > 0.9.1 uses only the local-mode File IO. > > The script is a variation of the Word Count script. Here are the "guts": > > object WordCount2 { > def main(args: Array[String]) = { > > val sc = new SparkContext("local", "Word Count (2)") > > val input = sc.textFile(".../some/local/file").map(line => > line.toLowerCase) > input.cache > > val wc2 = input > .flatMap(line => line.split("""\W+""")) > .map(word => (word, 1)) > .reduceByKey((count1, count2) => count1 + count2) > > wc2.saveAsTextFile("output/some/directory") > > sc.stop() > > It works fine compiled and executed with 0.9.1. If I recompile and run with > 1.0.0-RC1, where the same output directory still exists, I get this > familiar Hadoop-ish exception: > > [error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException: > Output directory > file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc > already exists > org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory > file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc > already exists > at > org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581) > at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057) > at spark.activator.WordCount2$.main(WordCount2.scala:42) > at spark.activator.WordCount2.main(WordCount2.scala) > ... > > Thoughts? > > > On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell wrote: > >> Hey All, >> >> This is not an official vote, but I wanted to cut an RC so that people can >> test against the Maven artifacts, test building with their configuration, >> etc. We are still chasing down a few issues and updating docs, etc. >> >> If you have issues or bug reports for this release, please send an e-mail >> to the Spark dev list and/or file a JIRA. >> >> Commit: d636772 (v1.0.0-rc3) >> >> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221 >> >> Binaries: >> http://people.apache.org/~pwendell/spark-1.0.0-rc3/ >> >> Docs: >> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/ >> >> Repository: >> https://repository.apache.org/content/repositories/orgapachespark-1012/ >> >> == API Changes == >> If you want to test building against Spark there are some minor API >> changes. We'll get these written up for the final release but I'm noting a >> few here (not comprehensive): >> >> changes to ML vector specification: >> >> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10 >> >> changes to the Java API: >> >> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark >> >> coGroup and related functions now return Iterable[T] instead of Seq[T] >> ==> Call toSeq on the result to restore the old behavior >> >> SparkContext.jarOfClass returns Option[String] instead of Seq[String] >> ==> Call toSeq on the result to restore old behavior >> >> Streaming classes have been renamed: >> NetworkReceiver -> Receiver >> > > > > -- > Dean Wampler, Ph.D. > Typesafe > @deanwampler > http://typesafe.com > http://polyglotprogramming.com
Re: Spark 1.0.0 rc3
Sorry got cut off. For 0.9.0 and 1.0.0 they are not binary compatible and in a few cases not source compatible. 1.X will be source compatible. We are also planning to support binary compatibility in 1.X but I'm waiting util we make a few releases to officially promise that, since Scala makes this pretty tricky. On Tue, Apr 29, 2014 at 11:47 AM, Patrick Wendell wrote: >> What are the expectations / guarantees on binary compatibility between >> 0.9 and 1.0? > > There are not guarantees.
Re: Spark 1.0.0 rc3
> What are the expectations / guarantees on binary compatibility between > 0.9 and 1.0? There are not guarantees.
Re: Spark 1.0.0 rc3
Hi Patrick, What are the expectations / guarantees on binary compatibility between 0.9 and 1.0? You mention some API changes, which kinda hint that binary compatibility has already been broken, but just wanted to point out there are other cases. e.g.: Exception in thread "main" java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:236) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:47) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.NoSuchMethodError: org.apache.spark.SparkContext$.rddToOrderedRDDFunctions(Lorg/apache/spark/rdd/RDD;Lscala/Function1;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;)Lorg/apache/spark/rdd/OrderedRDDFunctions; (Compiled against 0.9, run against 1.0.) Offending code: val top10 = counts.sortByKey(false).take(10) Recompiling fixes the problem. On Tue, Apr 29, 2014 at 1:05 AM, Patrick Wendell wrote: > Hey All, > > This is not an official vote, but I wanted to cut an RC so that people can > test against the Maven artifacts, test building with their configuration, > etc. We are still chasing down a few issues and updating docs, etc. > > If you have issues or bug reports for this release, please send an e-mail > to the Spark dev list and/or file a JIRA. > > Commit: d636772 (v1.0.0-rc3) > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221 > > Binaries: > http://people.apache.org/~pwendell/spark-1.0.0-rc3/ > > Docs: > http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/ > > Repository: > https://repository.apache.org/content/repositories/orgapachespark-1012/ > > == API Changes == > If you want to test building against Spark there are some minor API > changes. We'll get these written up for the final release but I'm noting a > few here (not comprehensive): > > changes to ML vector specification: > http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10 > > changes to the Java API: > http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark > > coGroup and related functions now return Iterable[T] instead of Seq[T] > ==> Call toSeq on the result to restore the old behavior > > SparkContext.jarOfClass returns Option[String] instead of Seq[String] > ==> Call toSeq on the result to restore old behavior > > Streaming classes have been renamed: > NetworkReceiver -> Receiver -- Marcelo
Code Review for SPARK-1516: Throw exception in yarn client instead of System.exit
Hi All, Since we're launching Spark Yarn Job in our tomcat application, the default behavior of calling System.exit when job is finished or runs into any error isn't desirable. We create this PR https://github.com/apache/spark/pull/490 to address this issue. Since the logical is fairly straightforward, we wonder if this can be reviewed and have this in 1.0. Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai
Re: Spark 1.0.0 rc3
I'm observing one anomalous behavior. With the 1.0.0 libraries, it's using HDFS classes for file I/O, while the same script compiled and running with 0.9.1 uses only the local-mode File IO. The script is a variation of the Word Count script. Here are the "guts": object WordCount2 { def main(args: Array[String]) = { val sc = new SparkContext("local", "Word Count (2)") val input = sc.textFile(".../some/local/file").map(line => line.toLowerCase) input.cache val wc2 = input .flatMap(line => line.split("""\W+""")) .map(word => (word, 1)) .reduceByKey((count1, count2) => count1 + count2) wc2.saveAsTextFile("output/some/directory") sc.stop() It works fine compiled and executed with 0.9.1. If I recompile and run with 1.0.0-RC1, where the same output directory still exists, I get this familiar Hadoop-ish exception: [error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc already exists org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc already exists at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057) at spark.activator.WordCount2$.main(WordCount2.scala:42) at spark.activator.WordCount2.main(WordCount2.scala) ... Thoughts? On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell wrote: > Hey All, > > This is not an official vote, but I wanted to cut an RC so that people can > test against the Maven artifacts, test building with their configuration, > etc. We are still chasing down a few issues and updating docs, etc. > > If you have issues or bug reports for this release, please send an e-mail > to the Spark dev list and/or file a JIRA. > > Commit: d636772 (v1.0.0-rc3) > > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221 > > Binaries: > http://people.apache.org/~pwendell/spark-1.0.0-rc3/ > > Docs: > http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/ > > Repository: > https://repository.apache.org/content/repositories/orgapachespark-1012/ > > == API Changes == > If you want to test building against Spark there are some minor API > changes. We'll get these written up for the final release but I'm noting a > few here (not comprehensive): > > changes to ML vector specification: > > http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10 > > changes to the Java API: > > http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark > > coGroup and related functions now return Iterable[T] instead of Seq[T] > ==> Call toSeq on the result to restore the old behavior > > SparkContext.jarOfClass returns Option[String] instead of Seq[String] > ==> Call toSeq on the result to restore old behavior > > Streaming classes have been renamed: > NetworkReceiver -> Receiver > -- Dean Wampler, Ph.D. Typesafe @deanwampler http://typesafe.com http://polyglotprogramming.com
Spark 1.0.0 rc3
Hey All, This is not an official vote, but I wanted to cut an RC so that people can test against the Maven artifacts, test building with their configuration, etc. We are still chasing down a few issues and updating docs, etc. If you have issues or bug reports for this release, please send an e-mail to the Spark dev list and/or file a JIRA. Commit: d636772 (v1.0.0-rc3) https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221 Binaries: http://people.apache.org/~pwendell/spark-1.0.0-rc3/ Docs: http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/ Repository: https://repository.apache.org/content/repositories/orgapachespark-1012/ == API Changes == If you want to test building against Spark there are some minor API changes. We'll get these written up for the final release but I'm noting a few here (not comprehensive): changes to ML vector specification: http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10 changes to the Java API: http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark coGroup and related functions now return Iterable[T] instead of Seq[T] ==> Call toSeq on the result to restore the old behavior SparkContext.jarOfClass returns Option[String] instead of Seq[String] ==> Call toSeq on the result to restore old behavior Streaming classes have been renamed: NetworkReceiver -> Receiver