Re: Spark 1.0.0 rc3

2014-04-29 Thread Patrick Wendell
That suggestion got lost along the way, and IIRC the patch didn't include
it. It's a good idea though, if nothing else to provide a simple
means of backwards compatibility.

I created a JIRA for this. It's very straightforward so maybe someone
can pick it up quickly:
https://issues.apache.org/jira/browse/SPARK-1677
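
The escape hatch under discussion could look something like the following plain-Scala sketch. To be clear, the flag and helper names here are hypothetical illustrations, not the actual Spark API or configuration: when an overwrite option is set, delete the stale output directory before writing, restoring the pre-1.0 behavior.

```scala
import java.nio.file.{Files, Path, Paths}
import java.util.Comparator

// Hypothetical sketch of the proposed backwards-compatibility flag; the
// names are illustrative and are not Spark's actual configuration or API.
def prepareOutputDir(path: String, allowOverwrite: Boolean): Boolean = {
  val dir = Paths.get(path)
  if (Files.exists(dir)) {
    if (!allowOverwrite) return false        // mirror the 1.0 hard failure
    // recursively delete the stale output, deepest entries first
    val walk = Files.walk(dir)
    try walk.sorted(Comparator.reverseOrder[Path]()).forEach(p => Files.delete(p))
    finally walk.close()
  }
  Files.createDirectories(dir)
  true
}
```

With the flag off this mimics the new 1.0 failure mode; with it on, a job can clobber its previous output, which is convenient for tutorials but risky when multiple jobs share a directory.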


On Tue, Apr 29, 2014 at 2:20 PM, Dean Wampler  wrote:
> Thanks. I'm fine with the logic change, although I was a bit surprised to
> see Hadoop used for file I/O.
>
> Anyway, the jira issue and pull request discussions mention a flag to
> enable overwrites. That would be very convenient for a tutorial I'm
> writing, although I wouldn't recommend it for normal use, of course.
> However, I can't figure out if this actually exists. I found the
> spark.files.overwrite property, but that doesn't apply.  Does this override
> flag, method call, or method argument actually exist?
>
> Thanks,
> Dean
>
>
> On Tue, Apr 29, 2014 at 1:54 PM, Patrick Wendell  wrote:
>
>> Hi Dean,
>>
>> We always used the Hadoop libraries here to read and write local
>> files. In Spark 1.0 we started enforcing the rule that you can't
>> over-write an existing directory because it can cause
>> confusing/undefined behavior if multiple jobs output to the directory
>> (they partially clobber each other's output).
>>
>> https://issues.apache.org/jira/browse/SPARK-1100
>> https://github.com/apache/spark/pull/11
>>
>> In the JIRA I actually proposed slightly deviating from Hadoop
>> semantics and allowing the directory to exist if it is empty, but I
>> think in the end we decided to just go with the exact same semantics
>> as Hadoop (i.e. empty directories are a problem).
>>
>> - Patrick
>>
>> On Tue, Apr 29, 2014 at 9:43 AM, Dean Wampler 
>> wrote:
>> > I'm observing one anomalous behavior. With the 1.0.0 libraries, it's
>> using
>> > HDFS classes for file I/O, while the same script compiled and running
>> with
>> > 0.9.1 uses only the local-mode File IO.
>> >
>> > The script is a variation of the Word Count script. Here are the "guts":
>> >
>> > object WordCount2 {
>> >   def main(args: Array[String]) = {
>> >
>> > val sc = new SparkContext("local", "Word Count (2)")
>> >
>> > val input = sc.textFile(".../some/local/file").map(line =>
>> > line.toLowerCase)
>> > input.cache
>> >
>> > val wc2 = input
>> >   .flatMap(line => line.split("""\W+"""))
>> >   .map(word => (word, 1))
>> >   .reduceByKey((count1, count2) => count1 + count2)
>> >
>> > wc2.saveAsTextFile("output/some/directory")
>> >
>> > sc.stop()
>> >
>> > It works fine compiled and executed with 0.9.1. If I recompile and run
>> with
>> > 1.0.0-RC1, where the same output directory still exists, I get this
>> > familiar Hadoop-ish exception:
>> >
>> > [error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException:
>> > Output directory
>> >
>> file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
>> > already exists
>> > org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
>> >
>> file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
>> > already exists
>> >  at
>> >
>> org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
>> > at
>> >
>> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749)
>> >  at
>> >
>> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662)
>> > at
>> >
>> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581)
>> >  at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057)
>> > at spark.activator.WordCount2$.main(WordCount2.scala:42)
>> >  at spark.activator.WordCount2.main(WordCount2.scala)
>> > ...
>> >
>> > Thoughts?
>> >
>> >
>> > On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell 
>> wrote:
>> >
>> >> Hey All,
>> >>
>> >> This is not an official vote, but I wanted to cut an RC so that people
>> can
>> >> test against the Maven artifacts, test building with their
>> configuration,
>> >> etc. We are still chasing down a few issues and updating docs, etc.
>> >>
>> >> If you have issues or bug reports for this release, please send an
>> e-mail
>> >> to the Spark dev list and/or file a JIRA.
>> >>
>> >> Commit: d636772 (v1.0.0-rc3)
>> >>
>> >>
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221
>> >>
>> >> Binaries:
>> >> http://people.apache.org/~pwendell/spark-1.0.0-rc3/
>> >>
>> >> Docs:
>> >> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/
>> >>
>> >> Repository:
>> >> https://repository.apache.org/content/repositories/orgapachespark-1012/
>> >>
>> >> == API Changes ==
>> >> If you want to test building against Spark there are some minor API
>> >> changes. We'll get these written up for the final release but I'm
>> noting a
>> >> few here (not comprehensive):
>> >>
>> >> changes to ML vector specification:
>> >>
>> >>
>> http://people.apache.org/~pwende

Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-29 Thread DB Tsai
Yeah, the approximation of the Hessian in LBFGS isn't stateless; it does
depend on the previous LBFGS step, as Xiangrui also pointed out. It's
surprising that it works without an error message. I also saw the loss
fluctuating like SGD during training.

We will remove the miniBatch mode from LBFGS in Spark until we have a deeper
understanding of how "stochastic" LBFGS works.
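
Why a mini-batched objective undermines line search can be seen with a tiny standalone sketch (plain Scala, nothing to do with Breeze or Spark; the loss function is made up for illustration): if each evaluation of the objective draws a fresh mini batch, two evaluations at the same weight disagree, so a line search comparing them is measuring sampling noise rather than descent.

```scala
import scala.util.Random

// Illustrative only: a deterministic full-data objective vs. a mini-batched
// one that resamples on every call. Fitting f(x) = w * x to the target 2 * x.
val data = Seq.tabulate(1000)(i => i.toDouble / 1000)

// full-data squared loss: same input always gives the same value
def fullLoss(w: Double): Double =
  data.map(x => math.pow(w * x - 2 * x, 2)).sum / data.size

// ~5% mini batch drawn per call: the objective itself changes between calls
def miniBatchLoss(w: Double, rng: Random): Double = {
  val sample = data.filter(_ => rng.nextDouble() < 0.05)
  sample.map(x => math.pow(w * x - 2 * x, 2)).sum / math.max(sample.size, 1)
}
```

Reusing one fixed sample for all evaluations within a single line search (as in the hack above) keeps the objective consistent inside each step, which is why SLBFGS still converges there.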


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Tue, Apr 29, 2014 at 9:50 PM, David Hall  wrote:

> Yeah, that's probably the easiest though obviously pretty hacky.
>
> I'm surprised that the Hessian approximation isn't worse than it is. (As
> in, I'd expect error messages.) It's obviously line searching much more, so
> the approximation must be worse. You might be interested in this online
> BFGS:
> http://jmlr.org/proceedings/papers/v2/schraudolph07a/schraudolph07a.pdf
>
> -- David
>
>
> On Tue, Apr 29, 2014 at 3:30 PM, DB Tsai  wrote:
>
>> I have a quick hack to understand the behavior of SLBFGS
>> (stochastic LBFGS): overriding the Breeze iterations method to get the
>> current LBFGS step, ensuring that the objective function is the same
>> during the line search step. David, the following is my code; is there a
>> better way to inject into it?
>>
>> https://github.com/dbtsai/spark/tree/dbtsai-lbfgshack
>>
>> A couple of findings:
>>
>> 1) miniBatch (using the RDD sample API) for each iteration is slower than
>> full-data training when the full data is cached, probably because sampling
>> is not efficient in Spark.
>>
>> 2) Since the line search steps use the same sample of data (the
>> same objective function), SLBFGS actually converges well.
>>
>> 3) For the news20 dataset, with a 0.05 miniBatch size, it takes 14 SLBFGS
>> steps (29 data iterations, 74.5 seconds) to converge to loss < 0.10. For
>> LBFGS with full-data training, it takes 9 LBFGS steps (12 data iterations,
>> 37.6 seconds) to converge to loss < 0.10.
>>
>> It seems that as long as the noisy gradient happens in different SLBFGS
>> steps, it still works.
>>
>> (P.S. I also tried using a different sample of data in the line search
>> step, and it just doesn't work as we'd expect.)
>>
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> ---
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Mon, Apr 28, 2014 at 8:55 AM, David Hall  wrote:
>>
>>> That's right.
>>>
>>> FWIW, caching should be automatic now, but it might be the version of
>>> Breeze you're using doesn't do that yet.
>>>
>>> Also, in breeze.util._ there's an implicit that adds a tee method to
>>> iterator, and also a last method. Both are useful for things like this.
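
The tee pattern mentioned here can be sketched in plain Scala (Breeze ships its own version via an implicit in breeze.util._; this standalone helper only demonstrates the idea, it is not the Breeze implementation):

```scala
// Standalone illustration of an iterator "tee": run a side effect on each
// element as it streams past, without materializing the iterator.
def tee[A](it: Iterator[A])(f: A => Unit): Iterator[A] =
  it.map { a => f(a); a }

val seen = scala.collection.mutable.ArrayBuffer.empty[Int]
// consume the iterator while recording every element, keeping the final one
// (analogous to using a `last`-style helper on optimizer states)
val lastValue = tee(Iterator(1, 2, 3))(seen += _).reduceLeft((_, b) => b)
```

This fits the loss-history case exactly: record every state as it streams by while still ending up with the final state.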
>>>
>>> -- David
>>>
>>>
>>> On Sun, Apr 27, 2014 at 11:53 PM, DB Tsai  wrote:
>>>
 I think I figured it out. Instead of calling minimize and recording the
 loss in the DiffFunction, I should do the following.

 val states = lbfgs.iterations(new CachedDiffFunction(costFun),
 initialWeights.toBreeze.toDenseVector)
 states.foreach(state => lossHistory.append(state.value))

 All the losses in states should be decreasing now. Am I right?
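
The contrast here, losses recorded inside the objective versus losses read off accepted optimizer states, can be illustrated with a toy hand-rolled backtracking descent (plain Scala, not Breeze; the function and step rule are made up for illustration). Logging inside the objective captures every trial point the line search evaluates, including rejected ones, while logging once per accepted step yields the non-increasing history you'd expect:

```scala
import scala.collection.mutable.ArrayBuffer

// every call to the objective is logged, including rejected line-search trials
val everyEval = ArrayBuffer.empty[Double]
def f(w: Double): Double = { val v = (w - 3) * (w - 3); everyEval += v; v }

// logged once per accepted step only
val acceptedLoss = ArrayBuffer.empty[Double]
var w = 10.0
for (_ <- 0 until 5) {
  val g = 2 * (w - 3)                // gradient of (w - 3)^2
  var step = 1.0
  // crude backtracking line search: rejected trials still land in everyEval
  while (step > 1e-12 && f(w - step * g) >= f(w)) step /= 2
  if (step > 1e-12) w -= step * g
  acceptedLoss += f(w)
}
```

`acceptedLoss` is non-increasing, while `everyEval` also contains the larger values from rejected trials — which is why a history collected inside the DiffFunction can appear to fluctuate.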



 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Sun, Apr 27, 2014 at 11:31 PM, DB Tsai  wrote:

> Also, how many rejected steps will it take to terminate the optimization
> process? How is it related to "numberOfImprovementFailures"?
>
> Thanks.
>
>
> Sincerely,
>
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Sun, Apr 27, 2014 at 11:28 PM, DB Tsai  wrote:
>
>> Hi David,
>>
>> I'm recording the loss history in the DiffFunction implementation,
>> and that's why the rejected step is also recorded in my loss history.
>>
>> Is there any api in Breeze LBFGS to get the history which already
>> excludes the reject step? Or should I just call "iterations" method and
>> check "iteratingShouldStop" instead?
>>
>> Thanks.
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> ---
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Fri, Apr 25, 2014 at 3:10 PM, David Hall wrote:
>>
>>> LBFGS will not take a step that sends the objective value up. It
>>> might try a step that is "too big" and reject it, so if you're just 
>>> logging
>>> everything that gets tried by LBFGS, you could see that. The 
>>> "iterations"
>>> method of the minimizer should never return an increasing objective 
>>> value.
>>> If yo

Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-29 Thread David Hall
Yeah, that's probably the easiest though obviously pretty hacky.

I'm surprised that the Hessian approximation isn't worse than it is. (As
in, I'd expect error messages.) It's obviously line searching much more, so
the approximation must be worse. You might be interested in this online
BFGS:
http://jmlr.org/proceedings/papers/v2/schraudolph07a/schraudolph07a.pdf

-- David


On Tue, Apr 29, 2014 at 3:30 PM, DB Tsai  wrote:

> I have a quick hack to understand the behavior of SLBFGS
> (stochastic LBFGS): overriding the Breeze iterations method to get the
> current LBFGS step, ensuring that the objective function is the same during
> the line search step. David, the following is my code; is there a better
> way to inject into it?
>
> https://github.com/dbtsai/spark/tree/dbtsai-lbfgshack
>
> A couple of findings:
>
> 1) miniBatch (using the RDD sample API) for each iteration is slower than
> full-data training when the full data is cached, probably because sampling
> is not efficient in Spark.
>
> 2) Since the line search steps use the same sample of data (the
> same objective function), SLBFGS actually converges well.
>
> 3) For the news20 dataset, with a 0.05 miniBatch size, it takes 14 SLBFGS
> steps (29 data iterations, 74.5 seconds) to converge to loss < 0.10. For
> LBFGS with full-data training, it takes 9 LBFGS steps (12 data iterations,
> 37.6 seconds) to converge to loss < 0.10.
>
> It seems that as long as the noisy gradient happens in different SLBFGS
> steps, it still works.
>
> (P.S. I also tried using a different sample of data in the line search
> step, and it just doesn't work as we'd expect.)
>
>
>
> Sincerely,
>
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Mon, Apr 28, 2014 at 8:55 AM, David Hall  wrote:
>
>> That's right.
>>
>> FWIW, caching should be automatic now, but it might be the version of
>> Breeze you're using doesn't do that yet.
>>
>> Also, in breeze.util._ there's an implicit that adds a tee method to
>> iterator, and also a last method. Both are useful for things like this.
>>
>> -- David
>>
>>
>> On Sun, Apr 27, 2014 at 11:53 PM, DB Tsai  wrote:
>>
>>> I think I figured it out. Instead of calling minimize and recording the
>>> loss in the DiffFunction, I should do the following.
>>>
>>> val states = lbfgs.iterations(new CachedDiffFunction(costFun),
>>> initialWeights.toBreeze.toDenseVector)
>>> states.foreach(state => lossHistory.append(state.value))
>>>
>>> All the losses in states should be decreasing now. Am I right?
>>>
>>>
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> ---
>>> My Blog: https://www.dbtsai.com
>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>
>>>
>>> On Sun, Apr 27, 2014 at 11:31 PM, DB Tsai  wrote:
>>>
 Also, how many rejected steps will it take to terminate the optimization
 process? How is it related to "numberOfImprovementFailures"?

 Thanks.


 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Sun, Apr 27, 2014 at 11:28 PM, DB Tsai  wrote:

> Hi David,
>
> I'm recording the loss history in the DiffFunction implementation, and
> that's why the rejected step is also recorded in my loss history.
>
> Is there any api in Breeze LBFGS to get the history which already
> excludes the reject step? Or should I just call "iterations" method and
> check "iteratingShouldStop" instead?
>
> Thanks.
>
>
> Sincerely,
>
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Fri, Apr 25, 2014 at 3:10 PM, David Hall wrote:
>
>> LBFGS will not take a step that sends the objective value up. It
>> might try a step that is "too big" and reject it, so if you're just 
>> logging
>> everything that gets tried by LBFGS, you could see that. The "iterations"
>> method of the minimizer should never return an increasing objective 
>> value.
>> If you're regularizing, are you including the regularizer in the 
>> objective
>> value computation?
>>
>> GD is almost never worth your time.
>>
>> -- David
>>
>> On Fri, Apr 25, 2014 at 2:57 PM, DB Tsai  wrote:
>>
>>> Another interesting benchmark.
>>>
>>> *News20 dataset - 0.14M row, 1,355,191 features, 0.034% non-zero
>>> elements.*
>>>
>>> LBFGS converges in 70 seconds, while GD seems not to be progressing.
>>>
>>> A dense feature vector would be too big to fit in memory, so I only
>>> conducted the sparse benchmark.
>>>
>>> I saw that sometimes the loss bumps up, which is weird to me. Since
>>> the cost function of logistic regression is c

Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-29 Thread DB Tsai
I have a quick hack to understand the behavior of SLBFGS
(stochastic LBFGS): overriding the Breeze iterations method to get the
current LBFGS step, ensuring that the objective function is the same during
the line search step. David, the following is my code; is there a better way
to inject into it?

https://github.com/dbtsai/spark/tree/dbtsai-lbfgshack

A couple of findings:

1) miniBatch (using the RDD sample API) for each iteration is slower than
full-data training when the full data is cached, probably because sampling
is not efficient in Spark.

2) Since the line search steps use the same sample of data (the same
objective function), SLBFGS actually converges well.

3) For the news20 dataset, with a 0.05 miniBatch size, it takes 14 SLBFGS
steps (29 data iterations, 74.5 seconds) to converge to loss < 0.10. For
LBFGS with full-data training, it takes 9 LBFGS steps (12 data iterations,
37.6 seconds) to converge to loss < 0.10.

It seems that as long as the noisy gradient happens in different SLBFGS
steps, it still works.

(P.S. I also tried using a different sample of data in the line search step,
and it just doesn't work as we'd expect.)



Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, Apr 28, 2014 at 8:55 AM, David Hall  wrote:

> That's right.
>
> FWIW, caching should be automatic now, but it might be the version of
> Breeze you're using doesn't do that yet.
>
> Also, in breeze.util._ there's an implicit that adds a tee method to
> iterator, and also a last method. Both are useful for things like this.
>
> -- David
>
>
> On Sun, Apr 27, 2014 at 11:53 PM, DB Tsai  wrote:
>
>> I think I figured it out. Instead of calling minimize and recording the
>> loss in the DiffFunction, I should do the following.
>>
>> val states = lbfgs.iterations(new CachedDiffFunction(costFun),
>> initialWeights.toBreeze.toDenseVector)
>> states.foreach(state => lossHistory.append(state.value))
>>
>> All the losses in states should be decreasing now. Am I right?
>>
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> ---
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Sun, Apr 27, 2014 at 11:31 PM, DB Tsai  wrote:
>>
>>> Also, how many rejected steps will it take to terminate the optimization
>>> process? How is it related to "numberOfImprovementFailures"?
>>>
>>> Thanks.
>>>
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> ---
>>> My Blog: https://www.dbtsai.com
>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>
>>>
>>> On Sun, Apr 27, 2014 at 11:28 PM, DB Tsai  wrote:
>>>
 Hi David,

 I'm recording the loss history in the DiffFunction implementation, and
 that's why the rejected step is also recorded in my loss history.

 Is there any api in Breeze LBFGS to get the history which already
 excludes the reject step? Or should I just call "iterations" method and
 check "iteratingShouldStop" instead?

 Thanks.


 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Fri, Apr 25, 2014 at 3:10 PM, David Hall wrote:

> LBFGS will not take a step that sends the objective value up. It might
> try a step that is "too big" and reject it, so if you're just logging
> everything that gets tried by LBFGS, you could see that. The "iterations"
> method of the minimizer should never return an increasing objective value.
> If you're regularizing, are you including the regularizer in the objective
> value computation?
>
> GD is almost never worth your time.
>
> -- David
>
> On Fri, Apr 25, 2014 at 2:57 PM, DB Tsai  wrote:
>
>> Another interesting benchmark.
>>
>> *News20 dataset - 0.14M row, 1,355,191 features, 0.034% non-zero
>> elements.*
>>
>> LBFGS converges in 70 seconds, while GD seems not to be progressing.
>>
>> A dense feature vector would be too big to fit in memory, so I only
>> conducted the sparse benchmark.
>>
>> I saw that sometimes the loss bumps up, which is weird to me. Since
>> the cost function of logistic regression is convex, it should be
>> monotonically decreasing. David, any suggestions?
>>
>> The detail figure:
>>
>> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/news20.pdf
>>
>>
>> *Rcv1 dataset - 6.8M row, 677,399 features, 0.15% non-zero elements.*
>>
>> LBFGS converges in 25 seconds, while GD also seems to be not
>> progressing.
>>
>> Only conduct sparse benchmark for the same reason. I also saw the
>> loss bumps up for unknown reason.
>>
>> The detail figure:
>>
>> https://github.com/dbtsai

Re: Spark 1.0.0 rc3

2014-04-29 Thread Dean Wampler
Thanks. I'm fine with the logic change, although I was a bit surprised to
see Hadoop used for file I/O.

Anyway, the JIRA issue and pull request discussions mention a flag to
enable overwrites. That would be very convenient for a tutorial I'm
writing, although I wouldn't recommend it for normal use, of course.
However, I can't figure out whether this actually exists. I found the
spark.files.overwrite property, but that doesn't apply. Does this override
flag, method call, or method argument actually exist?

Thanks,
Dean


On Tue, Apr 29, 2014 at 1:54 PM, Patrick Wendell  wrote:

> Hi Dean,
>
> We always used the Hadoop libraries here to read and write local
> files. In Spark 1.0 we started enforcing the rule that you can't
> over-write an existing directory because it can cause
> confusing/undefined behavior if multiple jobs output to the directory
> (they partially clobber each other's output).
>
> https://issues.apache.org/jira/browse/SPARK-1100
> https://github.com/apache/spark/pull/11
>
> In the JIRA I actually proposed slightly deviating from Hadoop
> semantics and allowing the directory to exist if it is empty, but I
> think in the end we decided to just go with the exact same semantics
> as Hadoop (i.e. empty directories are a problem).
>
> - Patrick
>
> On Tue, Apr 29, 2014 at 9:43 AM, Dean Wampler 
> wrote:
> > I'm observing one anomalous behavior. With the 1.0.0 libraries, it's
> using
> > HDFS classes for file I/O, while the same script compiled and running
> with
> > 0.9.1 uses only the local-mode File IO.
> >
> > The script is a variation of the Word Count script. Here are the "guts":
> >
> > object WordCount2 {
> >   def main(args: Array[String]) = {
> >
> > val sc = new SparkContext("local", "Word Count (2)")
> >
> > val input = sc.textFile(".../some/local/file").map(line =>
> > line.toLowerCase)
> > input.cache
> >
> > val wc2 = input
> >   .flatMap(line => line.split("""\W+"""))
> >   .map(word => (word, 1))
> >   .reduceByKey((count1, count2) => count1 + count2)
> >
> > wc2.saveAsTextFile("output/some/directory")
> >
> > sc.stop()
> >
> > It works fine compiled and executed with 0.9.1. If I recompile and run
> with
> > 1.0.0-RC1, where the same output directory still exists, I get this
> > familiar Hadoop-ish exception:
> >
> > [error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException:
> > Output directory
> >
> file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
> > already exists
> > org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
> >
> file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
> > already exists
> >  at
> >
> org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
> > at
> >
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749)
> >  at
> >
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662)
> > at
> >
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581)
> >  at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057)
> > at spark.activator.WordCount2$.main(WordCount2.scala:42)
> >  at spark.activator.WordCount2.main(WordCount2.scala)
> > ...
> >
> > Thoughts?
> >
> >
> > On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell 
> wrote:
> >
> >> Hey All,
> >>
> >> This is not an official vote, but I wanted to cut an RC so that people
> can
> >> test against the Maven artifacts, test building with their
> configuration,
> >> etc. We are still chasing down a few issues and updating docs, etc.
> >>
> >> If you have issues or bug reports for this release, please send an
> e-mail
> >> to the Spark dev list and/or file a JIRA.
> >>
> >> Commit: d636772 (v1.0.0-rc3)
> >>
> >>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221
> >>
> >> Binaries:
> >> http://people.apache.org/~pwendell/spark-1.0.0-rc3/
> >>
> >> Docs:
> >> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/
> >>
> >> Repository:
> >> https://repository.apache.org/content/repositories/orgapachespark-1012/
> >>
> >> == API Changes ==
> >> If you want to test building against Spark there are some minor API
> >> changes. We'll get these written up for the final release but I'm
> noting a
> >> few here (not comprehensive):
> >>
> >> changes to ML vector specification:
> >>
> >>
> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10
> >>
> >> changes to the Java API:
> >>
> >>
> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
> >>
> >> coGroup and related functions now return Iterable[T] instead of Seq[T]
> >> ==> Call toSeq on the result to restore the old behavior
> >>
> >> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
> >> ==> Call toSeq on the result to restore old behavior
> >>
> >> S

Re: Spark 1.0.0 rc3

2014-04-29 Thread Patrick Wendell
Hi Dean,

We always used the Hadoop libraries here to read and write local
files. In Spark 1.0 we started enforcing the rule that you can't
over-write an existing directory because it can cause
confusing/undefined behavior if multiple jobs output to the directory
(they partially clobber each other's output).

https://issues.apache.org/jira/browse/SPARK-1100
https://github.com/apache/spark/pull/11

In the JIRA I actually proposed slightly deviating from Hadoop
semantics and allowing the directory to exist if it is empty, but I
think in the end we decided to just go with the exact same semantics
as Hadoop (i.e. empty directories are a problem).

- Patrick

On Tue, Apr 29, 2014 at 9:43 AM, Dean Wampler  wrote:
> I'm observing one anomalous behavior. With the 1.0.0 libraries, it's using
> HDFS classes for file I/O, while the same script compiled and running with
> 0.9.1 uses only the local-mode File IO.
>
> The script is a variation of the Word Count script. Here are the "guts":
>
> object WordCount2 {
>   def main(args: Array[String]) = {
>
> val sc = new SparkContext("local", "Word Count (2)")
>
> val input = sc.textFile(".../some/local/file").map(line =>
> line.toLowerCase)
> input.cache
>
> val wc2 = input
>   .flatMap(line => line.split("""\W+"""))
>   .map(word => (word, 1))
>   .reduceByKey((count1, count2) => count1 + count2)
>
> wc2.saveAsTextFile("output/some/directory")
>
> sc.stop()
>
> It works fine compiled and executed with 0.9.1. If I recompile and run with
> 1.0.0-RC1, where the same output directory still exists, I get this
> familiar Hadoop-ish exception:
>
> [error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException:
> Output directory
> file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
> already exists
> org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
> file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
> already exists
>  at
> org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
> at
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749)
>  at
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662)
> at
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581)
>  at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057)
> at spark.activator.WordCount2$.main(WordCount2.scala:42)
>  at spark.activator.WordCount2.main(WordCount2.scala)
> ...
>
> Thoughts?
>
>
> On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell  wrote:
>
>> Hey All,
>>
>> This is not an official vote, but I wanted to cut an RC so that people can
>> test against the Maven artifacts, test building with their configuration,
>> etc. We are still chasing down a few issues and updating docs, etc.
>>
>> If you have issues or bug reports for this release, please send an e-mail
>> to the Spark dev list and/or file a JIRA.
>>
>> Commit: d636772 (v1.0.0-rc3)
>>
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221
>>
>> Binaries:
>> http://people.apache.org/~pwendell/spark-1.0.0-rc3/
>>
>> Docs:
>> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/
>>
>> Repository:
>> https://repository.apache.org/content/repositories/orgapachespark-1012/
>>
>> == API Changes ==
>> If you want to test building against Spark there are some minor API
>> changes. We'll get these written up for the final release but I'm noting a
>> few here (not comprehensive):
>>
>> changes to ML vector specification:
>>
>> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10
>>
>> changes to the Java API:
>>
>> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>>
>> coGroup and related functions now return Iterable[T] instead of Seq[T]
>> ==> Call toSeq on the result to restore the old behavior
>>
>> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
>> ==> Call toSeq on the result to restore old behavior
>>
>> Streaming classes have been renamed:
>> NetworkReceiver -> Receiver
>>
>
>
>
> --
> Dean Wampler, Ph.D.
> Typesafe
> @deanwampler
> http://typesafe.com
> http://polyglotprogramming.com


Re: Spark 1.0.0 rc3

2014-04-29 Thread Patrick Wendell
Sorry, got cut off. 0.9.0 and 1.0.0 are not binary compatible,
and in a few cases not source compatible. 1.X will be source
compatible. We are also planning to support binary compatibility in
1.X, but I'm waiting until we make a few releases to officially promise
that, since Scala makes this pretty tricky.

On Tue, Apr 29, 2014 at 11:47 AM, Patrick Wendell  wrote:
>> What are the expectations / guarantees on binary compatibility between
>> 0.9 and 1.0?
>
> There are no guarantees.


Re: Spark 1.0.0 rc3

2014-04-29 Thread Patrick Wendell
> What are the expectations / guarantees on binary compatibility between
> 0.9 and 1.0?

There are no guarantees.


Re: Spark 1.0.0 rc3

2014-04-29 Thread Marcelo Vanzin
Hi Patrick,

What are the expectations / guarantees on binary compatibility between
0.9 and 1.0?

You mention some API changes, which kinda hint that binary
compatibility has already been broken, but I just wanted to point out
that there are other cases, e.g.:

Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:236)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:47)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NoSuchMethodError:
org.apache.spark.SparkContext$.rddToOrderedRDDFunctions(Lorg/apache/spark/rdd/RDD;Lscala/Function1;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;)Lorg/apache/spark/rdd/OrderedRDDFunctions;

(Compiled against 0.9, run against 1.0.)
Offending code:

  val top10 = counts.sortByKey(false).take(10)

Recompiling fixes the problem.


On Tue, Apr 29, 2014 at 1:05 AM, Patrick Wendell  wrote:
> Hey All,
>
> This is not an official vote, but I wanted to cut an RC so that people can
> test against the Maven artifacts, test building with their configuration,
> etc. We are still chasing down a few issues and updating docs, etc.
>
> If you have issues or bug reports for this release, please send an e-mail
> to the Spark dev list and/or file a JIRA.
>
> Commit: d636772 (v1.0.0-rc3)
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221
>
> Binaries:
> http://people.apache.org/~pwendell/spark-1.0.0-rc3/
>
> Docs:
> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/
>
> Repository:
> https://repository.apache.org/content/repositories/orgapachespark-1012/
>
> == API Changes ==
> If you want to test building against Spark there are some minor API
> changes. We'll get these written up for the final release but I'm noting a
> few here (not comprehensive):
>
> changes to ML vector specification:
> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10
>
> changes to the Java API:
> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>
> coGroup and related functions now return Iterable[T] instead of Seq[T]
> ==> Call toSeq on the result to restore the old behavior
>
> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
> ==> Call toSeq on the result to restore old behavior
>
> Streaming classes have been renamed:
> NetworkReceiver -> Receiver



-- 
Marcelo


Code Review for SPARK-1516: Throw exception in yarn client instead of System.exit

2014-04-29 Thread DB Tsai
Hi All,

Since we're launching Spark YARN jobs from our Tomcat application, the default
behavior of calling System.exit when a job finishes or runs into an error
isn't desirable.

We created this PR https://github.com/apache/spark/pull/490 to address the
issue. Since the logic is fairly straightforward, we wonder if it can be
reviewed and included in 1.0.
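A sketch of the behavior the PR asks for (names and structure here are assumed for illustration, not the actual patch): surface failures as an exception the embedding application can catch, instead of killing the shared JVM with System.exit.

```scala
// Hypothetical sketch: report failure via an exception rather than
// System.exit(1), which would take down the embedding Tomcat JVM too.
final class YarnClientException(msg: String) extends RuntimeException(msg)

def submitApplication(succeeds: Boolean): Unit =
  if (!succeeds) throw new YarnClientException("YARN application failed")

// The embedding app (e.g. a servlet) can now recover from a failed submit:
val outcome =
  try { submitApplication(succeeds = false); "submitted" }
  catch { case e: YarnClientException => s"failed: ${e.getMessage}" }
```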

Thanks.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


Re: Spark 1.0.0 rc3

2014-04-29 Thread Dean Wampler
I'm observing one anomalous behavior. With the 1.0.0 libraries, it's using
HDFS classes for file I/O, while the same script, compiled and run with
0.9.1, uses only local-mode file I/O.

The script is a variation of the Word Count script. Here are the "guts":

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // for reduceByKey on pair RDDs

object WordCount2 {
  def main(args: Array[String]) = {

    val sc = new SparkContext("local", "Word Count (2)")

    val input = sc.textFile(".../some/local/file").map(line => line.toLowerCase)
    input.cache

    val wc2 = input
      .flatMap(line => line.split("""\W+"""))
      .map(word => (word, 1))
      .reduceByKey((count1, count2) => count1 + count2)

    wc2.saveAsTextFile("output/some/directory")

    sc.stop()
  }
}

It works fine when compiled and executed with 0.9.1. If I recompile and run
with 1.0.0-RC1 while the same output directory still exists, I get this
familiar Hadoop-ish exception:

[error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException:
Output directory
file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
already exists
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
already exists
  at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749)
  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662)
  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581)
  at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057)
  at spark.activator.WordCount2$.main(WordCount2.scala:42)
  at spark.activator.WordCount2.main(WordCount2.scala)
...

Thoughts?
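For what it's worth, since 1.0 enforces Hadoop's output check (saveAsTextFile throws FileAlreadyExistsException if the directory exists), one workaround is to delete the output directory before each run. A plain-JVM sketch (the helper below is mine, not a Spark API):

```scala
import java.io.File

// Recursively delete a directory tree so a rerun of saveAsTextFile
// doesn't trip Hadoop's "output directory already exists" check.
def deleteRecursively(f: File): Unit = {
  if (f.isDirectory)
    Option(f.listFiles).getOrElse(Array.empty[File]).foreach(deleteRecursively)
  f.delete()
}

val out = new File("output/some/directory")
if (out.exists) deleteRecursively(out)
// wc2.saveAsTextFile(out.getPath)  // now safe to write
```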


On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell  wrote:

> Hey All,
>
> This is not an official vote, but I wanted to cut an RC so that people can
> test against the Maven artifacts, test building with their configuration,
> etc. We are still chasing down a few issues and updating docs, etc.
>
> If you have issues or bug reports for this release, please send an e-mail
> to the Spark dev list and/or file a JIRA.
>
> Commit: d636772 (v1.0.0-rc3)
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221
>
> Binaries:
> http://people.apache.org/~pwendell/spark-1.0.0-rc3/
>
> Docs:
> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/
>
> Repository:
> https://repository.apache.org/content/repositories/orgapachespark-1012/
>
> == API Changes ==
> If you want to test building against Spark there are some minor API
> changes. We'll get these written up for the final release but I'm noting a
> few here (not comprehensive):
>
> changes to ML vector specification:
>
> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10
>
> changes to the Java API:
>
> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>
> coGroup and related functions now return Iterable[T] instead of Seq[T]
> ==> Call toSeq on the result to restore the old behavior
>
> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
> ==> Call toSeq on the result to restore old behavior
>
> Streaming classes have been renamed:
> NetworkReceiver -> Receiver
>



-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com


Spark 1.0.0 rc3

2014-04-29 Thread Patrick Wendell
Hey All,

This is not an official vote, but I wanted to cut an RC so that people can
test against the Maven artifacts, test building with their configuration,
etc. We are still chasing down a few issues and updating docs, etc.

If you have issues or bug reports for this release, please send an e-mail
to the Spark dev list and/or file a JIRA.

Commit: d636772 (v1.0.0-rc3)
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221

Binaries:
http://people.apache.org/~pwendell/spark-1.0.0-rc3/

Docs:
http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/

Repository:
https://repository.apache.org/content/repositories/orgapachespark-1012/

== API Changes ==
If you want to test building against Spark there are some minor API
changes. We'll get these written up for the final release but I'm noting a
few here (not comprehensive):

changes to ML vector specification:
http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10

changes to the Java API:
http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

coGroup and related functions now return Iterable[T] instead of Seq[T]
==> Call toSeq on the result to restore the old behavior

SparkContext.jarOfClass returns Option[String] instead of Seq[String]
==> Call toSeq on the result to restore old behavior

Streaming classes have been renamed:
NetworkReceiver -> Receiver
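The two toSeq migrations above can be sketched without a Spark cluster, since only the result types changed (the Map and jar path below are placeholder values for illustration):

```scala
// cogroup values: Seq[V] in 0.9, Iterable[V] in 1.0 -- toSeq restores Seq.
val grouped: Map[String, (Iterable[Int], Iterable[Int])] =
  Map("a" -> (List(1, 2), List(3)))
val asSeqs: Map[String, (Seq[Int], Seq[Int])] =
  grouped.map { case (k, (vs, ws)) => k -> ((vs.toSeq, ws.toSeq)) }

// SparkContext.jarOfClass: Seq[String] in 0.9, Option[String] in 1.0.
val jar: Option[String] = Some("/path/to/app.jar") // hypothetical value
val jars: Seq[String] = jar.toSeq                  // old Seq behavior
```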