That suggestion got lost along the way, and IIRC the patch didn't include it. It's a good idea, though, if nothing else to provide a simple means of backwards compatibility.
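Until such a flag lands, here is a minimal sketch of one way to restore the old behavior: delete the output directory through the Hadoop FileSystem API before saving. The object name and both paths below are placeholders, and the sketch assumes the standard sc.hadoopConfiguration accessor:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SparkContext

    object OverwriteWorkaround {
      def main(args: Array[String]) = {
        val sc = new SparkContext("local", "Overwrite Workaround")

        // Hypothetical output location; substitute your own path.
        val outputPath = new Path("output/some/directory")

        // Resolve the FileSystem behind the path (local or HDFS, depending
        // on the URI) and remove any previous output. delete(path, recursive)
        // is a no-op if the path does not exist.
        val fs = FileSystem.get(outputPath.toUri, sc.hadoopConfiguration)
        fs.delete(outputPath, true)

        // Hypothetical input path; the save can no longer collide with old output.
        sc.textFile("input.txt").saveAsTextFile(outputPath.toString)

        sc.stop()
      }
    }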
I created a JIRA for the flag. It's very straightforward, so maybe someone can pick it up quickly: https://issues.apache.org/jira/browse/SPARK-1677

On Tue, Apr 29, 2014 at 2:20 PM, Dean Wampler <deanwamp...@gmail.com> wrote:
> Thanks. I'm fine with the logic change, although I was a bit surprised to
> see Hadoop used for file I/O.
>
> Anyway, the JIRA issue and pull request discussions mention a flag to
> enable overwrites. That would be very convenient for a tutorial I'm
> writing, although I wouldn't recommend it for normal use, of course.
> However, I can't figure out whether this flag actually exists. I found the
> spark.files.overwrite property, but that doesn't apply. Does this override
> flag, method call, or method argument actually exist?
>
> Thanks,
> Dean
>
> On Tue, Apr 29, 2014 at 1:54 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>
>> Hi Dean,
>>
>> We have always used the Hadoop libraries here to read and write local
>> files. In Spark 1.0 we started enforcing the rule that you can't
>> overwrite an existing directory, because that can cause
>> confusing/undefined behavior if multiple jobs output to the same
>> directory (they partially clobber each other's output).
>>
>> https://issues.apache.org/jira/browse/SPARK-1100
>> https://github.com/apache/spark/pull/11
>>
>> In the JIRA I actually proposed deviating slightly from Hadoop
>> semantics and allowing the directory to exist if it is empty, but I
>> think in the end we decided to go with exactly the same semantics as
>> Hadoop (i.e., even empty directories are a problem).
>>
>> - Patrick
>>
>> On Tue, Apr 29, 2014 at 9:43 AM, Dean Wampler <deanwamp...@gmail.com> wrote:
>>> I'm observing one anomalous behavior. With the 1.0.0 libraries, it's
>>> using HDFS classes for file I/O, while the same script compiled and
>>> run with 0.9.1 uses only local-mode file I/O.
>>>
>>> The script is a variation of the Word Count script. Here are the "guts":
>>>
>>> import org.apache.spark.SparkContext
>>>
>>> object WordCount2 {
>>>   def main(args: Array[String]) = {
>>>     val sc = new SparkContext("local", "Word Count (2)")
>>>
>>>     val input = sc.textFile(".../some/local/file").map(line =>
>>>       line.toLowerCase)
>>>     input.cache
>>>
>>>     val wc2 = input
>>>       .flatMap(line => line.split("""\W+"""))
>>>       .map(word => (word, 1))
>>>       .reduceByKey((count1, count2) => count1 + count2)
>>>
>>>     wc2.saveAsTextFile("output/some/directory")
>>>
>>>     sc.stop()
>>>   }
>>> }
>>>
>>> It works fine compiled and executed with 0.9.1.
>>> If I recompile and run with 1.0.0-RC1 while the same output directory
>>> still exists, I get this familiar Hadoop-ish exception:
>>>
>>> [error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException:
>>> Output directory
>>> file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
>>> already exists
>>> org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
>>> file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
>>> already exists
>>>   at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
>>>   at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749)
>>>   at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662)
>>>   at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581)
>>>   at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057)
>>>   at spark.activator.WordCount2$.main(WordCount2.scala:42)
>>>   at spark.activator.WordCount2.main(WordCount2.scala)
>>>   ...
>>>
>>> Thoughts?
>>>
>>> On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell <pwend...@gmail.com> wrote:
>>>
>>>> Hey All,
>>>>
>>>> This is not an official vote, but I wanted to cut an RC so that people
>>>> can test against the Maven artifacts, test building with their
>>>> configuration, etc. We are still chasing down a few issues and
>>>> updating docs, etc.
>>>>
>>>> If you have issues or bug reports for this release, please send an
>>>> e-mail to the Spark dev list and/or file a JIRA.
>>>>
>>>> Commit: d636772 (v1.0.0-rc3)
>>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221
>>>>
>>>> Binaries:
>>>> http://people.apache.org/~pwendell/spark-1.0.0-rc3/
>>>>
>>>> Docs:
>>>> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/
>>>>
>>>> Repository:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1012/
>>>>
>>>> == API Changes ==
>>>> If you want to test building against Spark, there are some minor API
>>>> changes. We'll get these written up for the final release, but I'm
>>>> noting a few here (not comprehensive):
>>>>
>>>> Changes to the MLlib vector specification:
>>>> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10
>>>>
>>>> Changes to the Java API:
>>>> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>>>>
>>>> coGroup and related functions now return Iterable[T] instead of Seq[T]
>>>> ==> Call toSeq on the result to restore the old behavior
>>>>
>>>> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
>>>> ==> Call toSeq on the result to restore the old behavior
>>>>
>>>> Streaming classes have been renamed:
>>>> NetworkReceiver -> Receiver
>>>
>>> --
>>> Dean Wampler, Ph.D.
>>> Typesafe
>>> @deanwampler
>>> http://typesafe.com
>>> http://polyglotprogramming.com
>
> --
> Dean Wampler, Ph.D.
> Typesafe
> @deanwampler
> http://typesafe.com
> http://polyglotprogramming.com
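To make the coGroup and jarOfClass migrations above concrete, here is a small sketch against the 1.0 API. The object name and the toy data are made up purely for illustration:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // implicit conversions for pair-RDD ops

    object MigrationSketch {
      def main(args: Array[String]) = {
        val sc = new SparkContext("local", "Migration Sketch")

        // Toy pair RDDs, for illustration only.
        val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
        val right = sc.parallelize(Seq(("a", 3)))

        // cogroup now yields Iterable values; call toSeq to get the old
        // Seq-based shape back.
        val grouped = left.cogroup(right).mapValues {
          case (ls, rs) => (ls.toSeq, rs.toSeq)
        }
        grouped.collect().foreach(println)

        // jarOfClass now returns Option[String]; toSeq restores the old
        // Seq[String] result.
        val jars: Seq[String] = SparkContext.jarOfClass(this.getClass).toSeq

        sc.stop()
      }
    }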