Hi Dean,

We have always used the Hadoop libraries here to read and write local files. In Spark 1.0 we started enforcing the rule that you can't overwrite an existing directory, because doing so can cause confusing/undefined behavior if multiple jobs output to the same directory (they partially clobber each other's output).
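If you really need the old clobbering behavior, here is a minimal sketch of a workaround (deleteIfExists is my own hypothetical helper, not a Spark API, and it assumes you accept the partial-overwrite risk described above): remove the directory yourself with Hadoop's FileSystem API before saving.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Hypothetical helper, not part of Spark: recursively delete the
    // output directory (if it exists) before the job writes to it.
    def deleteIfExists(dir: String): Unit = {
      val path = new Path(dir)
      val fs = FileSystem.get(path.toUri, new Configuration())
      if (fs.exists(path)) fs.delete(path, true) // true = recursive
    }

Calling deleteIfExists("output/some/directory") right before saveAsTextFile restores the pre-1.0 effect, at your own risk.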
https://issues.apache.org/jira/browse/SPARK-1100
https://github.com/apache/spark/pull/11

In the JIRA I actually proposed slightly deviating from Hadoop semantics by allowing the directory to exist if it is empty, but I think in the end we decided to go with exactly the same semantics as Hadoop (i.e. even an empty directory is a problem).

- Patrick

On Tue, Apr 29, 2014 at 9:43 AM, Dean Wampler <deanwamp...@gmail.com> wrote:
> I'm observing one anomalous behavior. With the 1.0.0 libraries, it's using
> HDFS classes for file I/O, while the same script compiled and running with
> 0.9.1 uses only the local-mode file I/O.
>
> The script is a variation of the Word Count script. Here are the "guts":
>
> import org.apache.spark.SparkContext
>
> object WordCount2 {
>   def main(args: Array[String]) = {
>     val sc = new SparkContext("local", "Word Count (2)")
>
>     val input = sc.textFile(".../some/local/file").map(line => line.toLowerCase)
>     input.cache
>
>     val wc2 = input
>       .flatMap(line => line.split("""\W+"""))                    // split into words
>       .map(word => (word, 1))
>       .reduceByKey((count1, count2) => count1 + count2)          // sum the counts per word
>
>     wc2.saveAsTextFile("output/some/directory")
>
>     sc.stop()
>   }
> }
>
> It works fine compiled and executed with 0.9.1. If I recompile and run with
> 1.0.0-RC1, when the same output directory still exists, I get this
> familiar Hadoop-ish exception:
>
> [error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException:
> Output directory
> file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
> already exists
> org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
> file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
> already exists
>     at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
>     at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749)
>     at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662)
>     at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581)
>     at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057)
>     at spark.activator.WordCount2$.main(WordCount2.scala:42)
>     at spark.activator.WordCount2.main(WordCount2.scala)
>     ...
>
> Thoughts?
>
>
> On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell <pwend...@gmail.com> wrote:
>
>> Hey All,
>>
>> This is not an official vote, but I wanted to cut an RC so that people can
>> test against the Maven artifacts, test building with their configuration,
>> etc. We are still chasing down a few issues and updating the docs.
>>
>> If you have issues or bug reports for this release, please send an e-mail
>> to the Spark dev list and/or file a JIRA.
>>
>> Commit: d636772 (v1.0.0-rc3)
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221
>>
>> Binaries:
>> http://people.apache.org/~pwendell/spark-1.0.0-rc3/
>>
>> Docs:
>> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/
>>
>> Repository:
>> https://repository.apache.org/content/repositories/orgapachespark-1012/
>>
>> == API Changes ==
>> If you want to test building against Spark, there are some minor API
>> changes.
>> We'll get these written up for the final release, but I'm noting a
>> few here (not comprehensive):
>>
>> Changes to the MLlib vector specification:
>> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10
>>
>> Changes to the Java API:
>> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>>
>> coGroup and related functions now return Iterable[T] instead of Seq[T]
>> ==> Call toSeq on the result to restore the old behavior
>>
>> SparkContext.jarOfClass now returns Option[String] instead of Seq[String]
>> ==> Call toSeq on the result to restore the old behavior
>>
>> Streaming classes have been renamed:
>> NetworkReceiver -> Receiver
>
>
> --
> Dean Wampler, Ph.D.
> Typesafe
> @deanwampler
> http://typesafe.com
> http://polyglotprogramming.com