Hi Jan,
Is the error because a past run of the job has already written to the
location?

In that case, you can add more granularity by partitioning on a 'time' column
in addition to year and month. That should give every run a distinct set of paths.
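
Something along these lines (an untested sketch reusing your df; I'm using a
millisecond run timestamp for 'time', and adding mode("append") since the
default save mode errors out when the base path already exists):

    import com.databricks.spark.avro._
    import org.apache.spark.sql.functions.lit

    // tag this run's rows with a timestamp so each run lands in its own partitions
    val runTime = System.currentTimeMillis

    df.withColumn("time", lit(runTime))
      .write
      .mode("append")  // don't fail just because /tmp/data already exists
      .partitionBy("year", "month", "time")
      .avro("/tmp/data")

Each run then appends new year=/month=/time= directories under /tmp/data
instead of failing on the existing base path.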

Let us know if that helps or if I missed anything.

Good luck!

- Thanks (via mobile, excuse brevity).
On Dec 22, 2015 2:31 PM, "Jan Holmberg" <jan.holmb...@perigeum.fi> wrote:

> Hi,
> I'm stuck writing partitioned data to HDFS. The example below ends up
> with an 'already exists' error.
>
> I'm wondering how to handle the streaming use case.
>
> What is the intended way to write streaming data to HDFS? What am I
> missing?
>
> cheers,
> -jan
>
>
> import com.databricks.spark.avro._
> import org.apache.spark.sql.SQLContext
>
> val sqlContext = new SQLContext(sc)
> import sqlContext.implicits._
>
> val df = Seq(
>   (2012, 8, "Batman", 9.8),
>   (2012, 8, "Hero", 8.7),
>   (2012, 7, "Robot", 5.5),
>   (2011, 7, "Git", 2.0)
> ).toDF("year", "month", "title", "rating")
>
> df.write.partitionBy("year", "month").avro("/tmp/data")
>
> val df2 = Seq(
>   (2012, 10, "Batman", 9.8),
>   (2012, 10, "Hero", 8.7),
>   (2012, 9, "Robot", 5.5),
>   (2011, 9, "Git", 2.0)
> ).toDF("year", "month", "title", "rating")
>
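> // the second write to the same base path is what raises 'already exists'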
> df2.write.partitionBy("year", "month").avro("/tmp/data")