Re: Spark deletes all existing partitions in SaveMode.Overwrite - Expected behavior?

2016-07-06 Thread nirandap
Hi Yash, Yes, AFAIK, that is the expected behavior of the Overwrite mode. I think you can use the following approach if you want to perform a job on each partition: [1] for each partition in the DF:
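
A minimal sketch of what such a per-partition write might look like, assuming the dataDF, column names, and S3 layout from Yash's question below, and that dataDF has a single text column besides the partition columns; collecting the distinct partition values on the driver is only practical when their number is small:

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.col

    val parts = dataDF.select("year", "month", "date").distinct().collect()

    parts.foreach { row =>
      val Seq(y, m, d) = (0 to 2).map(i => row.get(i).toString)
      dataDF
        .filter(col("year") === y && col("month") === m && col("date") === d)
        .drop("year", "month", "date")
        .write
        .mode(SaveMode.Overwrite) // Overwrite is now scoped to this one partition directory
        .text(s"s3://data/test2/events/year=$y/month=$m/date=$d/")
    }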

Spark deletes all existing partitions in SaveMode.Overwrite - Expected behavior?

2016-07-06 Thread Yash Sharma
Hi All, While writing a partitioned data frame as partitioned text files, I see that Spark deletes all available partitions while writing a few new partitions. dataDF.write.partitionBy("year", "month", "date").mode(SaveMode.Overwrite).text("s3://data/test2/events/") Is this an expected behavior
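
(Aside for later readers: Spark 2.3 added a dynamic partition-overwrite mode that addresses exactly this, though it was not available at the time of this thread. A minimal sketch, assuming the same dataDF:)

    import org.apache.spark.sql.SaveMode

    // With "dynamic" mode, only the partitions present in dataDF are replaced;
    // all other existing partitions under the path are left untouched.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    dataDF.write
      .partitionBy("year", "month", "date")
      .mode(SaveMode.Overwrite)
      .text("s3://data/test2/events/")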

Stopping Spark executors

2016-07-06 Thread Mr rty ff
Hi, I'd like to recreate this bug: https://issues.apache.org/jira/browse/SPARK-13979 They talk about stopping Spark executors. It's not clear exactly how I stop the executors. Thanks
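
One way to stop an executor for such a repro, as a sketch: SparkContext exposes a @DeveloperApi kill method in coarse-grained cluster modes, or you can kill the executor JVM on the worker node directly. The executor ID below is illustrative:

    // Ask the cluster manager to tear down executor "1" (ID is illustrative).
    sc.killExecutors(Seq("1"))

    // Alternatively, on the worker node, kill the executor JVM by hand:
    //   jps | grep CoarseGrainedExecutorBackend   # find the executor's PID
    //   kill -9 <pid>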

[PySPARK] - Py4J binary transfer survey

2016-07-06 Thread Holden Karau
Hi PySpark Devs, The Py4J developer has a survey up for Py4J users - https://github.com/bartdag/py4j/issues/237 - it might be worth our time to provide some input on how we are using Py4J, and how we would like to be using it if binary transfer were improved. I'm happy to fill it out with my thoughts - but if

Re: [VOTE] Release Apache Spark 2.0.0 (RC2)

2016-07-06 Thread Ted Yu
Running the following command: build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6 -Psparkr -Dhadoop.version=2.7.0 package The build stopped with this test failure: - SPARK-9757 Persist Parquet relation with decimal column *** FAILED *** On Wed, Jul 6, 2016 at 6:25 AM,

Re: [VOTE] Release Apache Spark 2.0.0 (RC2)

2016-07-06 Thread Cody Koeninger
I know some usages of the 0.10 Kafka connector will be broken until https://github.com/apache/spark/pull/14026 is merged, but the 0.10 connector is a new feature, so it's not blocking. Sean, I'm assuming the DirectKafkaStreamSuite failure you saw was for 0.8? I'll take another look at it. On Wed,

Re: [VOTE] Release Apache Spark 2.0.0 (RC2)

2016-07-06 Thread Sean Owen
Yeah, we still have some blockers; I agree SPARK-16379, which came up yesterday, is a blocker. We also have 5 existing blockers, all doc-related: SPARK-14808 (Spark MLlib, GraphX, SparkR 2.0 QA umbrella); SPARK-14812 (ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit); SPARK-14816

Re: [VOTE] Release Apache Spark 2.0.0 (RC2)

2016-07-06 Thread Maciej Bryński
-1 https://issues.apache.org/jira/browse/SPARK-16379 https://issues.apache.org/jira/browse/SPARK-16371 2016-07-06 7:35 GMT+02:00 Reynold Xin: > Please vote on releasing the following candidate as Apache Spark version > 2.0.0. The vote is open until Friday, July 8, 2016 at

Re: Why's ds.foreachPartition(println) not possible?

2016-07-06 Thread Jacek Laskowski
Thanks Cody, Reynold, and Ryan! Learnt a lot and feel "corrected". Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Wed, Jul 6, 2016 at 2:46 AM, Shixiong(Ryan) Zhu
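
For context on the subject line, a sketch of the issue and the usual workaround, assuming Spark 2.0 on Scala 2.11 with an illustrative Dataset: Dataset.foreachPartition is overloaded (a Scala Iterator[T] => Unit variant and a Java ForeachPartitionFunction[T] variant), and println is itself overloaded, so the bare method reference cannot be resolved to a single overload; an explicit lambda fixes it:

    import spark.implicits._                        // auto-imported in spark-shell

    val ds = spark.range(3).map(_.toString)         // Dataset[String], illustrative
    // ds.foreachPartition(println)                 // ambiguous: does not compile
    ds.foreachPartition((it: Iterator[String]) => it.foreach(println))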

Re: Spark Task failure with File segment length as negative

2016-07-06 Thread Priya Ch
Has anyone resolved this? Thanks, Padma CH On Wed, Jun 22, 2016 at 4:39 PM, Priya Ch wrote: > Hi All, > > I am running a Spark Application with 1.8TB of data (which is stored in Hive > table format). I am reading the data using HiveContext and processing it. >

Re: MinMaxScaler With features include category variables

2016-07-06 Thread Yuhao Yang
You may also find VectorSlicer and SQLTransformer useful in your case. Just out of curiosity, how would you typically handle categorical features, other than with OneHotEncoder? Regards, Yuhao 2016-07-01 4:00 GMT-07:00 Yanbo Liang: > You can combine the columns which are need
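
A minimal sketch of how these pieces can be combined, with illustrative column names (num1, num2, category) and an input DataFrame df assumed to exist: scale only the numeric columns with MinMaxScaler, encode the categorical one, then assemble everything into a single feature vector:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{MinMaxScaler, OneHotEncoder, StringIndexer, VectorAssembler}

    // MinMaxScaler operates on a Vector column, so assemble numeric columns first.
    val numericVec = new VectorAssembler()
      .setInputCols(Array("num1", "num2")).setOutputCol("numericVec")
    val scaler  = new MinMaxScaler().setInputCol("numericVec").setOutputCol("numericScaled")
    val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIdx")
    val encoder = new OneHotEncoder().setInputCol("categoryIdx").setOutputCol("categoryVec")
    // Final feature vector: scaled numerics + one-hot encoded category.
    val assembler = new VectorAssembler()
      .setInputCols(Array("numericScaled", "categoryVec")).setOutputCol("features")

    val model = new Pipeline()
      .setStages(Array(numericVec, scaler, indexer, encoder, assembler))
      .fit(df)
    val transformed = model.transform(df)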