Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-21 Thread Wenchen Fan
IIRC, the new "spark.sql.hive.caseSensitiveInferenceMode" stuff will scan all table files only once, and write the inferred schema back to the metastore so that we don't need to do the schema inference again. So technically this will introduce a performance regression for the first query, but
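
[For context, a minimal sketch of the setting being discussed; the table name is a placeholder, and the exact default value depends on the release under vote:]

    import org.apache.spark.sql.SparkSession

    // Sketch: pin the schema-inference behaviour for Hive tables.
    // INFER_AND_SAVE scans the files once and writes the inferred schema back
    // to the metastore; NEVER_INFER skips the scan entirely (as discussed above).
    val spark = SparkSession.builder()
      .appName("schema-inference-demo")
      .enableHiveSupport()
      .config("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
      .getOrCreate()

    // The first query against an affected table triggers the one-time file scan;
    // subsequent queries reuse the schema stored in the metastore.
    spark.sql("SELECT * FROM some_hive_table LIMIT 10").show()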

Re: [SparkR] - options around setting up SparkSession / SparkContext

2017-04-21 Thread Felix Cheung
How would you handle this in Scala? If you are adding a wrapper func like getSparkSession for Scala, and have your users call it, can't you do the same in SparkR? After all, while it's true that you don't need a SparkSession object to call the R API, someone still needs to call sparkR.session() to
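
[A rough Scala sketch of the wrapper approach mentioned here; getSparkSession is the hypothetical helper name from the message, and the config key is a made-up site-specific setting. The lazy initialization mirrors what a SparkR wrapper around sparkR.session() would do:]

    import org.apache.spark.sql.SparkSession

    object SparkEnv {
      // Sketch: the session is only created the first time user code asks for it,
      // so code paths that never touch Spark never start a SparkContext.
      private lazy val session: SparkSession = SparkSession.builder()
        .appName("user-environment")
        .config("spark.some.custom.setting", "value") // hypothetical site-specific config
        .getOrCreate()

      def getSparkSession: SparkSession = session
    }

    // User code calls SparkEnv.getSparkSession only when it actually needs Spark.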

[SparkR] - options around setting up SparkSession / SparkContext

2017-04-21 Thread Vin J
I need to make an R environment available where the SparkSession/SparkContext needs to be set up in a specific way. The user simply accesses this environment and executes his/her code. If the user code does not access any Spark functions, I do not want to create a SparkContext unnecessarily. In

What is correct behavior for spark.task.maxFailures?

2017-04-21 Thread Chawla,Sumit
I am seeing a strange issue. I had a badly behaving slave that failed the entire job. I have set spark.task.maxFailures to 8 for my job. It seems like all task retries happen on the same slave in case of failure. My expectation was that a task would be retried on a different slave after a failure, and
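
[For reference, a hedged sketch of the settings involved: spark.task.maxFailures only caps the total attempts per task, while the blacklisting settings (experimental around Spark 2.1; names and availability depend on the version) are what push retries onto different executors/nodes:]

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Sketch: allow up to 8 attempts per task, and ask the scheduler to avoid
    // re-running a failed task on the same executor/node via blacklisting.
    val conf = new SparkConf()
      .set("spark.task.maxFailures", "8")
      .set("spark.blacklist.enabled", "true")
      .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1")
      .set("spark.blacklist.task.maxTaskAttemptsPerNode", "2")

    val spark = SparkSession.builder().config(conf).getOrCreate()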

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-21 Thread Michael Armbrust
Thanks for pointing this out, Michael. Based on the conversation on the PR, this seems like a risky change to include in a release branch with a default other than NEVER_INFER. +Wenchen? What do you think? On Thu, Apr 20, 2017

ML Repo using spark

2017-04-21 Thread Saikat Kanjilal
Folks, I've been building out a large machine learning repository using Spark as the compute platform, running on YARN and Hadoop. I was wondering if folks have some best-practice-oriented thoughts around unit testing/integration testing this application. I am using spark-submit and a
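
[Not from the thread itself, just a sketch of one common pattern: factor the job logic into functions that take a SparkSession, unit-test them against a local-mode session, and keep spark-submit for integration runs on YARN. Names and paths below are made up:]

    import org.apache.spark.sql.SparkSession

    // Sketch: job logic is a plain function so it can be unit tested without YARN.
    object FeatureJob {
      def countLongRows(spark: SparkSession, path: String): Long = {
        spark.read.textFile(path).filter(_.length > 10).count()
      }
    }

    // In a test (e.g. ScalaTest), spin up a local session instead of submitting to the cluster:
    //   val spark = SparkSession.builder().master("local[2]").appName("test").getOrCreate()
    //   assert(FeatureJob.countLongRows(spark, "src/test/resources/sample.txt") == expected)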

Timestamp formatting in partitioned directory output: "YYYY-MM-dd HH%3Amm%3Ass" vs "YYYY-MM-ddTHH%3Amm%3Ass"

2017-04-21 Thread dataeng88
I have a feature request or suggestion: Spark 2.1 currently generates partitioned directory names like "timestamp=2015-06-20 08%3A00%3A00". I request and recommend that it use the "T" delimiter between the date and time portions rather than a space character, like,
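
[To illustrate the behaviour being described, a small sketch (column and output path are made up): partitioning a write by a timestamp column yields directory names where the date and time are separated by a space and the colons are percent-encoded, as quoted above:]

    import java.sql.Timestamp
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // Sketch: writing a DataFrame partitioned by a timestamp column produces
    // directories like "timestamp=2015-06-20 08%3A00%3A00" (space between date
    // and time, colons percent-encoded), which is what this request is about.
    val df = Seq(
      (1, Timestamp.valueOf("2015-06-20 08:00:00")),
      (2, Timestamp.valueOf("2015-06-20 09:00:00"))
    ).toDF("id", "timestamp")

    df.write.partitionBy("timestamp").parquet("/tmp/partition_demo")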