Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-21 Thread Wenchen Fan
IIRC, the new "spark.sql.hive.caseSensitiveInferenceMode" stuff will scan
all table files only once, and write the inferred schema back to the
metastore so that we don't need to do the schema inference again.

So technically this will introduce a performance regression for the first
query, but compared to branch-2.0 it's not a performance regression. And
this patch fixed a regression in branch-2.1 for queries that can run in
branch-2.0. Personally, I think we should keep INFER_AND_SAVE as the default mode.
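
(For anyone who wants to experiment with the mode, a minimal sketch of setting it at session build time; the values discussed in this thread are INFER_AND_SAVE, INFER_ONLY and NEVER_INFER, and the app name below is a placeholder:)

```
// Sketch: pinning the schema-inference mode when the session is built.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("infer-mode-example")   // placeholder
  .enableHiveSupport()
  .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
  .getOrCreate()
```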

+ [Eric], what do you think?

On Sat, Apr 22, 2017 at 1:37 AM, Michael Armbrust 
wrote:

> Thanks for pointing this out, Michael.  Based on the conversation on the
> PR 
> this seems like a risky change to include in a release branch with a
> default other than NEVER_INFER.
>
> +Wenchen?  What do you think?
>
> On Thu, Apr 20, 2017 at 4:14 PM, Michael Allman 
> wrote:
>
>> We've identified the cause of the change in behavior. It is related to
>> the SQL conf key "spark.sql.hive.caseSensitiveInferenceMode". This key
>> and its related functionality were absent from our previous build. The
>> default setting in the current build was causing Spark to attempt to scan
>> all table files during query analysis. Changing this setting to NEVER_INFER
>> disabled this operation and resolved the issue we had.
>>
>> Michael
>>
>>
>> On Apr 20, 2017, at 3:42 PM, Michael Allman  wrote:
>>
>> I want to caution that in testing a build from this morning's branch-2.1
>> we found that Hive partition pruning was not working. We found that Spark
>> SQL was fetching all Hive table partitions for a very simple query whereas
>> in a build from several weeks ago it was fetching only the required
>> partitions. I cannot currently think of a reason for the regression outside
>> of some difference between branch-2.1 from our previous build and
>> branch-2.1 from this morning.
>>
>> That's all I know right now. We are actively investigating to find the
>> root cause of this problem, and specifically whether this is a problem in
>> the Spark codebase or not. I will report back when I have an answer to that
>> question.
>>
>> Michael
>>
>>
>> On Apr 18, 2017, at 11:59 AM, Michael Armbrust 
>> wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.1.1. The vote is open until Fri, April 21st, 2017 at 13:00 PST and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.1
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.1.1-rc3
>> (2ed19cff2f6ab79a718526e5d16633412d8c4dd4)
>>
>> List of JIRA tickets resolved can be found with this filter
>> 
>> .
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1230/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running it on this release candidate, then
>> reporting any regressions.
>>
>> *What should happen to JIRA tickets still targeting 2.1.1?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.1.2 or 2.2.0.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.1.0.
>>
>> *What happened to RC1?*
>>
>> There were issues with the release packaging and as a result it was skipped.
>>
>>
>>
>>
>


Re: [SparkR] - options around setting up SparkSession / SparkContext

2017-04-21 Thread Felix Cheung
How would you handle this in Scala?

If you are adding a wrapper func like getSparkSession for Scala and having your
users call it, can't you do the same in SparkR? After all, while it's true that
you don't need a SparkSession object to call the R API, someone still needs to
call sparkR.session() to initialize the current session.
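
Something like the sketch below is what I mean for the Scala side (getSparkSession is just an illustrative name, not an existing API, and the config key is a placeholder):

```
// Illustrative only: a lazily-initialized, pre-configured SparkSession handle.
// Nothing is started until user code actually touches getSparkSession.
import org.apache.spark.sql.SparkSession

object ManagedSpark {
  lazy val getSparkSession: SparkSession = SparkSession.builder()
    .appName("custom-environment")                  // placeholder
    .config("spark.some.custom.setting", "value")   // hypothetical setting
    .getOrCreate()
}

// User code only triggers session creation on first access, e.g.:
// val count = ManagedSpark.getSparkSession.range(10).count()
```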

Also, what parts of the Spark environment do you want to customize?

Can these be set via environment variables or in spark-defaults.conf?
spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties


_________________________________
From: Vin J
Sent: Friday, April 21, 2017 2:22 PM
Subject: [SparkR] - options around setting up SparkSession / SparkContext



I need to make an R environment available where the SparkSession/SparkContext 
needs to be setup a specific way. The user simply accesses this environment and 
executes his/her code. If the user code does not access any Spark functions, I 
do not want to create a SparkContext unnecessarily.

In Scala/Python environments, the user can't access spark without first 
referencing SparkContext / SparkSession classes. So the above (lazy and/or 
custom SparkSession/Context creation) is easily met by offering 
sparkContext/sparkSession handles to the user that are either wrappers on 
Spark's classes or have lazy evaluation semantics. This way only when the user 
accesses these handles to sparkContext/Session will the SparkSession/Context 
actually get set up without the user needing to know all the details about 
initializing the SparkContext/Session.

However, achieving the same doesn't appear to be so straightforward in R. From 
what I see, executing sparkR.session(...) sets up private variables in 
SparkR:::.sparkREnv (.sparkRjsc, .sparkRsession). The way the SparkR API works, a
user doesn't need a handle to the spark session as such. Executing functions 
like so:  "df <- as.DataFrame(..)" implicitly access the private vars in 
SparkR:::.sparkREnv to get access to the sparkContext etc that are expected to 
have been created by a prior call to sparkR.session()/sparkR.init() etc.

Therefore, to inject any custom/lazy behavior into this I don't see a way 
except through having my code (that sits outside of Spark) apply a 
delayedAssign() or a makeActiveBinding() on SparkR:::.sparkRsession /
.sparkRjsc variables. This way, when Spark code internally references them, my
wrapper/lazy code gets executed to do whatever I need done.

However, I am seeing some limitations in applying even this approach to SparkR:
it will not work unless some minor changes are made in the SparkR code. But
before I open a PR with these changes I wanted to check whether there is a
better way to achieve this. I am far from an R expert and could be missing
something here.

If you'd rather see this in a JIRA and a PR, let me know and I'll go ahead and 
open one.

Regards,
Vin.






[SparkR] - options around setting up SparkSession / SparkContext

2017-04-21 Thread Vin J
I need to make an R environment available where the
SparkSession/SparkContext needs to be setup a specific way. The user simply
accesses this environment and executes his/her code. If the user code does
not access any Spark functions, I do not want to create a SparkContext
unnecessarily.

In Scala/Python environments, the user can't access spark without first
referencing SparkContext / SparkSession classes. So the above (lazy and/or
custom SparkSession/Context creation) is easily met by offering
sparkContext/sparkSession handles to the user that are either wrappers on
Spark's classes or have lazy evaluation semantics. This way only when the
user accesses these handles to sparkContext/Session will the
SparkSession/Context actually get set up without the user needing to know
all the details about initializing the SparkContext/Session.

However, achieving the same doesn't appear to be so straightforward in R.
From what I see, executing sparkR.session(...) sets up private variables in
SparkR:::.sparkREnv (.sparkRjsc, .sparkRsession). The way the SparkR API
works, a user doesn't need a handle to the spark session as such. Executing
functions like so:  "df <- as.DataFrame(..)" implicitly access the private
vars in SparkR:::.sparkREnv to get access to the sparkContext etc that are
expected to have been created by a prior call to
sparkR.session()/sparkR.init() etc.

Therefore, to inject any custom/lazy behavior into this I don't see a way
except through having my code (that sits outside of Spark) apply a
delayedAssign() or a makeActiveBinding() on SparkR:::.sparkRsession /
.sparkRjsc variables. This way, when Spark code internally references them,
my wrapper/lazy code gets executed to do whatever I need done.

However, I am seeing some limitations in applying even this approach to
SparkR: it will not work unless some minor changes are made in the SparkR
code. But before I open a PR with these changes I wanted to check whether
there is a better way to achieve this. I am far from an R expert and could
be missing something here.

If you'd rather see this in a JIRA and a PR, let me know and I'll go ahead
and open one.

Regards,
Vin.


What is correct behavior for spark.task.maxFailures?

2017-04-21 Thread Chawla,Sumit
I am seeing a strange issue. I had a badly behaving slave that failed the
entire job. I have set spark.task.maxFailures to 8 for my job. It seems like
all task retries happen on the same slave in case of failure. My expectation
was that a task would be retried on a different slave after a failure, and the
chance of all 8 retries landing on the same slave is very low.
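
For reference, a minimal sketch of how the limit is set in my job (the app name is a placeholder):

```
// Sketch: setting the task retry limit. Where each retry is scheduled
// (same executor/host or a different one) is up to the scheduler, which
// is exactly what I'm asking about.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("max-failures-example")   // placeholder
  .set("spark.task.maxFailures", "8")

val sc = new SparkContext(conf)
```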


Regards
Sumit Chawla


Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-21 Thread Michael Armbrust
Thanks for pointing this out, Michael.  Based on the conversation on the PR
 this
seems like a risky change to include in a release branch with a default
other than NEVER_INFER.

+Wenchen?  What do you think?

On Thu, Apr 20, 2017 at 4:14 PM, Michael Allman 
wrote:

> We've identified the cause of the change in behavior. It is related to the
> SQL conf key "spark.sql.hive.caseSensitiveInferenceMode". This key and
> its related functionality were absent from our previous build. The default
> setting in the current build was causing Spark to attempt to scan all table
> files during query analysis. Changing this setting to NEVER_INFER disabled
> this operation and resolved the issue we had.
>
> Michael
>
>
> On Apr 20, 2017, at 3:42 PM, Michael Allman  wrote:
>
> I want to caution that in testing a build from this morning's branch-2.1
> we found that Hive partition pruning was not working. We found that Spark
> SQL was fetching all Hive table partitions for a very simple query whereas
> in a build from several weeks ago it was fetching only the required
> partitions. I cannot currently think of a reason for the regression outside
> of some difference between branch-2.1 from our previous build and
> branch-2.1 from this morning.
>
> That's all I know right now. We are actively investigating to find the
> root cause of this problem, and specifically whether this is a problem in
> the Spark codebase or not. I will report back when I have an answer to that
> question.
>
> Michael
>
>
> On Apr 18, 2017, at 11:59 AM, Michael Armbrust 
> wrote:
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.1.1. The vote is open until Fri, April 21st, 2017 at 13:00 PST and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.1
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.1.1-rc3
> (2ed19cff2f6ab79a718526e5d16633412d8c4dd4)
>
> List of JIRA tickets resolved can be found with this filter
> 
> .
>
> The release files, including signatures, digests, etc. can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1230/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running it on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.1.1?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.2 or 2.2.0.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.0.
>
> *What happened to RC1?*
>
> There were issues with the release packaging and as a result it was skipped.
>
>
>
>


ML Repo using spark

2017-04-21 Thread Saikat Kanjilal
Folks,
I've been building out a large machine learning repository using Spark as the
compute platform, running on YARN and Hadoop. I was wondering if folks have some
best-practice-oriented thoughts around unit testing and integration testing this
application. I am using spark-submit and a configuration file to enable a dynamic
workflow so that we can build different ML repos for each of our models. The ML
repos consist of parquet files and eventually Hive tables. I want to be able to
unit test this application using scalatest or some other recommended utility. I
also want to integration test the application in our int environment;
specifically, we have a dev/int environment and eventually a prod environment,
consisting of Spark running on Hadoop using YARN.
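
For the unit-test piece, a minimal ScalaTest sketch of the pattern I have in mind (a local-mode SparkSession per suite; the class and test names are just illustrative):

```
// Illustrative ScalaTest suite: runs transformation logic against a local
// SparkSession so it can be wired into a per-checkin build.
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class FeatureBuilderSuite extends FunSuite with BeforeAndAfterAll {

  private lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("unit-tests")
    .getOrCreate()

  override def afterAll(): Unit = {
    spark.stop()
  }

  test("example transformation produces expected row count") {
    import spark.implicits._
    val input = Seq((1, "a"), (2, "b")).toDF("id", "name")
    assert(input.filter($"id" > 1).count() === 1)
  }
}
```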


The ideal workflow in my mind would be:
1) unit tests run upon every checkin in our dev environment
2) application gets propagated to our int environment
3) integration tests run successfully in our int environment
4) application gets propagated to our prod environment
5) Hive table/parquet file gets generated and consumed by Scala notebooks
running on top of the Spark cluster


**Caveat: I wasn't sure whether this was more appropriate for the dev or user
mailing list, but given that I only follow dev I sent it here.


Best Regards


Timestamp formatting in partitioned directory output: "YYYY-MM-dd HH%3Amm%3Ass" vs "YYYY-MM-ddTHH%3Amm%3Ass"

2017-04-21 Thread dataeng88

I have a feature requests or suggestion:

Spark 2.1 currently generates partitioned directory names like
"timestamp=2015-06-20 08%3A00%3A00"

I request + recommend that it uses the "T" delimiter between date and time
portions rather than a space character like,
"timestamp=2015-06-20T08%3A00%3A00".

Two reasons:
1) The official ISO-8601 formatting standard specifies a "T" delimiter. RFC
3339, which builds on ISO-8601, says that a space character is also
acceptable, but AFAIK that is not part of the official ISO-8601 spec.

2) URIs can't have spaces in them.
"s3://mybucket/data/timestamp=-MM-ddTHH%3A:mm:ss" is a valid URI, while
the space character variant is not. Spark is already doing URI escaping of
the "colon" characters with "%3A". Spark should use a URI compliant "T"
character rather than a space.

This also applies to reading existing data. If I load a data frame with
directory timestamp partitioning that uses the Spark standard space
delimiter between date and time, Spark will automatically recognize the
field as a timestamp. If the directory name uses the ISO-8601 standard "T"
delimiter between date and time, Spark will not recognize the field as a
timestamp but rather as a generic string.

Below is a short code snippet that can be pasted into spark-shell to
reproduce this issue

```
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import scala.collection.JavaConverters._
import java.time.LocalDateTime

val simpleSchema = StructType(
  StructField("id", IntegerType) ::
  StructField("name", StringType) ::
  StructField("value", StringType) ::
  StructField("timestamp", TimestampType) :: Nil)

val data = List(
  Row(1, "Alice", "C101", java.sql.Timestamp.valueOf(LocalDateTime.of(2015, 6, 20, 8, 0))),
  Row(2, "Bob", "C101", java.sql.Timestamp.valueOf(LocalDateTime.of(2015, 6, 20, 8, 0))),
  Row(3, "Bob", "C102", java.sql.Timestamp.valueOf(LocalDateTime.of(2015, 6, 20, 9, 0))),
  Row(4, "Bob", "C101", java.sql.Timestamp.valueOf(LocalDateTime.of(2015, 6, 21, 9, 0)))
)

val df = spark.createDataFrame(data.asJava, simpleSchema)
df.printSchema()
df.show()
df.write.partitionBy("timestamp").save("test/")
```

~ find test -type d
test
test/timestamp=2015-06-20 08%3A00%3A00
test/timestamp=2015-06-20 09%3A00%3A00
test/timestamp=2015-06-21 09%3A00%3A00
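
To see the type-inference difference described above, a small follow-on sketch that reads the same test/ output back:

```
// With the space-delimited directory names above, Spark infers the
// "timestamp" partition column as a timestamp; with a "T" delimiter it
// comes back as a plain string, per the behavior described earlier.
val readBack = spark.read.load("test/")
readBack.printSchema()
readBack.show(truncate = false)
```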




