[VOTE] Spark 2.1.2 (RC2)

2017-09-26 Thread Holden Karau
Please vote on releasing the following candidate as Apache Spark version
2.1.2. The vote is open until Wednesday October 4th at 23:59 PST and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.2
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v2.1.2-rc2 (fabbb7f59e47590114366d14e15fbbff8c88593c)

The list of JIRA tickets resolved in this release can be found with this filter.


The release files, including signatures, digests, etc. can be found at:
https://home.apache.org/~holden/spark-2.1.2-rc2-bin/

Release artifacts are signed with a key from:
https://people.apache.org/~holden/holdens_keys.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1251

The documentation corresponding to this release can be found at:
https://people.apache.org/~holden/spark-2.1.2-rc2-docs/


*FAQ*

*How can I help test this release?*

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env, install the
current RC, and see if anything important breaks. In Java/Scala, you can
add the staging repository to your project's resolvers and test with the
RC (make sure to clean up the artifact cache before/after so you don't end
up building with an out-of-date RC going forward).
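For the Java/Scala route, a minimal sbt sketch could look like the
following; only the staging URL comes from this email, while the resolver
name, Scala version, and the choice of spark-sql as the dependency are
illustrative assumptions:

    // build.sbt -- illustrative only; adjust to your own test project
    scalaVersion := "2.11.8"  // Spark 2.1.x is built against Scala 2.11

    // point sbt at the RC2 staging repository from this vote thread
    resolvers += "Apache Spark 2.1.2 RC2 staging" at "https://repository.apache.org/content/repositories/orgapachespark-1251"

    // example dependency on the RC; other spark-* modules resolve the same way
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.2"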

*What should happen to JIRA tickets still targeting 2.1.2?*

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Everything else please retarget to 2.1.3.

*But my bug isn't fixed!??!*

In order to make timely releases, we will typically not hold the release
unless the bug in question is a regression from 2.1.1. That being said, if
there is something which is a regression from 2.1.1 that has not been
correctly targeted, please ping a committer to help target the issue (you
can see the open issues listed as impacting Spark 2.1.1 & 2.1.2).

*What are the unresolved issues targeted for 2.1.2?*

At this time there are no open unresolved issues.

*Is there anything different about this release?*

This is the first release in a while not built on the AMPLAB Jenkins. This
is good because it means future releases can more easily be built and
signed securely (and I've been updating the documentation in
https://github.com/apache/spark-website/pull/66 as I progress); however,
the chances of a mistake are higher with any change like this. If there is
something you normally take for granted as correct when checking a release,
please double check this time :)

*Should I be committing code to branch-2.1?*

Thanks for asking! Please treat this stage in the RC process as "code
freeze", so bug fixes only. If you're uncertain whether something should be
backported, please reach out. If you do commit to branch-2.1, please tag
your JIRA issue's fix version as 2.1.3, and if we cut another RC I'll move
the 2.1.3 fixes into 2.1.2 as appropriate.

*Why the longer voting window?*

Since there is a large industry big data conference this week I figured I'd
add a little bit of extra buffer time just to make sure everyone has a
chance to take a look.

-- 
Twitter: https://twitter.com/holdenkarau


Re: Should Flume integration be behind a profile?

2017-09-26 Thread Mridul Muralidharan
Sounds good to me.
+1


Regards,
Mridul


On Tue, Sep 26, 2017 at 2:36 AM, Sean Owen  wrote:
> Not a big deal, but I'm wondering whether Flume integration should at least
> be opt-in and behind a profile? it still sees some use (at least on our end)
> but not applicable to the majority of users. Most other third-party
> framework integrations are behind a profile, like YARN, Mesos, Kinesis,
> Kafka 0.8, Docker. Just soliciting comments, not arguing for it.
>
> (Well, actually it annoys me that the Flume integration always fails to
> compile in IntelliJ unless you generate the sources manually)




Re: [discuss] Data Source V2 write path

2017-09-26 Thread Wenchen Fan
I'm trying to give a summary:

Ideally the data source API should only deal with data, not metadata. But
one key problem is that Spark still needs to support data sources without a
metastore, e.g. file format data sources.

For this kind of data source, users have to pass metadata like
partitioning/bucketing to every write action of a "table" (or another
identifier, like the path of a file format data source), and it's the
user's responsibility to make sure this metadata is consistent. If it's
inconsistent, the behavior is undefined; different data sources may behave
differently.

If we agree on this, then the data source write API should have a way to
pass this metadata, and I think using data source options is a good choice
because it's the most implicit way and doesn't require new APIs.

But then we have another problem: how do we define the behavior for data
sources with a metastore when the given options contain metadata? A typical
case is `DataFrameWriter.saveAsTable`: when a user calls it with partition
columns, they don't know what will happen. The table may not exist, and
they may create it successfully with the specified partition columns, or
the table may already exist with inconsistent partition columns and Spark
throws an exception. Besides, save mode doesn't play well in this case, as
we may need different save modes for data and metadata.
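For illustration only (the table and column names are made up, and `spark`
is assumed to be an existing SparkSession), the ambiguous case looks
roughly like:

    // Sketch of the ambiguity described above -- this is the existing call, not a proposal.
    val df = spark.range(100).selectExpr("id", "cast(id % 7 as string) as date")
    // If "events" doesn't exist, this may create it partitioned by "date";
    // if it already exists with different partition columns, Spark throws instead,
    // and the save mode only describes what happens to the data, not the metadata.
    df.write
      .mode("append")
      .partitionBy("date")
      .saveAsTable("events")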

My proposal: the data source API should only focus on data, but concrete
data sources can implement some dirty features via options. E.g. file
format data sources can take partitioning/bucketing from options, and a
data source with a metastore can use a special flag in options to indicate
a create table command (without writing data).
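A rough sketch of what that could look like for a file format source; the
option keys below are invented for illustration and are not an existing or
agreed-on API:

    // Hypothetical: "partitionColumns", "bucketColumns" and "numBuckets" are made-up
    // option keys standing in for the metadata a file format source could accept.
    val df = spark.range(100).selectExpr("id", "cast(id % 12 as int) as month")
    df.write
      .format("parquet")
      .option("partitionColumns", "month")
      .option("bucketColumns", "id")
      .option("numBuckets", "8")
      .save("/tmp/events")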

In other words, Spark connects users to data sources with a clean protocol
that only focuses on data, but this protocol has a backdoor: the data
source options. Concrete data sources are free to define how to deal with
metadata; e.g. a Cassandra data source can ask users to create the table on
the Cassandra side first and then write data on the Spark side, or ask
users to provide more details in options and do CTAS on the Spark side.
These can all be done via options.
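Purely as a hypothetical illustration of that backdoor (the format name and
option keys below are made up, not an existing connector API):

    // Hypothetical metastore-backed source driven entirely through options;
    // "createTableOnly" stands in for the "special flag" mentioned above,
    // and df is the DataFrame from the previous sketch.
    df.write
      .format("com.example.metastore.Source")
      .option("table", "events")
      .option("createTableOnly", "true")
      .save()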

After catalog federation, hopefully only file format data sources will
still use this backdoor.


On Tue, Sep 26, 2017 at 8:52 AM, Wenchen Fan  wrote:

> > I think it is a bad idea to let this problem leak into the new storage
> API.
>
> Well, I think using data source options is a good compromise for this. We
> can't avoid this problem until catalog federation is done, and this may not
> happen within Spark 2.3, but we definitely need data source write API in
> Spark 2.3.
>
> > Why can't we use an in-memory catalog to store the configuration of
> HadoopFS tables?
>
> We still need to support existing Spark applications which have
> `df.write.partitionBy(...).parquet(...)`. And I think it's similar to
> `DataFrameWriter.path`: according to your theory, we should not leak
> `path` into the storage API either, but we don't have other solutions for
> Hadoop FS data sources.
>
>
> Eventually I think only Hadoop FS data sources need to take these special
> options, but for now data sources that want to support
> partitioning/bucketing need to take these special options too.
>
>
> On Tue, Sep 26, 2017 at 4:36 AM, Ryan Blue  wrote:
>
>> I think it is a bad idea to let this problem leak into the new storage
>> API. By not setting the expectation that metadata for a table will exist,
>> this will needlessly complicate writers just to support the existing
>> problematic design. Why can't we use an in-memory catalog to store the
>> configuration of HadoopFS tables? I see no compelling reason why this needs
>> to be passed into the V2 write API.
>>
>> If this is limited to an implementation hack for the Hadoop FS writers,
>> then I guess that's not terrible. I just don't understand why it is
>> necessary.
>>
>> On Mon, Sep 25, 2017 at 11:26 AM, Wenchen Fan 
>> wrote:
>>
>>> Catalog federation is to publish the Spark catalog API (kind of a data
>>> source API for metadata), so that Spark is able to read/write metadata
>>> from external systems. (SPARK-15777)
>>>
>>> Currently Spark can only read/write Hive metastore, which means for
>>> other systems like Cassandra, we can only implicitly create tables with
>>> data source API.
>>>
>>> Again, this is not ideal but just a workaround before we finish catalog
>>> federation. That's why the save mode description mostly refers to how
>>> data will be handled instead of metadata.
>>>
>>> Because of this, I think we still need to pass metadata like
>>> partitioning/bucketing to the data source write API. And I propose to use
>>> data source options so that it's not at API level and we can easily ignore
>>> these options in the future if catalog federation is done.
>>>
>>> The same thing applies to Hadoop FS data sources, we need to pass
>>> metadata to the writer anyway.
>>>
>>>
>>>
>>> On Tue, 

Re: Should Flume integration be behind a profile?

2017-09-26 Thread Ryan Blue
+1 for a Flume profile.

On Tue, Sep 26, 2017 at 2:36 AM, Sean Owen  wrote:

> Not a big deal, but I'm wondering whether Flume integration should at
> least be opt-in and behind a profile? it still sees some use (at least on
> our end) but not applicable to the majority of users. Most other
> third-party framework integrations are behind a profile, like YARN, Mesos,
> Kinesis, Kafka 0.8, Docker. Just soliciting comments, not arguing for it.
>
> (Well, actually it annoys me that the Flume integration always fails to
> compile in IntelliJ unless you generate the sources manually)
>



-- 
Ryan Blue
Software Engineer
Netflix


Re: [Spark Core] Custom Catalog. Integration between Apache Ignite and Apache Spark

2017-09-26 Thread Николай Ижиков
Hello, Xin.

Thank you for an answer.

Are there any plans to make the catalog API public?
Any specific release versions or dates?


2017-09-25 20:54 GMT+03:00 Reynold Xin :

> It's probably just an indication of lack of interest (or at least there
> isn't a substantial overlap between Ignite users and Spark users). A new
> catalog implementation is also pretty fundamental to Spark and the bar for
> that would be pretty high. See my comment in SPARK-17767.
>
> Guys - while I think this is very useful to do, I'm going to mark this as
> "later" for now. The reason is that there are a lot of things to consider
> before making this switch, including:
>
>- The ExternalCatalog API is currently internal, and we can't just
>make it public without thinking about the consequences and whether this API
>is maintainable in the long run.
>- SPARK-15777: We need to design this in the context of catalog federation
>and persistence.
>- SPARK-15691: Refactoring of how we integrate with Hive.
>
> This is not as simple as just submitting a PR to make it pluggable.
>
> On Mon, Sep 25, 2017 at 10:50 AM, Николай Ижиков 
> wrote:
>
>> Guys.
>>
>> Did I miss something and write a completely wrong mail?
>> Can you give me some feedback?
>> What have I missed in my propositions?
>>
>> 2017-09-19 15:39 GMT+03:00 Nikolay Izhikov :
>>
>>> Guys,
>>>
>>> Anyone had a chance to look at my message?
>>>
>>> 15.09.2017 15:50, Nikolay Izhikov wrote:
>>>
>>> Hello, guys.

 I'm a contributor to the Apache Ignite project, which is self-described as
 an in-memory computing platform.

 It has Data Grid features: a distributed, transactional key-value store
 [1], Distributed SQL support [2], etc. [3]

 Currently, I'm working on integration between Ignite and Spark [4].
 I want to add support for the Spark Data Frame API to Ignite.

 Since Ignite is a distributed store, it would be useful to create an
 implementation of Catalog [5] for Apache Ignite.

 I see two ways to implement this feature:

  1. Spark can provide an API for any custom catalog implementation. As
 far as I can see there is a ticket for it [6]. It is closed with resolution
 "Later". Is it a suitable time to continue working on the ticket? How can I
 help with it?

  2. I can provide an implementation of Catalog and the other required
 API in the form of a pull request to Spark, as was done for Hive [7].
 Would such a pull request be acceptable?

 Which way is more convenient for the Spark community?

 [1] https://ignite.apache.org/features/datagrid.html
 [2] https://ignite.apache.org/features/sql.html
 [3] https://ignite.apache.org/features.html
 [4] https://issues.apache.org/jira/browse/IGNITE-3084
 [5] https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
 [6] https://issues.apache.org/jira/browse/SPARK-17767
 [7] https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala

>>>
>>
>>
>> --
>> Nikolay Izhikov
>> nizhikov@gmail.com
>>
>
>


-- 
Nikolay Izhikov
nizhikov@gmail.com


Should Flume integration be behind a profile?

2017-09-26 Thread Sean Owen
Not a big deal, but I'm wondering whether Flume integration should at least
be opt-in and behind a profile? It still sees some use (at least on our
end) but is not applicable to the majority of users. Most other third-party
framework integrations are behind a profile, like YARN, Mesos, Kinesis,
Kafka 0.8, Docker. Just soliciting comments, not arguing for it.

(Well, actually it annoys me that the Flume integration always fails to
compile in IntelliJ unless you generate the sources manually)