Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread shane knapp
++ssuchter (who kindly set up the initial k8s builds while i hammered on
the backend)

while i'm pretty confident (read: 99%) that the pull request builds will
work on the new ubuntu workers:

1) i'd like to do more stress testing of other spark builds (in progress)
2) i'd like to reimage more centos workers before moving the PRB due to
potential executor starvation, and my lead sysadmin is out until next monday
3) we will need to get rid of the ubuntu-specific k8s builds and merge that
functionality into the existing PRB job.  after that:  testing and
babysitting

regarding (1):  if these damn builds didn't take 4+ hours, it would be
going a lot quicker.  ;)
regarding (2):  adding two more ubuntu workers would make me comfortable
WRT the number of available executors, and i will guarantee that can happen
by EOD on the 7th.
regarding (3):  this should take about a day, and realistically the
earliest we can get this started is the 8th.  i haven't even had a chance
to start looking at this stuff yet, either.

if we push the release by a week, i think i can get things sorted w/o
impacting the release schedule.  there will still be a bunch of stuff to
clean up from the old centos builds (specifically docs, packaging and
release), but i'll leave the existing and working infrastructure in place
for now.

shane

On Wed, Aug 1, 2018 at 4:39 PM, Erik Erlandson  wrote:

> The PR for SparkR support on the kube back-end is completed, but waiting
> for Shane to make some tweaks to the CI machinery for full testing support.
> If the code freeze is being delayed, this PR could be merged as well.
>
> On Fri, Jul 6, 2018 at 9:47 AM, Reynold Xin  wrote:
>
>> FYI, it's been almost 6 months since the last release. We will cut the
>> branch and freeze the code on Aug 1st in order to get 2.4 out on time.
>>
>>
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread Erik Erlandson
The PR for SparkR support on the kube back-end is completed, but waiting
for Shane to make some tweaks to the CI machinery for full testing support.
If the code freeze is being delayed, this PR could be merged as well.

On Fri, Jul 6, 2018 at 9:47 AM, Reynold Xin  wrote:

> FYI, it's been almost 6 months since the last release. We will cut the
> branch and freeze the code on Aug 1st in order to get 2.4 out on time.
>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread Erik Erlandson
I agree that looking at it from the pov of "code paths where isBarrier
tests were introduced" seems right.

From pr-21758 (the one already merged) there are 13 files touched under
core/src/main/scala/org/apache/spark/scheduler/, although most of those
appear to be relatively small edits. The "big" modifications are
concentrated in Task.scala and TaskSchedulerImpl.scala. The follow-up
pr-21898 touches a subset of those.

The project-hydrogen epic for "barrier execution", SPARK-24374, contains 22
sub-issues, most of which are still open. Some are marked for future release
cycles; is there a specific set being proposed for 2.4? The various back-end
supports look tagged for subsequent release cycles: is the 2.4 scope
standalone clusters?

CI will obviously exercise standard task scheduling code paths, which
indicates some level of stability.  Folks on the k8s big data SIG today
were interested in building test distributions for the barrier-related
features. I was reflecting that although the spark-on-kube fork was awkward
in some ways, it did provide a unified distribution that interested
community members could build, download and/or run. Project hydrogen is
currently incarnated as a set of PRs, but a unified test build that
included pr-21758 and pr-21898 (and others?) would be cool. I've never seen
an ideal workflow for handling multi-PR development efforts.


On Wed, Aug 1, 2018 at 1:43 PM, Imran Rashid  wrote:

> I still would like to do more review of the barrier mode changes, but from
> what I've seen so far I agree. I dunno if it'll really be ready for use,
> but it should not pose much risk for code that doesn't touch the new
> features. Of course, every change has some risk, especially in the
> scheduler, which has proven to be very brittle (I've written plenty of
> scheduler bugs while fixing other things myself).
>
> On Wed, Aug 1, 2018 at 1:13 PM, Xingbo Jiang 
> wrote:
>
>> Speaking of the code from the hydrogen PRs, we didn't actually remove any
>> of the existing logic, and I tried my best to hide almost all of the newly
>> added logic behind an `isBarrier` tag (or something similar). I had to add
>> some new variables and new methods to the core code paths, but I think
>> they should not be hit if you are not running barrier workloads.
>>
>> The only significant change I can think of is that I swapped the order of
>> failure handling in DAGScheduler, moving the `case FetchFailed` block to
>> before the `case Resubmitted` block, but again I don't think this should
>> affect a regular workload, because a task can only hit one failure type
>> anyway.
>>
>> I also reviewed the previous PRs adding Spark on K8s support, and I feel
>> it's a good example of how to add new features to a project without
>> breaking existing workloads; I'm trying to follow that approach in adding
>> barrier execution mode support.
>>
>> I really appreciate any review of the hydrogen PRs and welcome comments to
>> help improve the feature, thanks!
>>
>> 2018-08-01 4:19 GMT+08:00 Reynold Xin :
>>
>>> I actually totally agree that we should make sure it has no impact on
>>> existing code if the feature is not used.
>>>
>>>
>>> On Tue, Jul 31, 2018 at 1:18 PM Erik Erlandson 
>>> wrote:
>>>
 I don't have comprehensive knowledge of the project hydrogen PRs; however,
 I've perused them, and they make substantial modifications to Spark's core
 DAG scheduler code.

 What I'm wondering is: how high is the confidence level that the
 "traditional" code paths are still stable? Put another way, is it even
 possible to "turn off" or "opt out" of this experimental feature? This
 analogy isn't perfect, but for example the k8s back-end is a major body of
 code, yet it has a very small impact on any *core* code paths, and so if
 you opt out of it, it is well understood that you aren't running any
 experimental code.

 Looking at the project hydrogen code, I'm less sure the same is true.
 However, maybe there is a clear way to show that it is.


 On Tue, Jul 31, 2018 at 12:03 PM, Mark Hamstra wrote:

> No reasonable amount of time is likely going to be sufficient to fully
> vet the code as a PR. I'm not entirely happy with the design and code as
> they currently are (and I'm still trying to find the time to more publicly
> express my thoughts and concerns), but I'm fine with them going into 2.4
> much as they are as long as they go in with proper stability annotations
> and are understood not to be cast-in-stone final implementations, but
> rather as a way to get people using them and generating the feedback that
> is necessary to get us to something more like a final design and
> 

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread Imran Rashid
I still would like to do more review of the barrier mode changes, but from
what I've seen so far I agree. I dunno if it'll really be ready for use, but
it should not pose much risk for code that doesn't touch the new features.
Of course, every change has some risk, especially in the scheduler, which
has proven to be very brittle (I've written plenty of scheduler bugs while
fixing other things myself).

On Wed, Aug 1, 2018 at 1:13 PM, Xingbo Jiang  wrote:

> Speaking of the code from the hydrogen PRs, we didn't actually remove any
> of the existing logic, and I tried my best to hide almost all of the newly
> added logic behind an `isBarrier` tag (or something similar). I had to add
> some new variables and new methods to the core code paths, but I think
> they should not be hit if you are not running barrier workloads.
>
> The only significant change I can think of is that I swapped the order of
> failure handling in DAGScheduler, moving the `case FetchFailed` block to
> before the `case Resubmitted` block, but again I don't think this should
> affect a regular workload, because a task can only hit one failure type
> anyway.
>
> I also reviewed the previous PRs adding Spark on K8s support, and I feel
> it's a good example of how to add new features to a project without
> breaking existing workloads; I'm trying to follow that approach in adding
> barrier execution mode support.
>
> I really appreciate any review of the hydrogen PRs and welcome comments to
> help improve the feature, thanks!
>
> 2018-08-01 4:19 GMT+08:00 Reynold Xin :
>
>> I actually totally agree that we should make sure it has no impact on
>> existing code if the feature is not used.
>>
>>
>> On Tue, Jul 31, 2018 at 1:18 PM Erik Erlandson 
>> wrote:
>>
>>> I don't have comprehensive knowledge of the project hydrogen PRs;
>>> however, I've perused them, and they make substantial modifications to
>>> Spark's core DAG scheduler code.
>>>
>>> What I'm wondering is: how high is the confidence level that the
>>> "traditional" code paths are still stable? Put another way, is it even
>>> possible to "turn off" or "opt out" of this experimental feature? This
>>> analogy isn't perfect, but for example the k8s back-end is a major body
>>> of code, yet it has a very small impact on any *core* code paths, and so
>>> if you opt out of it, it is well understood that you aren't running any
>>> experimental code.
>>>
>>> Looking at the project hydrogen code, I'm less sure the same is true.
>>> However, maybe there is a clear way to show that it is.
>>>
>>>
>>> On Tue, Jul 31, 2018 at 12:03 PM, Mark Hamstra 
>>> wrote:
>>>
 No reasonable amount of time is likely going to be sufficient to fully
 vet the code as a PR. I'm not entirely happy with the design and code as
 they currently are (and I'm still trying to find the time to more publicly
 express my thoughts and concerns), but I'm fine with them going into 2.4
 much as they are as long as they go in with proper stability annotations
 and are understood not to be cast-in-stone final implementations, but
 rather as a way to get people using them and generating the feedback that
 is necessary to get us to something more like a final design and
 implementation.

 On Tue, Jul 31, 2018 at 11:54 AM Erik Erlandson 
 wrote:

>
> Barrier mode seems like a high impact feature on Spark's core code: is
> one additional week enough time to properly vet this feature?
>
> On Tue, Jul 31, 2018 at 7:10 AM, Joseph Torres <
> joseph.tor...@databricks.com> wrote:
>
>> Full continuous processing aggregation support ran into unanticipated
>> scalability and scheduling problems. We’re planning to overcome those by
>> using some of the barrier execution machinery, but since barrier 
>> execution
>> itself is still in progress the full support isn’t going to make it into
>> 2.4.
>>
>> Jose
>>
>> On Tue, Jul 31, 2018 at 6:07 AM Tomasz Gawęda <
>> tomasz.gaw...@outlook.com> wrote:
>>
>>> Hi,
>>>
>>> what is the status of Continuous Processing + Aggregations? As far as I
>>> remember, Jose Torres said it should be easy to perform aggregations if
>>> coalesce(1) works. IIRC it's already merged to master.
>>>
>>> Is this work in progress? If yes, it would be great to have full
>>> aggregation/join support in Spark 2.4 in CP.
>>>
>>> Pozdrawiam / Best regards,
>>>
>>> Tomek
>>>
>>>
>>> On 2018-07-31 10:43, Petar Zečević wrote:
>>> > This one is important to us:
>>> > https://issues.apache.org/jira/browse/SPARK-24020 (Sort-merge join
>>> > inner range optimization) but I think it could be useful to others too.
>>> >
>>> > It is finished and is ready to be merged (was ready a month ago at
>>> > least).
>>> >
>>> > Do you think you could consider including it in 2.4?
>>> >
>>> > Petar
>>> >
>>> >

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread Xingbo Jiang
Speaking of the code from the hydrogen PRs, we didn't actually remove any
of the existing logic, and I tried my best to hide almost all of the newly
added logic behind an `isBarrier` tag (or something similar). I had to add
some new variables and new methods to the core code paths, but I think they
should not be hit if you are not running barrier workloads.

The only significant change I can think of is that I swapped the order of
failure handling in DAGScheduler, moving the `case FetchFailed` block to
before the `case Resubmitted` block, but again I don't think this should
affect a regular workload, because a task can only hit one failure type
anyway.
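
To illustrate the gating pattern, a minimal Scala sketch (not the actual
Spark scheduler source; the types and names here are hypothetical):

  sealed trait TaskFailure
  case class FetchFailed(mapId: Long) extends TaskFailure
  case object Resubmitted extends TaskFailure

  def handleFailure(failure: TaskFailure, isBarrier: Boolean): Unit =
    failure match {
      // FetchFailed is matched first, mirroring the reordering described
      // above; a task only ever carries one failure type, so regular
      // workloads see the same handling as before.
      case FetchFailed(mapId) =>
        if (isBarrier) {
          // barrier-only path: fail the whole stage so that all tasks in
          // the barrier stage are retried together
        } else {
          // pre-existing path: recompute the lost map output
        }
      case Resubmitted =>
        // pre-existing resubmission handling, unchanged
    }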

I also reviewed the previous PRs adding Spark on K8s support, and I feel
it's a good example of how to add new features to a project without
breaking existing workloads; I'm trying to follow that approach in adding
barrier execution mode support.

I really appreciate any review of the hydrogen PRs and welcome comments to
help improve the feature, thanks!

2018-08-01 4:19 GMT+08:00 Reynold Xin :

> I actually totally agree that we should make sure it has no impact on
> existing code if the feature is not used.
>
>
> On Tue, Jul 31, 2018 at 1:18 PM Erik Erlandson 
> wrote:
>
>> I don't have comprehensive knowledge of the project hydrogen PRs;
>> however, I've perused them, and they make substantial modifications to
>> Spark's core DAG scheduler code.
>>
>> What I'm wondering is: how high is the confidence level that the
>> "traditional" code paths are still stable? Put another way, is it even
>> possible to "turn off" or "opt out" of this experimental feature? This
>> analogy isn't perfect, but for example the k8s back-end is a major body of
>> code, yet it has a very small impact on any *core* code paths, and so if
>> you opt out of it, it is well understood that you aren't running any
>> experimental code.
>>
>> Looking at the project hydrogen code, I'm less sure the same is true.
>> However, maybe there is a clear way to show that it is.
>>
>>
>> On Tue, Jul 31, 2018 at 12:03 PM, Mark Hamstra 
>> wrote:
>>
>>> No reasonable amount of time is likely going to be sufficient to fully
>>> vet the code as a PR. I'm not entirely happy with the design and code as
>>> they currently are (and I'm still trying to find the time to more publicly
>>> express my thoughts and concerns), but I'm fine with them going into 2.4
>>> much as they are as long as they go in with proper stability annotations
>>> and are understood not to be cast-in-stone final implementations, but
>>> rather as a way to get people using them and generating the feedback that
>>> is necessary to get us to something more like a final design and
>>> implementation.
>>>
>>> On Tue, Jul 31, 2018 at 11:54 AM Erik Erlandson 
>>> wrote:
>>>

 Barrier mode seems like a high impact feature on Spark's core code: is
 one additional week enough time to properly vet this feature?

 On Tue, Jul 31, 2018 at 7:10 AM, Joseph Torres <
 joseph.tor...@databricks.com> wrote:

> Full continuous processing aggregation support ran into unanticipated
> scalability and scheduling problems. We’re planning to overcome those by
> using some of the barrier execution machinery, but since barrier execution
> itself is still in progress the full support isn’t going to make it into
> 2.4.
>
> Jose
>
> On Tue, Jul 31, 2018 at 6:07 AM Tomasz Gawęda <
> tomasz.gaw...@outlook.com> wrote:
>
>> Hi,
>>
>> what is the status of Continuous Processing + Aggregations? As far as I
>> remember, Jose Torres said it should be easy to perform aggregations if
>> coalesce(1) works. IIRC it's already merged to master.
>>
>> Is this work in progress? If yes, it would be great to have full
>> aggregation/join support in Spark 2.4 in CP.
>>
>> Pozdrawiam / Best regards,
>>
>> Tomek
>>
>>
>> On 2018-07-31 10:43, Petar Zečević wrote:
>> > This one is important to us:
>> > https://issues.apache.org/jira/browse/SPARK-24020 (Sort-merge join
>> > inner range optimization) but I think it could be useful to others too.
>> >
>> > It is finished and is ready to be merged (was ready a month ago at
>> > least).
>> >
>> > Do you think you could consider including it in 2.4?
>> >
>> > Petar
>> >
>> >
>> > Wenchen Fan @ 1970-01-01 01:00 CET:
>> >
>> >> I went through the open JIRA tickets and here is a list that we
>> >> should consider for Spark 2.4:
>> >>
>> >> High Priority:
>> >> SPARK-24374: Support Barrier Execution Mode in Apache Spark
>> >> This one is critical to the Spark ecosystem for deep learning. It only
>> >> has a few remaining work items and I think we should have it in Spark
>> >> 2.4.
>> >>
>> >> Middle Priority:
>> >> SPARK-23899: Built-in SQL Function Improvement
>> >> We've already added a lot of built-in functions 

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread Xiangrui Meng
Sorry for the late response on the Hydrogen discussions! I was traveling
last week.

On Tue, Jul 31, 2018 at 1:20 PM Reynold Xin  wrote:

> I actually totally agree that we should make sure it has no impact on
> existing code if the feature is not used.
>
>
> On Tue, Jul 31, 2018 at 1:18 PM Erik Erlandson 
> wrote:
>
>> I don't have comprehensive knowledge of the project hydrogen PRs;
>> however, I've perused them, and they make substantial modifications to
>> Spark's core DAG scheduler code.
>>
>> What I'm wondering is: how high is the confidence level that the
>> "traditional" code paths are still stable? Put another way, is it even
>> possible to "turn off" or "opt out" of this experimental feature? This
>> analogy isn't perfect, but for example the k8s back-end is a major body of
>> code, yet it has a very small impact on any *core* code paths, and so if
>> you opt out of it, it is well understood that you aren't running any
>> experimental code.
>>
>> Looking at the project hydrogen code, I'm less sure the same is true.
>> However, maybe there is a clear way to show that it is.
>>
>>
Totally agree that the barrier execution mode must not change any existing
behavior if barriers are not used. Most of the code added to DAGScheduler
and TaskSetManager only applies to the barrier mode, and we paid special
attention to the rest during review. That said, I won't say the risk is
zero. We will do comprehensive QA after the feature freeze, and it would be
great if more community members could help.

Btw, I don't think a feature flag would help reduce the risk. This is a
brand-new feature, not an alternative to an existing one, so turning it off
is basically "do not call barrier()".
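
To make that concrete, here is a minimal Scala sketch of what opting in
looks like under the API proposed in SPARK-24374 (a sketch only; the exact
API shape was still settling at this point):

  import org.apache.spark.{BarrierTaskContext, SparkContext}

  val sc = new SparkContext("local[4]", "barrier-sketch")
  // An ordinary mapPartitions never touches the new code paths; only
  // rdd.barrier() marks the stage as a barrier stage.
  val doubled = sc.parallelize(1 to 100, 4)
    .barrier()
    .mapPartitions { iter =>
      val ctx = BarrierTaskContext.get()
      ctx.barrier() // all tasks in the stage rendezvous here
      iter.map(_ * 2)
    }
    .collect()
  sc.stop()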


>
>> On Tue, Jul 31, 2018 at 12:03 PM, Mark Hamstra 
>> wrote:
>>
>>> No reasonable amount of time is likely going to be sufficient to fully
>>> vet the code as a PR. I'm not entirely happy with the design and code as
>>> they currently are (and I'm still trying to find the time to more publicly
>>> express my thoughts and concerns), but I'm fine with them going into 2.4
>>> much as they are as long as they go in with proper stability annotations
>>> and are understood not to be cast-in-stone final implementations, but
>>> rather as a way to get people using them and generating the feedback that
>>> is necessary to get us to something more like a final design and
>>> implementation.
>>>
>>>
All barrier execution mode features will be marked experimental in 2.4. As
you mentioned, the goal is to get some usage and collect feedback so that we
have a robust, stable version in 3.0. Mark, it would be great if you could
provide input and help with the final design. Your time would be greatly
appreciated!


> On Tue, Jul 31, 2018 at 11:54 AM Erik Erlandson 
>>> wrote:
>>>

 Barrier mode seems like a high impact feature on Spark's core code: is
 one additional week enough time to properly vet this feature?

 On Tue, Jul 31, 2018 at 7:10 AM, Joseph Torres <
 joseph.tor...@databricks.com> wrote:

> Full continuous processing aggregation support ran into unanticipated
> scalability and scheduling problems. We’re planning to overcome those by
> using some of the barrier execution machinery, but since barrier execution
> itself is still in progress the full support isn’t going to make it into
> 2.4.
>
> Jose
>
> On Tue, Jul 31, 2018 at 6:07 AM Tomasz Gawęda <
> tomasz.gaw...@outlook.com> wrote:
>
>> Hi,
>>
>> what is the status of Continuous Processing + Aggregations? As far as I
>> remember, Jose Torres said it should be easy to perform aggregations if
>> coalesce(1) works. IIRC it's already merged to master.
>>
>> Is this work in progress? If yes, it would be great to have full
>> aggregation/join support in Spark 2.4 in CP.
>>
>> Pozdrawiam / Best regards,
>>
>> Tomek
>>
>>
>> On 2018-07-31 10:43, Petar Zečević wrote:
>> > This one is important to us:
>> > https://issues.apache.org/jira/browse/SPARK-24020 (Sort-merge join
>> > inner range optimization) but I think it could be useful to others too.
>> >
>> > It is finished and is ready to be merged (was ready a month ago at
>> > least).
>> >
>> > Do you think you could consider including it in 2.4?
>> >
>> > Petar
>> >
>> >
>> > Wenchen Fan @ 1970-01-01 01:00 CET:
>> >
>> >> I went through the open JIRA tickets and here is a list that we
>> >> should consider for Spark 2.4:
>> >>
>> >> High Priority:
>> >> SPARK-24374: Support Barrier Execution Mode in Apache Spark
>> >> This one is critical to the Spark ecosystem for deep learning. It only
>> >> has a few remaining work items and I think we should have it in Spark
>> >> 2.4.
>> >>
>> >> Middle Priority:
>> >> SPARK-23899: Built-in SQL Function Improvement
>> >> We've already added a lot of built-in functions in this release,

Re: [build system] DOWNTIME jenkins unreachable overnight

2018-08-01 Thread shane knapp
the UPS has been replaced, and you can now access the wonderful entity
known as jenkins via the internet superhighway!

shane (who only really showed up early to work and didn't actually help
replace said UPS)

On Tue, Jul 31, 2018 at 5:14 PM, shane knapp  wrote:

> our building is finally replacing the broken UPS that keeps biting us...
>
> ...which means another bit of downtime.  :(
>
> it begins in 6 hours (11pm PDT) and will be finished tomorrow (august 1st)
> by ~8am PDT.
>
> shane
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>



-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [DISCUSS] Multiple catalog support

2018-08-01 Thread Wenchen Fan
For the first question: this is what we already support. A data source can
implement `ReadSupportProvider` (based on my API improvement) so that it
can create a `ReadSupport` by reflection. I agree with you that most data
sources would implement `TableCatalog` instead in the future.

For the second question: overall it looks good. One open issue is whether
we should also generalize the Hive STORED AS syntax.

For the third question: I agree we should figure out the expected behavior
first.
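
For concreteness, a rough Scala sketch of the catalog-first split discussed
in the quoted message below (the names follow this thread; the exact
signatures are hypothetical, not a settled API):

  trait ReadSupport
  trait WriteSupport

  // A catalog-managed table bundles its metadata with the read/write
  // implementations, so users never have to name the implementation class.
  trait Table {
    def name: String
    def readSupport(): ReadSupport
    def writeSupport(): WriteSupport
  }

  trait TableCatalog {
    // Spark resolves the catalog first; the catalog decides which
    // implementation backs the table, e.g. from a USING/format hint
    // passed through the table properties.
    def loadTable(ident: String): Table
    def createTable(ident: String, properties: Map[String, String]): Table
  }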

On Wed, Aug 1, 2018 at 4:47 AM Ryan Blue  wrote:

> Wenchen, I think the misunderstanding is around how the v2 API should work
> with multiple catalogs.
>
> Data sources are read/write implementations that resolve to a single JVM
> class. When we consider how these implementations should work with multiple
> table catalogs, I think it is clear that the catalog needs to be able to
> choose the implementation and should be able to share implementations
> across catalogs. Those requirements are incompatible with the idea that
> Spark should get a catalog from the data source.
>
> An easy way to think about this is the Parquet example from my earlier
> email. *Why would using format("parquet") determine the catalog where a
> table is created?*
>
> The conclusion I came to is that to support CTAS and other operations that
> require a catalog, Spark should determine that catalog first, not the
> storage implementation (data source) first. The catalog should return a
> Table that implements ReadSupport and WriteSupport. The actual
> implementation class doesn’t need to be chosen by users.
>
> That leaves a few open questions.
>
> First open question: *How can we support reading tables without metadata?*
> This is your load example: df.read.format("xyz").option(...).load.
>
> I think we should continue to use the DataSource v1 loader to load a
> DataSourceV2, then define a way for that to return a Table with ReadSupport
> and WriteSupport, like this:
>
> interface DataSourceV2 {
>   public Table anonymousTable(Map<String, String> tableOptions);
> }
>
> While I agree that these tables without metadata should be supported, many
> of the current uses are actually working around missing multi-catalog
> support. JDBC is a good example. You have to point directly to a JDBC table
> using the source and options because we don’t have a way to connect to JDBC
> as a catalog. If we make catalog definition easy, then we can support CTAS
> to JDBC, make it simpler to load several tables in the same remote
> database, etc. This would also improve working with persistent JDBC tables
> because it would connect to the source of truth for table metadata instead
> of copying it into the ExternalCatalog from the Spark session.
>
> In other words, the case we should be primarily targeting is catalog-based
> tables, not tables without metadata.
>
> Second open question: *How should the format method and USING clause
> work?*
>
> I think these should be passed to the catalog and the catalog can decide
> what to do. Formats like “parquet” and “json” are currently replaced with a
> concrete Java class, so there’s precedent for these as information for the
> catalog and not concrete implementations. These should be optional and
> should get passed to any catalog.
>
> The implementation of TableCatalog backed by the current ExternalCatalog
> can continue to use format / USING to choose the data source directly,
> but there’s no requirement for other catalogs to do that because there are
> no other catalogs right now. Passing this to an Iceberg catalog could
> determine whether Iceberg’s underlying storage is “avro” or “parquet”, even
> though Iceberg uses a different data source implementation.
>
> Third open question: *How should path-based tables work?*
>
> First, path-based tables need clearly defined behavior. That’s missing
> today. I’ve heard people cite the “feature” that you can write a different
> schema to a path-based JSON table without needing to run an “alter table”
> on it to update the schema. If this is behavior we want to preserve (and I
> think it is) then we need to clearly state what that behavior is.
>
> Second, I think that we can build a TableCatalog-like interface to handle
> path tables.
>
> rb
>
> On Tue, Jul 31, 2018 at 7:58 AM Wenchen Fan  wrote:
>
>> Here is my interpretation of your proposal; please correct me if
>> something is wrong.
>>
>> End users can read/write a data source with its name and some options.
>> e.g. `df.read.format("xyz").option(...).load`. This is currently the only
>> end-user API for data source v2, and is widely used by Spark users to
>> read/write data source v1 and file sources, we should still support it. We
>> will add more end-user APIs in the future, once we standardize the DDL
>> logical plans.
>>
>> If a data source wants to be used with tables, then it must implement
>> some catalog functionalities. At least it 

Re: Writing file

2018-08-01 Thread mattbuttow
Thank you cloud0fan. That's really helpful.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
