Re: [DISCUSS] Time to evaluate "continuous mode" in SS?

2020-09-15 Thread Joseph Torres
It's worth noting that the push-based shuffle SPIP currently in progress addresses a substantial blocker in the area. If you remember when we removed the half-finished stateful query support, the lack of that functionality and the challenge of implementing it is basically why it was half-finished.

Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-14 Thread Joseph Torres
+1 On Mon, Sep 14, 2020 at 6:39 PM angers.zhu wrote: > +1 > > angers.zhu > angers@gmail.com > >

Re: comparable and orderable CalendarInterval

2020-02-11 Thread Joseph Torres
The problem is that there isn't a consistent number of seconds an interval represents - as Wenchen mentioned, a month interval isn't a fixed number of days. If your use case can account for that, maybe you could add the interval to a fixed reference date and then compare the result. On Tue, Feb 11

Re: Welcoming some new committers and PMC members

2019-09-09 Thread Joseph Torres
congratulations! On Mon, Sep 9, 2019 at 6:27 PM 王 斐 wrote: > congratulations! > > 获取 Outlook for iOS > > -- > *发件人:* Ye Xianjin > *发送时间:* 星期二, 九月 10, 2019 09:26 > *收件人:* Jeff Zhang > *抄送:* Saisai Shao; dev > *主题:* Re: Welcoming some new commit

Re: Partitions at DataSource API V2

2019-03-13 Thread Joseph Torres
The reader necessarily knows the number of partitions, since it's responsible for generating its output partitions in the first place. I won't speak for everyone, but it would make sense to me to pass in a Partitioning instance to the writer, since it's already part of the v2 interface through the

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Joseph Torres
I'm not worried about rushing. I worry that, without clear parameters for the amount or types of DSv2 delays that are acceptable, we might end up holding back 3.0 indefinitely to meet the deadline when we wouldn't have made that decision de novo. (Or even worse, the PMC eventually feels they must r

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Joseph Torres
I’m sure we, as a community, will seriously consider any proposal that Spark would benefit if the PMC delays release X to include changes A, B, C. Indeed, every release I remember has had a few iterations of “can we hold the train for a bit because it would be super great to get this PR in”. Many

Re: DSv2 question

2019-01-24 Thread Joseph Torres
I wouldn't be opposed to also documenting that we canonicalize the keys as lowercase, but the case-insensitivity is I think the primary property. It's important to call out that data source developers don't have to worry about a semantic difference between option("mykey", "value") and option("myKey

Re: [Proposal] New feature: reconfigurable number of partitions on stateful operators in Structured Streaming

2018-08-03 Thread Joseph Torres
itions and making it key based would > still be worth exploring to support dynamic state rebalancing. May be the > default HDFS based implementation can maintain the state partition wise and > not support it, but there could be implementations based on distributed k-v > store which support

Re: [Proposal] New feature: reconfigurable number of partitions on stateful operators in Structured Streaming

2018-08-03 Thread Joseph Torres
ch is excluded in query previously. > Moreover, it can’t find existing row from state or store row in wrong > partition if partition id doesn’t match the expected id via hashing > function. > > Could you verify coalesce() meets such requirements? > > On Fri, 3 Aug 2018 at 22:23 Jose

Re: [Proposal] New feature: reconfigurable number of partitions on stateful operators in Structured Streaming

2018-08-03 Thread Joseph Torres
Scheduling multiple partitions in the same task is basically what coalesce() does. Is there a reason that doesn't work here? On Fri, Aug 3, 2018 at 5:55 AM, Jungtaek Lim wrote: > Here's a link for Google docs (anyone can comment): > https://docs.google.com/document/d/1DEOW3WQcPUq0YFgazkZx6Ei6EOd

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Joseph Torres
Full continuous processing aggregation support ran into unanticipated scalability and scheduling problems. We’re planning to overcome those by using some of the barrier execution machinery, but since barrier execution itself is still in progress the full support isn’t going to make it into 2.4. Jo

Re: JDBC Data Source and customSchema option but DataFrameReader.assertNoSpecifiedSchema?

2018-07-16 Thread Joseph Torres
I guess the question is partly about the semantics of DataFrameReader.schema. If it's supposed to mean "the loaded dataframe will definitely have exactly this schema", that doesn't quite match the behavior of the customSchema option. If it's only meant to be an arbitrary schema input which the sour

Re: TextSocketMicroBatchReader no longer supports nc utility

2018-06-04 Thread Joseph Torres
I tend to agree that this is a bug. It's kinda silly that nc does this, but a socket connector that doesn't work with netcat will surely seem broken to users. It wouldn't be a huge change to defer opening the socket until a read is actually required. On Sun, Jun 3, 2018 at 9:55 PM, Jungtaek Lim w

Design proposal for streaming APIs in data source V2

2018-05-24 Thread Joseph Torres
Hi all, https://docs.google.com/document/d/1VzxEuvpLfuHKL6vJO9qJ6ug0x9J_gLoLSH_vJL3-Cho I've finished a full design proposal for streaming APIs in data source V2, following up on my earlier doc with just the writer. Please take a look. (Note that slightly different versions of the APIs already e

Design for continuous processing shuffle

2018-05-04 Thread Joseph Torres
Hi all, A few of us have been working on a design for how to do shuffling in continuous processing. Feel free to chip in if you have any comments or questions. doc: https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE continuous processing SPIP: https://issues.apache.o

Re: [Structured streaming, V2] commit on ContinuousReader

2018-05-03 Thread Joseph Torres
In the master branch, we currently call this method in ContinuousExecution.commit(). Note that the ContinuousReader API is experimental and undergoing active design work. We will definitely include some kind of functionality to back-commit data once it's been processed, but the handle we eventuall

Re: Datasource API V2 and checkpointing

2018-05-01 Thread Joseph Torres
iles' modified timestamps? > > For most sources, I think Spark should handle state serialization and > recovery. Maybe we can find a good way to make the file source with > unbounded state work, but this shouldn't be one of the driving cases for > the design and con

Re: Datasource API V2 and checkpointing

2018-04-30 Thread Joseph Torres
h state that grows > indefinitely? It sounds like we're letting a bad case influence the > design, when we probably shouldn't. > > On Mon, Apr 30, 2018 at 11:05 AM, Joseph Torres < > joseph.tor...@databricks.com> wrote: > >> Offset is just a type alias for arbitra

Re: Datasource API V2 and checkpointing

2018-04-30 Thread Joseph Torres
7;t > leave unknown state on a single machine's file system. > > rb > > On Fri, Apr 27, 2018 at 9:23 AM, Joseph Torres < > joseph.tor...@databricks.com> wrote: > >> The precise interactions with the DataSourceV2 API haven't yet been >> hammered out in desi

Re: Datasource API V2 and checkpointing

2018-04-27 Thread Joseph Torres
The precise interactions with the DataSourceV2 API haven't yet been hammered out in design. But much of this comes down to the core of Structured Streaming rather than the API details. The execution engine handles checkpointing and recovery. It asks the streaming data source for offsets, and then

Re: Transform plan with scope

2018-04-24 Thread Joseph Torres
Is there some transformation we'd want to apply to that tree, but can't because we have no concept of scope? It's already possible for a plan rule to traverse each node's subtree if it wants. On Tue, Apr 24, 2018 at 10:18 AM, Marco Gaido wrote: > Hi all, > > working on SPARK-24051 I realized tha

Re: [discuss][data source v2] remove type parameter in DataReader/WriterFactory

2018-04-18 Thread Joseph Torres
The fundamental difficulty seems to be that there's a spurious "round-trip" in the API. Spark inspects the source to determine what type it's going to provide, picks an appropriate method according to that type, and then calls that method on the source to finally get what it wants. Pushing this out

Re: Spark 2.3 V2 Datasource API questions

2018-04-06 Thread Joseph Torres
Thanks for trying it out! We haven't hooked continuous streaming up to query.status or query.recentProgress yet - commit() should be called under the hood, we just don't yet report that it is. I've filed SPARK-23886 and SPARK-23887 to track the work to add those things. The issue with printing wa

Re: Beginner searching for guidance with Jira and issues

2018-03-20 Thread Joseph Torres
Hi! I can't speak for the other tasks, but SPARK-23444 I'd expect to be pretty complicated. It's not obvious what the right strategy is, and there's a bunch of minor stuff that needs to be cleaned up (e.g. tasks shouldn't print cancellation warnings when cancellation is expected). If you're inter

[DISCUSS] Structured Streaming writers in DataSourceV2

2018-03-09 Thread Joseph Torres
Hi all, I've been working for the past few months on figuring out a DataSourceV2 compatible interface for Structured Streaming sinks. I've written a document proposing an overarching design; please take a look and le

Re: [VOTE] Spark 2.3.0 (RC2)

2018-01-25 Thread Joseph Torres
SPARK-23221 fixes an issue specific to KafkaContinuousSourceStressForDontFailOnDataLossSuite; I don't think it could cause other suites to deadlock. Do note that the previous hang issues we saw caused by SPARK-23055 were correctly marked as failures. On Thu, Jan 25, 2018 at 3:40 PM, Shixiong(Ryan