Re: [build system] jenkins master unreachable, build system currently down

2018-04-30 Thread Xiao Li
Hi, Shane,

Thank you!

Xiao

2018-04-30 20:27 GMT-07:00 shane knapp:
> we just noticed that we're unable to connect to jenkins, and have reached
> out to our NOC support staff at our colo. until we hear back, there's
> nothing we can do.
>
> i'll update the list as soon as

[build system] jenkins master unreachable, build system currently down

2018-04-30 Thread shane knapp
we just noticed that we're unable to connect to jenkins, and have reached out to our NOC support staff at our colo. until we hear back, there's nothing we can do. i'll update the list as soon as i hear something. sorry for the inconvenience!

shane

--
Shane Knapp
UC Berkeley EECS Research /

Identifying specific persisted DataFrames via getPersistentRDDs()

2018-04-30 Thread Nicholas Chammas
This seems to be an underexposed part of the API. My use case is this: I want to unpersist all DataFrames except a specific few. I want to do this because I know at a specific point in my pipeline that I have a handful of DataFrames that I need, and everything else is no longer needed. The
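[Editor's note] The pattern described here — unpersist everything except a known keep-set — can be sketched against the id → RDD map that `SparkContext.getPersistentRDDs` returns. The snippet below models only the bookkeeping in plain Python; `CachedRDD` and `unpersist_all_except` are illustrative stand-ins, not part of the Spark API.

```python
# Sketch: unpersist all cached datasets except a chosen few.
# A plain dict stands in for the id -> RDD map returned by
# sc.getPersistentRDDs(); CachedRDD is a stand-in for a persisted RDD.

class CachedRDD:
    def __init__(self, rdd_id):
        self.id = rdd_id
        self.persisted = True

    def unpersist(self):
        self.persisted = False

def unpersist_all_except(persistent_rdds, keep_ids):
    """Unpersist every RDD whose id is not in keep_ids."""
    for rdd_id, rdd in persistent_rdds.items():
        if rdd_id not in keep_ids:
            rdd.unpersist()

cache = {i: CachedRDD(i) for i in range(5)}
unpersist_all_except(cache, keep_ids={1, 3})
survivors = sorted(i for i, r in cache.items() if r.persisted)
print(survivors)  # -> [1, 3]
```

With a real SparkContext the same loop would iterate `sc.getPersistentRDDs().items()` and call `unpersist()` on each entry to drop; the underexposed part the email points at is mapping a specific DataFrame back to its entry in that map, which the public API does not make easy.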

Re: Sorting on a streaming dataframe

2018-04-30 Thread Michael Armbrust
Please open a JIRA then!

On Fri, Apr 27, 2018 at 3:59 AM Hemant Bhanawat wrote:
> I see.
>
> monotonically_increasing_id on streaming dataFrames will be really helpful
> to me and I believe to many more users. Adding this functionality in Spark
> would be efficient in
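[Editor's note] For context on why this is non-trivial on streams: `monotonically_increasing_id` builds each id from the partition number and the row's position within that partition, which is well-defined for a static DataFrame but has no obvious continuation across micro-batches. A minimal pure-Python model of the documented batch layout (partition id in the upper bits, per-partition record number in the lower 33 bits):

```python
def monotonically_increasing_ids(partitions):
    """Model of Spark's monotonically_increasing_id: for each row,
    id = (partition_index << 33) + row_index_within_partition."""
    ids = []
    for p_idx, rows in enumerate(partitions):
        for r_idx, _ in enumerate(rows):
            ids.append((p_idx << 33) + r_idx)
    return ids

# two partitions with 3 and 2 rows
ids = monotonically_increasing_ids([["a", "b", "c"], ["d", "e"]])
print(ids)  # -> [0, 1, 2, 8589934592, 8589934593]
```

The ids are unique and increasing within a partition but not consecutive across partitions; extending this to a streaming DataFrame would additionally require remembering, per partition, how many rows earlier micro-batches produced — presumably why it needs explicit design work (and a JIRA).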

re: sharing data via kafka broker using spark streaming/ AnalysisException on collect()

2018-04-30 Thread Peter Liu
Hello there,

I have a quick question regarding how to share data (a small data collection) between a kafka producer and consumer using spark streaming (spark 2.2):

(A) the data published by a kafka producer is received in order on the kafka consumer side (see (a) copied below).
(B) however,

Re: Datasource API V2 and checkpointing

2018-04-30 Thread Joseph Torres
I'd argue that letting bad cases influence the design is an explicit goal of DataSourceV2. One of the primary motivations for the project was that file sources hook into a series of weird internal side channels, with favorable performance characteristics that are difficult to match in the API we

Re: Datasource API V2 and checkpointing

2018-04-30 Thread Ryan Blue
Should we really plan the API for a source with state that grows indefinitely? It sounds like we're letting a bad case influence the design, when we probably shouldn't.

On Mon, Apr 30, 2018 at 11:05 AM, Joseph Torres <joseph.tor...@databricks.com> wrote:
> Offset is just a type alias for

Re: Datasource API V2 and checkpointing

2018-04-30 Thread Joseph Torres
Offset is just a type alias for arbitrary JSON-serializable state. Most implementations should (and do) just toss the blob at Spark and let Spark handle recovery on its own. In the case of file streams, the obstacle is that the conceptual offset is very large: a list of every file which the
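[Editor's note] The "toss the blob at Spark" contract can be sketched as follows — a source exposes its offset as JSON, the engine persists it to the checkpoint log, and on restart hands the JSON back for deserialization. The names here (`KafkaLikeOffset`, `json`, `from_json`) are illustrative, not the actual DataSourceV2 interfaces:

```python
import json

class KafkaLikeOffset:
    """Illustrative offset: a map of partition -> position,
    JSON-serializable so the engine can checkpoint it as an opaque blob."""

    def __init__(self, positions):
        self.positions = positions  # e.g. {"topic-0": 42, "topic-1": 17}

    def json(self):
        return json.dumps(self.positions, sort_keys=True)

    @classmethod
    def from_json(cls, blob):
        return cls(json.loads(blob))

# engine side of the contract: persist the blob, restore it on recovery
offset = KafkaLikeOffset({"topic-0": 42, "topic-1": 17})
checkpointed = offset.json()            # written to the checkpoint log
recovered = KafkaLikeOffset.from_json(checkpointed)
print(recovered.positions == offset.positions)  # -> True
```

The file-source problem the thread describes is that its conceptual offset — a list of every file ever seen — makes this blob unboundedly large, which is why those sources bypass the contract with their own side channels.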

Re: Datasource API V2 and checkpointing

2018-04-30 Thread Ryan Blue
Why don't we just have the source return a Serializable of state when it reports offsets? Then Spark could handle storing the source's state and the source wouldn't need to worry about file system paths. I think that would be easier for implementations and better for recovery because it wouldn't
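[Editor's note] The proposal above can be sketched the same way: the source would return its offset together with an opaque serializable state blob, and the engine would checkpoint both — so recovery never depends on a file system path the source chose. The class and method names below are assumptions for illustration, not a proposed Spark API:

```python
import json

class FileLikeSource:
    """Illustrative source whose state is the set of files already seen.
    It hands that state to the engine instead of persisting it itself."""

    def __init__(self, state=None):
        self.seen_files = set(state or [])

    def get_offset_and_state(self):
        # offset and state are returned together; the engine
        # stores both in its own checkpoint log
        offset = len(self.seen_files)
        state = json.dumps(sorted(self.seen_files))
        return offset, state

# engine side: checkpoint the state, "crash", rebuild the source from it
source = FileLikeSource()
source.seen_files.update({"a.parquet", "b.parquet"})
offset, state = source.get_offset_and_state()

recovered = FileLikeSource(state=json.loads(state))
print(recovered.seen_files == source.seen_files)  # -> True
```

The trade-off raised elsewhere in the thread is that for file sources this state grows with every file ever read, so the engine would be checkpointing an ever-larger blob on each trigger.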

Re: [Kubernetes] structured-streaming driver restarts / roadmap

2018-04-30 Thread Oz Ben Ami
This would be useful to us, so I've created a JIRA ticket for this discussion: https://issues.apache.org/jira/browse/SPARK-24122

On Wed, Mar 28, 2018 at 10:28 AM, Anirudh Ramanathan <ramanath...@google.com.invalid> wrote:
> We discussed this early on in our fork and I think we should have this