Re: [build system] jenkins master unreachable, build system currently down

2018-04-30 Thread Xiao Li
Hi, Shane,

Thank you!

Xiao

2018-04-30 20:27 GMT-07:00 shane knapp :

> we just noticed that we're unable to connect to jenkins, and have reached
> out to our NOC support staff at our colo.  until we hear back, there's
> nothing we can do.
>
> i'll update the list as soon as i hear something.  sorry for the
> inconvenience!
>
> shane
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


[build system] jenkins master unreachable, build system currently down

2018-04-30 Thread shane knapp
we just noticed that we're unable to connect to jenkins, and have reached
out to our NOC support staff at our colo.  until we hear back, there's
nothing we can do.

i'll update the list as soon as i hear something.  sorry for the
inconvenience!

shane
-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Identifying specific persisted DataFrames via getPersistentRDDs()

2018-04-30 Thread Nicholas Chammas
This seems to be an underexposed part of the API. My use case is this: I
want to unpersist all DataFrames except a specific few. I want to do this
because I know at a specific point in my pipeline that I have a handful of
DataFrames that I need, and everything else is no longer needed.

The problem is that there doesn’t appear to be a way to identify specific
DataFrames (or rather, their underlying RDDs) via getPersistentRDDs(),
which is the only way I’m aware of to ask Spark for all currently persisted
RDDs:

>>> a = spark.range(10).persist()
>>> a.rdd.id()
8
>>> list(spark.sparkContext._jsc.getPersistentRDDs().items())
[(3, JavaObject id=o36)]

As you can see, the id of the persisted RDD, 8, doesn’t match the id
returned by getPersistentRDDs(), 3. So I can’t go through the RDDs returned
by getPersistentRDDs() and know which ones I want to keep.

id() itself appears to be an undocumented method of the RDD API, and in
PySpark getPersistentRDDs() is buried behind the Java sub-objects, so I know
I’m reaching here. But is there a way to do what I want in PySpark without
manually tracking everything I’ve persisted myself?
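
For reference, a rough sketch of what that manual tracking can look like (this is
not an existing Spark API, just a registry you would maintain yourself; the name
PersistRegistry is illustrative, and the idea carries over to PySpark even though
it is shown here in Scala):

import scala.collection.mutable
import org.apache.spark.sql.DataFrame

object PersistRegistry {
  // DataFrames we have persisted ourselves, tracked by object identity.
  private val persisted = mutable.Set.empty[DataFrame]

  def persistTracked(df: DataFrame): DataFrame = {
    persisted += df
    df.persist()
  }

  // Unpersist every tracked DataFrame except the ones still needed.
  def unpersistAllExcept(keep: Set[DataFrame]): Unit = {
    val toDrop = persisted.toSet -- keep
    toDrop.foreach { df =>
      df.unpersist()
      persisted -= df
    }
  }
}

With something like this, the cleanup point in the pipeline becomes
PersistRegistry.unpersistAllExcept(Set(dfA, dfB)) for the handful of DataFrames
that must stay cached.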

And more broadly speaking, do we want to add additional APIs, or formalize
currently undocumented APIs like id(), to make this use case possible?

Nick
​


Re: Sorting on a streaming dataframe

2018-04-30 Thread Michael Armbrust
Please open a JIRA then!

On Fri, Apr 27, 2018 at 3:59 AM Hemant Bhanawat 
wrote:

> I see.
>
> monotonically_increasing_id on streaming dataFrames will be really helpful
> to me and I believe to many more users. Adding this functionality in Spark
> would be efficient in terms of performance as compared to implementing this
> functionality inside the applications.
>
> Hemant
>
> On Thu, Apr 26, 2018 at 11:59 PM, Michael Armbrust  > wrote:
>
>> The basic tenet of structured streaming is that a query should return the
>> same answer in streaming or batch mode. We support sorting in complete mode
>> because we have all the data and can sort it correctly and return the full
>> answer.  In update or append mode, sorting would only return a correct
>> answer if we could promise that records that sort lower are going to arrive
>> later (and we can't).  Therefore, it is disallowed.
>>
>> If you are just looking for a unique, stable id and you are already using
>> kafka as the source, you could just combine the partition id and the
>> offset. The structured streaming connector to Kafka
>> 
>> exposes both of these in the schema of the streaming DataFrame. (similarly
>> for kinesis you can use the shard id and sequence number)
>>
>> If you need the IDs to be contiguous, then this is a somewhat
>> fundamentally hard problem.  I think the best we could do is add support
>> for monotonically_increasing_id() in streaming dataframes.
>>
>> On Tue, Apr 24, 2018 at 1:38 PM, Chayapan Khannabha 
>> wrote:
>>
>>> Perhaps your use case fits to Apache Kafka better.
>>>
>>> More info at:
>>> https://kafka.apache.org/documentation/streams/
>>>
>>> Everything really comes down to the architecture design and algorithm
>>> spec. However, from my experience with Spark, there are many good reasons
>>> why this requirement is not supported ;)
>>>
>>> Best,
>>>
>>> Chayapan (A)
>>>
>>>
>>> On Apr 24, 2018, at 2:18 PM, Hemant Bhanawat 
>>> wrote:
>>>
>>> Thanks Chris. There are many ways in which I can solve this problem but
>>> they are cumbersome. The easiest way would have been to sort the streaming
>>> dataframe. The reason I asked this question is because I could not find a
>>> reason why sorting on streaming dataframe is disallowed.
>>>
>>> Hemant
>>>
>>> On Mon, Apr 16, 2018 at 6:09 PM, Bowden, Chris <
>>> chris.bow...@microfocus.com> wrote:
>>>
 You can happily sort the underlying RDD of InternalRow(s) inside a
 sink, assuming you are willing to implement and maintain your own sink(s).
 That is, just grabbing the parquet sink, etc. isn’t going to work out of
 the box. Alternatively map/flatMapGroupsWithState is probably sufficient
 and requires less working knowledge to make effective reuse of internals.
 Just group by foo and then sort accordingly and assign ids. The id counter
 can be stateful per group. Sometimes this problem may not need to be solved
 at all. For example, if you are using kafka, a proper partitioning scheme
 and message offsets may be “good enough”.
 --
 *From:* Hemant Bhanawat 
 *Sent:* Thursday, April 12, 2018 11:42:59 PM
 *To:* Reynold Xin
 *Cc:* dev
 *Subject:* Re: Sorting on a streaming dataframe

 Well, we want to assign snapshot ids (incrementing counters) to the
 incoming records. For that, we are zipping the streaming rdds with that
 counter using a modified version of ZippedWithIndexRDD. We are ok if the
 records in the streaming dataframe gets counters in random order but the
 counter should always be incrementing.

 This is working fine until we have a failure. When we have a failure,
 we re-assign the records to snapshot ids  and this time same snapshot id
 can get assigned to a different record. This is a problem because the
 primary key in our storage engine is . So we want to
 sort the dataframe so that the records always get the same snapshot id.



 On Fri, Apr 13, 2018 at 11:43 AM, Reynold Xin 
 wrote:

 Can you describe your use case more?

 On Thu, Apr 12, 2018 at 11:12 PM Hemant Bhanawat 
 wrote:

 Hi Guys,

 Why is sorting on streaming dataframes not supported(unless it is
 complete mode)? My downstream needs me to sort the streaming dataframe.

 Hemant



>>>
>>>
>>
>
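
For reference, a minimal sketch of the partition-plus-offset id Michael describes
(the partition and offset columns are part of the Kafka source's schema; the broker
address, topic name, and the 2^40 shift below are illustrative placeholders, not
values from this thread):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("kafka-stable-id").getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

// Pack (partition, offset) into a single long: unique and stable across replays
// of the same batch, but not contiguous.
val withId = stream.withColumn(
  "record_id", col("partition").cast("long") * (1L << 40) + col("offset"))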


re: sharing data via kafka broker using spark streaming/ AnalysisException on collect()

2018-04-30 Thread Peter Liu
Hello there,

I have a quick question regarding how to share data (a small data
collection) between a kafka producer and consumer using spark streaming
(spark 2.2):

(A)
the data published by a kafka producer is received in order on the kafka
consumer side (see (a) copied below).

(B)
however, collect() or cache() on a streaming dataframe does not seem to be
supported (see the links in (b) below); I got this:
Exception in thread "DataProducer" org.apache.spark.sql.AnalysisException:
Queries with streaming sources must be executed with writeStream.start();;

(C)
My question would be:

--- How can I use the collection data (on a streaming dataframe) that arrives on
the consumer side, e.g. convert it to an array of objects?
--- Or is there another quick way to use kafka for sharing static data
(instead of streaming) between two spark application services (without any
common spark context, session, etc.)?

I have copied some code snippet in (c).

It seems like a very simple use case: sharing a global collection between a
spark producer and consumer. But I spent an entire day trying various options
and going through online resources (general google searches, apache spark docs,
stackoverflow, cloudera, etc.).

Any help would be very much appreciated!

Thanks!

Peter

(a) streaming data (df) received on the consumer side (console sink):

root
 |-- ad_id: string (nullable = true)
 |-- campaign_id: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)

---
Batch: 0
---
+------------------------------------+------------------------------------+-----------------------+
|ad_id                               |campaign_id                         |timestamp              |
+------------------------------------+------------------------------------+-----------------------+
|b5629b58-376e-462c-9e65-726184390c84|bf412fa4-aeaa-4148-8467-1e1e2a6e0945|2018-04-27 14:35:45.475|
|64e93f73-15bb-478c-9f96-fd38f6a24da2|bf412fa4-aeaa-4148-8467-1e1e2a6e0945|2018-04-27 14:35:45.475|
|05fa1349-fcb3-432e-9b58-2bb0559859a2|060810fd-0430-444f-808c-8a177613226a|2018-04-27 14:35:45.478|
|ae0a176e-236a-4d3a-acb9-141157e81568|42b68023-6a3a-4618-a54a-e6f71df26710|2018-04-27 14:35:45.484|

(b) online discussions on unsupported operations on streaming dataframe:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations



https://stackoverflow.com/questions/42062092/why-does-using-cache-on-streaming-datasets-fail-with-analysisexception-queries


(c) code snippet:

OK:

   val rawDf = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaCluster.kafkaNodesString)
  .option("startingOffsets", "earliest")
  .option("subscribe", Variables.CAMPAIGNS_TOPIC)
  .load()

OK:

val mySchema = StructType(Array(
  StructField("ad_id", StringType),
  StructField("campaign_id", StringType)))

val campaignsDf2 = campaignsDf
  .select(from_json($"value", mySchema).as("data"), $"timestamp")
  .select("data.*", "timestamp")

OK:

   campaignsDf2.writeStream
     .format("console")
     .option("truncate", "false")
     .trigger(org.apache.spark.sql.streaming.Trigger.Once()) // trigger once, since this is one-time static data
     .start()
     .awaitTermination()


Exception:
  val campaignsArrayRows = campaignsDf2.collect()  // <-- not supported: AnalysisException!
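
One possible workaround, as a sketch rather than the only approach: route the one-off
stream into the in-memory table sink and collect from the resulting batch table. The
query/table name "campaigns_snapshot" below is just an illustrative choice, and the
memory sink gathers all rows on the driver, which is fine here since the collection
is small.

val query = campaignsDf2.writeStream
  .format("memory")
  .queryName("campaigns_snapshot")   // registers an in-memory table of the processed rows
  .trigger(org.apache.spark.sql.streaming.Trigger.Once())
  .start()

query.awaitTermination()   // wait for the one-off trigger to finish

// campaigns_snapshot is a plain batch table, so collect() is allowed here
val campaignsArrayRows = spark.table("campaigns_snapshot").collect()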


Re: Datasource API V2 and checkpointing

2018-04-30 Thread Joseph Torres
I'd argue that letting bad cases influence the design is an explicit goal
of DataSourceV2. One of the primary motivations for the project was that
file sources hook into a series of weird internal side channels, with
favorable performance characteristics that are difficult to match in the
API we actually declare to Spark users. So a design that we can't migrate
file sources to without a side channel would be worrying; won't we end up
regressing to the same situation?

On Mon, Apr 30, 2018 at 11:59 AM, Ryan Blue  wrote:

> Should we really plan the API for a source with state that grows
> indefinitely? It sounds like we're letting a bad case influence the
> design, when we probably shouldn't.
>
> On Mon, Apr 30, 2018 at 11:05 AM, Joseph Torres <
> joseph.tor...@databricks.com> wrote:
>
>> Offset is just a type alias for arbitrary JSON-serializable state. Most
>> implementations should (and do) just toss the blob at Spark and let Spark
>> handle recovery on its own.
>>
>> In the case of file streams, the obstacle is that the conceptual offset
>> is very large: a list of every file which the stream has ever read. In
>> order to parse this efficiently, the stream connector needs detailed
>> control over how it's stored; the current implementation even has complex
>> compactification and retention logic.
>>
>>
>> On Mon, Apr 30, 2018 at 10:48 AM, Ryan Blue  wrote:
>>
>>> Why don't we just have the source return a Serializable of state when it
>>> reports offsets? Then Spark could handle storing the source's state and the
>>> source wouldn't need to worry about file system paths. I think that would
>>> be easier for implementations and better for recovery because it wouldn't
>>> leave unknown state on a single machine's file system.
>>>
>>> rb
>>>
>>> On Fri, Apr 27, 2018 at 9:23 AM, Joseph Torres <
>>> joseph.tor...@databricks.com> wrote:
>>>
 The precise interactions with the DataSourceV2 API haven't yet been
 hammered out in design. But much of this comes down to the core of
 Structured Streaming rather than the API details.

 The execution engine handles checkpointing and recovery. It asks the
 streaming data source for offsets, and then determines that batch N
 contains the data between offset A and offset B. On recovery, if batch N
 needs to be re-run, the execution engine just asks the source for the same
 offset range again. Sources also get a handle to their own subfolder of the
 checkpoint, which they can use as scratch space if they need. For example,
 Spark's FileStreamReader keeps a log of all the files it's seen, so its
 offsets can be simply indices into the log rather than huge strings
 containing all the paths.

 SPARK-23323 is orthogonal. That commit coordinator is responsible for
 ensuring that, within a single Spark job, two different tasks can't commit
 the same partition.

 On Fri, Apr 27, 2018 at 8:53 AM, Thakrar, Jayesh <
 jthak...@conversantmedia.com> wrote:

> Wondering if this issue is related to SPARK-23323?
>
>
>
> Any pointers will be greatly appreciated….
>
>
>
> Thanks,
>
> Jayesh
>
>
>
> *From: *"Thakrar, Jayesh" 
> *Date: *Monday, April 23, 2018 at 9:49 PM
> *To: *"dev@spark.apache.org" 
> *Subject: *Datasource API V2 and checkpointing
>
>
>
> I was wondering when checkpointing is enabled, who does the actual
> work?
>
> The streaming datasource or the execution engine/driver?
>
>
>
> I have written a small/trivial datasource that just generates strings.
>
> After enabling checkpointing, I do see a folder being created under
> the checkpoint folder, but there's nothing else in there.
>
>
>
> Same question for write-ahead and recovery?
>
> And on a restart from a failed streaming session - who should set the
> offsets?
>
> The driver/Spark or the datasource?
>
>
>
> Any pointers to design docs would also be greatly appreciated.
>
>
>
> Thanks,
>
> Jayesh
>
>
>


>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Datasource API V2 and checkpointing

2018-04-30 Thread Ryan Blue
Should we really plan the API for a source with state that grows
indefinitely? It sounds like we're letting a bad case influence the design,
when we probably shouldn't.

On Mon, Apr 30, 2018 at 11:05 AM, Joseph Torres <
joseph.tor...@databricks.com> wrote:

> Offset is just a type alias for arbitrary JSON-serializable state. Most
> implementations should (and do) just toss the blob at Spark and let Spark
> handle recovery on its own.
>
> In the case of file streams, the obstacle is that the conceptual offset is
> very large: a list of every file which the stream has ever read. In order
> to parse this efficiently, the stream connector needs detailed control over
> how it's stored; the current implementation even has complex
> compactification and retention logic.
>
>
> On Mon, Apr 30, 2018 at 10:48 AM, Ryan Blue  wrote:
>
>> Why don't we just have the source return a Serializable of state when it
>> reports offsets? Then Spark could handle storing the source's state and the
>> source wouldn't need to worry about file system paths. I think that would
>> be easier for implementations and better for recovery because it wouldn't
>> leave unknown state on a single machine's file system.
>>
>> rb
>>
>> On Fri, Apr 27, 2018 at 9:23 AM, Joseph Torres <
>> joseph.tor...@databricks.com> wrote:
>>
>>> The precise interactions with the DataSourceV2 API haven't yet been
>>> hammered out in design. But much of this comes down to the core of
>>> Structured Streaming rather than the API details.
>>>
>>> The execution engine handles checkpointing and recovery. It asks the
>>> streaming data source for offsets, and then determines that batch N
>>> contains the data between offset A and offset B. On recovery, if batch N
>>> needs to be re-run, the execution engine just asks the source for the same
>>> offset range again. Sources also get a handle to their own subfolder of the
>>> checkpoint, which they can use as scratch space if they need. For example,
>>> Spark's FileStreamReader keeps a log of all the files it's seen, so its
>>> offsets can be simply indices into the log rather than huge strings
>>> containing all the paths.
>>>
>>> SPARK-23323 is orthogonal. That commit coordinator is responsible for
>>> ensuring that, within a single Spark job, two different tasks can't commit
>>> the same partition.
>>>
>>> On Fri, Apr 27, 2018 at 8:53 AM, Thakrar, Jayesh <
>>> jthak...@conversantmedia.com> wrote:
>>>
 Wondering if this issue is related to SPARK-23323?



 Any pointers will be greatly appreciated….



 Thanks,

 Jayesh



 *From: *"Thakrar, Jayesh" 
 *Date: *Monday, April 23, 2018 at 9:49 PM
 *To: *"dev@spark.apache.org" 
 *Subject: *Datasource API V2 and checkpointing



 I was wondering when checkpointing is enabled, who does the actual work?

 The streaming datasource or the execution engine/driver?



 I have written a small/trivial datasource that just generates strings.

 After enabling checkpointing, I do see a folder being created under the
 checkpoint folder, but there's nothing else in there.



 Same question for write-ahead and recovery?

 And on a restart from a failed streaming session - who should set the
 offsets?

 The driver/Spark or the datasource?



 Any pointers to design docs would also be greatly appreciated.



 Thanks,

 Jayesh



>>>
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: Datasource API V2 and checkpointing

2018-04-30 Thread Joseph Torres
Offset is just a type alias for arbitrary JSON-serializable state. Most
implementations should (and do) just toss the blob at Spark and let Spark
handle recovery on its own.

In the case of file streams, the obstacle is that the conceptual offset is
very large: a list of every file which the stream has ever read. In order
to parse this efficiently, the stream connector needs detailed control over
how it's stored; the current implementation even has complex
compactification and retention logic.
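
For reference, a minimal sketch of what such an offset can look like for a simple
source, assuming the DataSourceV2 streaming Offset abstraction (in the 2.3 line,
org.apache.spark.sql.sources.v2.reader.streaming.Offset, whose contract is essentially
a json() method; treat the exact package path as an assumption of this sketch). The
counter-based offset is purely illustrative and is not how FileStreamSource works:

import org.apache.spark.sql.sources.v2.reader.streaming.Offset

// Spark stores the json() string in its offset log and hands it back on recovery;
// the source only has to round-trip its own state through that string.
class CounterOffset(val value: Long) extends Offset {
  override def json(): String = value.toString
}

object CounterOffset {
  // Rebuild the offset from the checkpointed JSON on restart.
  def fromJson(json: String): CounterOffset = new CounterOffset(json.trim.toLong)
}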


On Mon, Apr 30, 2018 at 10:48 AM, Ryan Blue  wrote:

> Why don't we just have the source return a Serializable of state when it
> reports offsets? Then Spark could handle storing the source's state and the
> source wouldn't need to worry about file system paths. I think that would
> be easier for implementations and better for recovery because it wouldn't
> leave unknown state on a single machine's file system.
>
> rb
>
> On Fri, Apr 27, 2018 at 9:23 AM, Joseph Torres <
> joseph.tor...@databricks.com> wrote:
>
>> The precise interactions with the DataSourceV2 API haven't yet been
>> hammered out in design. But much of this comes down to the core of
>> Structured Streaming rather than the API details.
>>
>> The execution engine handles checkpointing and recovery. It asks the
>> streaming data source for offsets, and then determines that batch N
>> contains the data between offset A and offset B. On recovery, if batch N
>> needs to be re-run, the execution engine just asks the source for the same
>> offset range again. Sources also get a handle to their own subfolder of the
>> checkpoint, which they can use as scratch space if they need. For example,
>> Spark's FileStreamReader keeps a log of all the files it's seen, so its
>> offsets can be simply indices into the log rather than huge strings
>> containing all the paths.
>>
>> SPARK-23323 is orthogonal. That commit coordinator is responsible for
>> ensuring that, within a single Spark job, two different tasks can't commit
>> the same partition.
>>
>> On Fri, Apr 27, 2018 at 8:53 AM, Thakrar, Jayesh <
>> jthak...@conversantmedia.com> wrote:
>>
>>> Wondering if this issue is related to SPARK-23323?
>>>
>>>
>>>
>>> Any pointers will be greatly appreciated….
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Jayesh
>>>
>>>
>>>
>>> *From: *"Thakrar, Jayesh" 
>>> *Date: *Monday, April 23, 2018 at 9:49 PM
>>> *To: *"dev@spark.apache.org" 
>>> *Subject: *Datasource API V2 and checkpointing
>>>
>>>
>>>
>>> I was wondering when checkpointing is enabled, who does the actual work?
>>>
>>> The streaming datasource or the execution engine/driver?
>>>
>>>
>>>
>>> I have written a small/trivial datasource that just generates strings.
>>>
>>> After enabling checkpointing, I do see a folder being created under the
>>> checkpoint folder, but there's nothing else in there.
>>>
>>>
>>>
>>> Same question for write-ahead and recovery?
>>>
>>> And on a restart from a failed streaming session - who should set the
>>> offsets?
>>>
>>> The driver/Spark or the datasource?
>>>
>>>
>>>
>>> Any pointers to design docs would also be greatly appreciated.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Jayesh
>>>
>>>
>>>
>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Datasource API V2 and checkpointing

2018-04-30 Thread Ryan Blue
Why don't we just have the source return a Serializable of state when it
reports offsets? Then Spark could handle storing the source's state and the
source wouldn't need to worry about file system paths. I think that would
be easier for implementations and better for recovery because it wouldn't
leave unknown state on a single machine's file system.

rb

On Fri, Apr 27, 2018 at 9:23 AM, Joseph Torres  wrote:

> The precise interactions with the DataSourceV2 API haven't yet been
> hammered out in design. But much of this comes down to the core of
> Structured Streaming rather than the API details.
>
> The execution engine handles checkpointing and recovery. It asks the
> streaming data source for offsets, and then determines that batch N
> contains the data between offset A and offset B. On recovery, if batch N
> needs to be re-run, the execution engine just asks the source for the same
> offset range again. Sources also get a handle to their own subfolder of the
> checkpoint, which they can use as scratch space if they need. For example,
> Spark's FileStreamReader keeps a log of all the files it's seen, so its
> offsets can be simply indices into the log rather than huge strings
> containing all the paths.
>
> SPARK-23323 is orthogonal. That commit coordinator is responsible for
> ensuring that, within a single Spark job, two different tasks can't commit
> the same partition.
>
> On Fri, Apr 27, 2018 at 8:53 AM, Thakrar, Jayesh <
> jthak...@conversantmedia.com> wrote:
>
>> Wondering if this issue is related to SPARK-23323?
>>
>>
>>
>> Any pointers will be greatly appreciated….
>>
>>
>>
>> Thanks,
>>
>> Jayesh
>>
>>
>>
>> *From: *"Thakrar, Jayesh" 
>> *Date: *Monday, April 23, 2018 at 9:49 PM
>> *To: *"dev@spark.apache.org" 
>> *Subject: *Datasource API V2 and checkpointing
>>
>>
>>
>> I was wondering when checkpointing is enabled, who does the actual work?
>>
>> The streaming datasource or the execution engine/driver?
>>
>>
>>
>> I have written a small/trivial datasource that just generates strings.
>>
>> After enabling checkpointing, I do see a folder being created under the
>> checkpoint folder, but there's nothing else in there.
>>
>>
>>
>> Same question for write-ahead and recovery?
>>
>> And on a restart from a failed streaming session - who should set the
>> offsets?
>>
>> The driver/Spark or the datasource?
>>
>>
>>
>> Any pointers to design docs would also be greatly appreciated.
>>
>>
>>
>> Thanks,
>>
>> Jayesh
>>
>>
>>
>
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: [Kubernetes] structured-streaming driver restarts / roadmap

2018-04-30 Thread Oz Ben Ami
This would be useful to us, so I've created a JIRA ticket for this
discussion: https://issues.apache.org/jira/browse/SPARK-24122

On Wed, Mar 28, 2018 at 10:28 AM, Anirudh Ramanathan <
ramanath...@google.com.invalid> wrote:

> We discussed this early on in our fork and I think we should have this in
> a JIRA and discuss it further. It's something we want to address in the
> future.
>
> One proposed method is using a StatefulSet of size 1 for the driver. This
> ensures recovery but at the same time takes away from the completion
> semantics of a single pod.
>
>
> See history in https://github.com/apache-spark-on-k8s/spark/issues/288
>
> On Wed, Mar 28, 2018, 6:56 AM Lucas Kacher  wrote:
>
>> A carry-over from the apache-spark-on-k8s project, it would be useful to
>> have a configurable restart policy for submitted jobs with the Kubernetes
>> resource manager. See the following issues:
>>
>> https://github.com/apache-spark-on-k8s/spark/issues/133
>> https://github.com/apache-spark-on-k8s/spark/issues/288
>> https://github.com/apache-spark-on-k8s/spark/issues/546
>>
>> Use case: I have a structured streaming job that reads from Kafka,
>> aggregates, and writes back out to Kafka deployed via k8s and checkpointing
>> to a remote location. If the driver pod dies for any number of reasons,
>> it will not restart.
>>
>> For us, as all data is stored via checkpoint and we are satisfied with
>> at-least-once semantics, it would be useful if the driver were to come back
>> on its own and pick back up.
>>
>> Firstly, may we add this to JIRA? Secondly, is there any insight as to
>> what the thought is around allowing that to be configurable in the future?
>> If it's not something that will happen natively, we will end up needing to
>> write something that polls or listens to k8s and has the ability to
>> re-submit any failed jobs.
>>
>> Thanks!
>>
>> --
>>
>> *Lucas Kacher*, Senior Engineer
>> -
>> vsco.co 
>> New York, NY
>> 818.512.5239
>>
>