UnresolvedException: Invalid call to dataType on unresolved object

2018-05-01 Thread 880f0464
Hi Everyone,

I wonder if someone could be so kind as to shed some light on this problem:

[UnresolvedException: Invalid call to dataType on unresolved object when using 
DataSet constructed from Seq.empty (since Spark 
2.3.0)](https://stackoverflow.com/q/49757487)

Cheers,
A.

Sent with [ProtonMail](https://protonmail.com) Secure Email.

spark.python.worker.reuse not working as expected

2018-05-01 Thread 880f0464
Hi Everyone,

I wonder if someone could be so kind as to shed some light on this problem:

[spark.python.worker.reuse not working as 
expected](https://stackoverflow.com/q/50043684)

Cheers,
A.

Sent with [ProtonMail](https://protonmail.com) Secure Email.

PySpark.sql.filter not performing as it should

2018-05-01 Thread 880f0464
Hi Everyone,

I wonder if someone could be so kind as to shed some light on this problem:

[PySpark.sql.filter not performing as it 
should](https://stackoverflow.com/q/49995538)

Cheers,
A.

Sent with [ProtonMail](https://protonmail.com) Secure Email.

org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-01 Thread Pralabh Kumar
Hi

I am getting the above error in Spark SQL. I have increased the number of
partitions (to 5000) but am still getting the same error.

My data is most probably skewed.



org.apache.spark.shuffle.FetchFailedException: Too large frame: 4247124829
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:419)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:349)
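
For context, the reported frame size (about 4.2 GB) is above the roughly 2 GB maximum that a single shuffle fetch frame can carry, which usually means one shuffle block (often one hot key) is far larger than the rest. A minimal sketch, with assumed names and values, of how the shuffle partition count mentioned above is typically raised for a Spark SQL job:

```scala
// Sketch only (assumed app name; the value 5000 is the one tried above).
// spark.sql.shuffle.partitions controls how many post-shuffle partitions
// Spark SQL uses for joins and aggregations.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-partitions-example")
  .config("spark.sql.shuffle.partitions", "5000")
  .getOrCreate()

// With heavy skew, all rows for a hot key still land in one partition, so a
// higher partition count alone may not shrink the largest shuffle block.
```

With heavy skew this alone rarely helps, which is what the reply further down in this digest addresses.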


ApacheCon North America 2018 schedule is now live.

2018-05-01 Thread Rich Bowen

Dear Apache Enthusiast,

We are pleased to announce our schedule for ApacheCon North America 
2018. ApacheCon will be held September 23-27 at the Montreal Marriott 
Chateau Champlain in Montreal, Canada.


Registration is open! The early bird rate of $575 lasts until July 21, 
at which time it goes up to $800. And the room block at the Marriott 
($225 CAD per night, including wifi) closes on August 24th.


We will be featuring more than 100 sessions on Apache projects. The 
schedule is now online at https://apachecon.com/acna18/


The schedule includes full tracks of content from Cloudstack[1], 
Tomcat[2], and our GeoSpatial community[3].


We will have 4 keynote speakers, two of whom are Apache members, and two 
from the wider community.


On Tuesday, Apache member and former board member Cliff Schmidt will be 
speaking about how Amplio uses technology to educate and improve the 
quality of life of people living in very difficult parts of the 
world[4]. And Apache Fineract VP Myrle Krantz will speak about how Open 
Source banking is helping the global fight against poverty[5].


Then, on Wednesday, we’ll hear from Bridget Kromhout, Principal Cloud 
Developer Advocate from Microsoft, about the really hard problem in 
software - the people[6]. And Euan McLeod, VP VIPER at Comcast will 
show us the many ways that Apache software delivers your favorite shows 
to your living room[7].


ApacheCon will also feature old favorites like the Lightning Talks, the 
Hackathon (running the duration of the event), PGP key signing, and lots 
of hallway-track time to get to know your project community better.


Follow us on Twitter, @ApacheCon, and join the disc...@apachecon.com 
mailing list (send email to discuss-subscr...@apachecon.com) to stay up 
to date with developments. And if your company wants to sponsor this 
event, get in touch at h...@apachecon.com for opportunities that are 
still available.


See you in Montreal!

Rich Bowen
VP Conferences, The Apache Software Foundation
h...@apachecon.com
@ApacheCon

[1] http://cloudstackcollab.org/
[2] http://tomcat.apache.org/conference.html
[3] http://apachecon.dukecon.org/acna/2018/#/schedule?search=geospatial
[4] 
http://apachecon.dukecon.org/acna/2018/#/scheduledEvent/df977fd305a31b903
[5] 
http://apachecon.dukecon.org/acna/2018/#/scheduledEvent/22c6c30412a3828d6
[6] 
http://apachecon.dukecon.org/acna/2018/#/scheduledEvent/fbbb2384fa91ebc6b
[7] 
http://apachecon.dukecon.org/acna/2018/#/scheduledEvent/88d50c3613852c2de





Re: [build system] jenkins master unreachable, build system currently down

2018-05-01 Thread shane knapp
and we're back!  there was apparently a firewall migration yesterday that
went sideways.

shane

On Mon, Apr 30, 2018 at 8:27 PM, shane knapp  wrote:

> we just noticed that we're unable to connect to jenkins, and have reached
> out to our NOC support staff at our colo.  until we hear back, there's
> nothing we can do.
>
> i'll update the list as soon as i hear something.  sorry for the
> inconvenience!
>
> shane
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>



-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] jenkins master unreachable, build system currently down

2018-05-01 Thread Xiao Li
Thank you very much, Shane! Yeah, it works now!

Xiao


2018-05-01 8:40 GMT-07:00 shane knapp :

> and we're back!  there was apparently a firewall migration yesterday that
> went sideways.
>
> shane
>
> On Mon, Apr 30, 2018 at 8:27 PM, shane knapp  wrote:
>
>> we just noticed that we're unable to connect to jenkins, and have reached
>> out to our NOC support staff at our colo.  until we hear back, there's
>> nothing we can do.
>>
>> i'll update the list as soon as i hear something.  sorry for the
>> inconvenience!
>>
>> shane
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-01 Thread Ryan Blue
This is usually caused by skew. Sometimes you can work around it by
increasing the number of partitions like you tried, but when that doesn’t
work you need to change the partitioning that you’re using.

If you’re aggregating, try adding an intermediate aggregation. For example,
if your query is select sum(x), a from t group by a, then try select
sum(partial), a from (select sum(x) as partial, a, b from t group by a, b)
group by a.
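
A sketch of the same two-level rewrite, assuming a SparkSession named `spark` and a registered table `t(x, a, b)` (both placeholders):

```scala
// Two-level aggregation sketch: the inner GROUP BY (a, b) spreads a hot value
// of `a` across many (a, b) groups, so no single reducer has to fetch all of
// its rows; the outer query rolls the partial sums back up to `a`.
val totals = spark.sql("""
  SELECT sum(partial) AS total, a
  FROM (
    SELECT sum(x) AS partial, a, b
    FROM t
    GROUP BY a, b
  ) AS partials
  GROUP BY a
""")
totals.show()
```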

rb
​

On Tue, May 1, 2018 at 4:21 AM, Pralabh Kumar 
wrote:

> Hi
>
> I am getting the above error in Spark SQL . I have increase (using 5000 )
> number of partitions but still getting the same error .
>
> My data most probably is skew.
>
>
>
> org.apache.spark.shuffle.FetchFailedException: Too large frame: 4247124829
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:419)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:349)
>
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: [build system] jenkins master unreachable, build system currently down

2018-05-01 Thread Joseph Bradley
Thank you Shane!!

On Tue, May 1, 2018 at 8:58 AM, Xiao Li  wrote:

> Thank you very much, Shane! Yeah, it works now!
>
> Xiao
>
>
> 2018-05-01 8:40 GMT-07:00 shane knapp :
>
>> and we're back!  there was apparently a firewall migration yesterday that
>> went sideways.
>>
>> shane
>>
>> On Mon, Apr 30, 2018 at 8:27 PM, shane knapp  wrote:
>>
>>> we just noticed that we're unable to connect to jenkins, and have
>>> reached out to our NOC support staff at our colo.  until we hear back,
>>> there's nothing we can do.
>>>
>>> i'll update the list as soon as i hear something.  sorry for the
>>> inconvenience!
>>>
>>> shane
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

http://databricks.com


Re: Datasource API V2 and checkpointing

2018-05-01 Thread Ryan Blue
I think there's a difference. You're right that we wanted to clean up the
API in V2 to avoid file sources using side channels. But there's a big
difference between adding, for example, a way to report partitioning and
designing for sources that need unbounded state. It's a judgment call, but
I think unbounded state is definitely not something that we should design
around. Another way to think about it: yes, we want to design a better API
using existing sources as guides, but we don't need to assume that
everything those sources do should be supported. It is reasonable to say
that this is a case we don't want to design for and the source needs to
change. Why can't we use a high watermark of files' modified timestamps?

For most sources, I think Spark should handle state serialization and
recovery. Maybe we can find a good way to make the file source with
unbounded state work, but this shouldn't be one of the driving cases for
the design and consequently a reason for every source to need to manage its
own state in a checkpoint directory.
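
A minimal, self-contained sketch of the model described here, in which the source exposes its position as a small JSON-serializable offset and the engine persists it. The class and method names are assumptions for illustration, not the actual DataSourceV2 interfaces:

```scala
// Illustrative only: an offset whose entire state fits in a JSON string, so
// the engine can checkpoint and restore it without the source ever touching
// the checkpoint directory.
case class SimpleOffset(lastRecordId: Long) {
  def toJson: String = s"""{"lastRecordId":$lastRecordId}"""
}

object SimpleOffset {
  // A real implementation would use a JSON library; this keeps the sketch
  // dependency-free.
  def fromJson(json: String): SimpleOffset =
    SimpleOffset(json.replaceAll("[^0-9]", "").toLong)
}

// Engine side (conceptually): write offset.toJson into the checkpoint log,
// and on recovery call SimpleOffset.fromJson to hand the source its position.
```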

rb

On Mon, Apr 30, 2018 at 12:37 PM, Joseph Torres <
joseph.tor...@databricks.com> wrote:

> I'd argue that letting bad cases influence the design is an explicit goal
> of DataSourceV2. One of the primary motivations for the project was that
> file sources hook into a series of weird internal side channels, with
> favorable performance characteristics that are difficult to match in the
> API we actually declare to Spark users. So a design that we can't migrate
> file sources to without a side channel would be worrying; won't we end up
> regressing to the same situation?
>
> On Mon, Apr 30, 2018 at 11:59 AM, Ryan Blue  wrote:
>
>> Should we really plan the API for a source with state that grows
>> indefinitely? It sounds like we're letting a bad case influence the
>> design, when we probably shouldn't.
>>
>> On Mon, Apr 30, 2018 at 11:05 AM, Joseph Torres <
>> joseph.tor...@databricks.com> wrote:
>>
>>> Offset is just a type alias for arbitrary JSON-serializable state. Most
>>> implementations should (and do) just toss the blob at Spark and let Spark
>>> handle recovery on its own.
>>>
>>> In the case of file streams, the obstacle is that the conceptual offset
>>> is very large: a list of every file which the stream has ever read. In
>>> order to parse this efficiently, the stream connector needs detailed
>>> control over how it's stored; the current implementation even has complex
>>> compactification and retention logic.
>>>
>>>
>>> On Mon, Apr 30, 2018 at 10:48 AM, Ryan Blue  wrote:
>>>
 Why don't we just have the source return a Serializable of state when
 it reports offsets? Then Spark could handle storing the source's state and
 the source wouldn't need to worry about file system paths. I think that
 would be easier for implementations and better for recovery because it
 wouldn't leave unknown state on a single machine's file system.

 rb

 On Fri, Apr 27, 2018 at 9:23 AM, Joseph Torres <
 joseph.tor...@databricks.com> wrote:

> The precise interactions with the DataSourceV2 API haven't yet been
> hammered out in design. But much of this comes down to the core of
> Structured Streaming rather than the API details.
>
> The execution engine handles checkpointing and recovery. It asks the
> streaming data source for offsets, and then determines that batch N
> contains the data between offset A and offset B. On recovery, if batch N
> needs to be re-run, the execution engine just asks the source for the same
> offset range again. Sources also get a handle to their own subfolder of 
> the
> checkpoint, which they can use as scratch space if they need. For example,
> Spark's FileStreamReader keeps a log of all the files it's seen, so its
> offsets can be simply indices into the log rather than huge strings
> containing all the paths.
>
> SPARK-23323 is orthogonal. That commit coordinator is responsible for
> ensuring that, within a single Spark job, two different tasks can't commit
> the same partition.
>
> On Fri, Apr 27, 2018 at 8:53 AM, Thakrar, Jayesh <
> jthak...@conversantmedia.com> wrote:
>
>> Wondering if this issue is related to SPARK-23323?
>>
>>
>>
>> Any pointers will be greatly appreciated….
>>
>>
>>
>> Thanks,
>>
>> Jayesh
>>
>>
>>
>> *From: *"Thakrar, Jayesh" 
>> *Date: *Monday, April 23, 2018 at 9:49 PM
>> *To: *"dev@spark.apache.org" 
>> *Subject: *Datasource API V2 and checkpointing
>>
>>
>>
>> I was wondering when checkpointing is enabled, who does the actual
>> work?
>>
>> The streaming datasource or the execution engine/driver?
>>
>>
>>
>> I have written a small/trivial datasource that just generates strings.
>>
>> After enabling checkpointing, I do see a folder being created under
>> the checkpo

Re: Datasource API V2 and checkpointing

2018-05-01 Thread Joseph Torres
I agree that Spark should fully handle state serialization and recovery for
most sources. This is how it works in V1, and we definitely wouldn't want
or need to change that in V2.* The question is just whether we should have
an escape hatch for the sources that don't want Spark to do that, and if so
what the escape hatch should look like.

I don't think a watermark checkpoint would work, because there's no
guarantee (especially considering the "maxFilesPerTrigger" option) that all
files with the same timestamp will be in the same batch. But in general,
changing the fundamental mechanics of how file sources take checkpoints
seems like it would impose a serious risk of performance regressions, which
I don't think are a desirable risk when performing an API migration that's
going to swap out users' queries from under them. I would be very
uncomfortable merging a V2 file source which we can't confidently assert
has the same performance characteristics as the existing one.
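
For reference, a sketch of the option mentioned above (placeholder path, assumed SparkSession named `spark`); with at most one file per trigger, two files sharing a modified timestamp can easily fall into different batches:

```scala
// Sketch: a file stream that reads at most one new file per micro-batch, so a
// "modified timestamp" watermark cannot mark a clean batch boundary.
val fileStream = spark.readStream
  .format("text")
  .option("maxFilesPerTrigger", "1")
  .load("/data/in")
```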


* Technically, most current sources do write their initial offset to the
checkpoint directory, but this is just a workaround to the fact that the V1
API has no handle to give Spark the initial offset. So if you e.g. start a
Kafka stream from latest offsets, and it fails in the first batch, Spark
won't know to restart the stream from the initial offset which was
originally generated. That's easily fixable in V2, and then no source will
have to even look at the checkpoint directory if it doesn't want to.

On Tue, May 1, 2018 at 10:26 AM, Ryan Blue  wrote:

> I think there's a difference. You're right that we wanted to clean up the
> API in V2 to avoid file sources using side channels. But there's a big
> difference between adding, for example, a way to report partitioning and
> designing for sources that need unbounded state. It's a judgment call, but
> I think unbounded state is definitely not something that we should design
> around. Another way to think about it: yes, we want to design a better API
> using existing sources as guides, but we don't need to assume that
> everything those sources do should to be supported. It is reasonable to say
> that this is a case we don't want to design for and the source needs to
> change. Why can't we use a high watermark of files' modified timestamps?
>
> For most sources, I think Spark should handle state serialization and
> recovery. Maybe we can find a good way to make the file source with
> unbounded state work, but this shouldn't be one of the driving cases for
> the design and consequently a reason for every source to need to manage its
> own state in a checkpoint directory.
>
> rb
>
> On Mon, Apr 30, 2018 at 12:37 PM, Joseph Torres <
> joseph.tor...@databricks.com> wrote:
>
>> I'd argue that letting bad cases influence the design is an explicit goal
>> of DataSourceV2. One of the primary motivations for the project was that
>> file sources hook into a series of weird internal side channels, with
>> favorable performance characteristics that are difficult to match in the
>> API we actually declare to Spark users. So a design that we can't migrate
>> file sources to without a side channel would be worrying; won't we end up
>> regressing to the same situation?
>>
>> On Mon, Apr 30, 2018 at 11:59 AM, Ryan Blue  wrote:
>>
>>> Should we really plan the API for a source with state that grows
>>> indefinitely? It sounds like we're letting a bad case influence the
>>> design, when we probably shouldn't.
>>>
>>> On Mon, Apr 30, 2018 at 11:05 AM, Joseph Torres <
>>> joseph.tor...@databricks.com> wrote:
>>>
 Offset is just a type alias for arbitrary JSON-serializable state. Most
 implementations should (and do) just toss the blob at Spark and let Spark
 handle recovery on its own.

 In the case of file streams, the obstacle is that the conceptual offset
 is very large: a list of every file which the stream has ever read. In
 order to parse this efficiently, the stream connector needs detailed
 control over how it's stored; the current implementation even has complex
 compactification and retention logic.


 On Mon, Apr 30, 2018 at 10:48 AM, Ryan Blue  wrote:

> Why don't we just have the source return a Serializable of state when
> it reports offsets? Then Spark could handle storing the source's state and
> the source wouldn't need to worry about file system paths. I think that
> would be easier for implementations and better for recovery because it
> wouldn't leave unknown state on a single machine's file system.
>
> rb
>
> On Fri, Apr 27, 2018 at 9:23 AM, Joseph Torres <
> joseph.tor...@databricks.com> wrote:
>
>> The precise interactions with the DataSourceV2 API haven't yet been
>> hammered out in design. But much of this comes down to the core of
>> Structured Streaming rather than the API details.
>>
>> The execution engine handles checkpointing and recovery. It asks the
>> streaming 

Re: Datasource API V2 and checkpointing

2018-05-01 Thread Thakrar, Jayesh
Just wondering-

Given that V2 is currently less performant because of the use of Row vs. InternalRow 
(and other things?), is still evolving, and is missing some of the other 
features of V1, it might help to focus on remediating those features and then 
look at porting the file sources over.

As for the escape hatch (or additional capabilities), can that be implemented 
as traits?
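
A hypothetical sketch of such a trait-based escape hatch; the trait and method names below are invented for illustration and are not part of the DataSourceV2 API:

```scala
// Hypothetical sketch only. By default the engine owns offset serialization;
// a source that truly needs its own on-disk state (e.g. a compacted file log)
// opts in by mixing in a trait that hands it a scratch directory under the
// checkpoint location.
trait SelfCheckpointingSource {
  /** Called by the engine with a directory reserved for this source. */
  def initializeState(checkpointScratchDir: String): Unit

  /** Called on recovery so the source can rebuild from its own state. */
  def recoverState(checkpointScratchDir: String): Unit
}
```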

And IMHO, I think file sources and other core sources should have the same 
citizenship level as is granted to the other sources in V2, so that 
others can use them as good references for emulation.

Jayesh


From: Joseph Torres 
Sent: Tuesday, May 1, 2018 1:58:54 PM
To: Ryan Blue
Cc: Thakrar, Jayesh; dev@spark.apache.org
Subject: Re: Datasource API V2 and checkpointing

I agree that Spark should fully handle state serialization and recovery for 
most sources. This is how it works in V1, and we definitely wouldn't want or 
need to change that in V2.* The question is just whether we should have an 
escape hatch for the sources that don't want Spark to do that, and if so what 
the escape hatch should look like.

I don't think a watermark checkpoint would work, because there's no guarantee 
(especially considering the "maxFilesPerTrigger" option) that all files with 
the same timestamp will be in the same batch. But in general, changing the 
fundamental mechanics of how file sources take checkpoints seems like it would 
impose a serious risk of performance regressions, which I don't think are a 
desirable risk when performing an API migration that's going to swap out users' 
queries from under them. I would be very uncomfortable merging a V2 file source 
which we can't confidently assert has the same performance characteristics as 
the existing one.


* Technically, most current sources do write their initial offset to the 
checkpoint directory, but this is just a workaround to the fact that the V1 API 
has no handle to give Spark the initial offset. So if you e.g. start a Kafka 
stream from latest offsets, and it fails in the first batch, Spark won't know 
to restart the stream from the initial offset which was originally generated. 
That's easily fixable in V2, and then no source will have to even look at the 
checkpoint directory if it doesn't want to.

On Tue, May 1, 2018 at 10:26 AM, Ryan Blue <rb...@netflix.com> wrote:
I think there's a difference. You're right that we wanted to clean up the API 
in V2 to avoid file sources using side channels. But there's a big difference 
between adding, for example, a way to report partitioning and designing for 
sources that need unbounded state. It's a judgment call, but I think unbounded 
state is definitely not something that we should design around. Another way to 
think about it: yes, we want to design a better API using existing sources as 
guides, but we don't need to assume that everything those sources do should 
be supported. It is reasonable to say that this is a case we don't want to 
design for and the source needs to change. Why can't we use a high watermark of 
files' modified timestamps?

For most sources, I think Spark should handle state serialization and recovery. 
Maybe we can find a good way to make the file source with unbounded state work, 
but this shouldn't be one of the driving cases for the design and consequently 
a reason for every source to need to manage its own state in a checkpoint 
directory.

rb

On Mon, Apr 30, 2018 at 12:37 PM, Joseph Torres <joseph.tor...@databricks.com> wrote:
I'd argue that letting bad cases influence the design is an explicit goal of 
DataSourceV2. One of the primary motivations for the project was that file 
sources hook into a series of weird internal side channels, with favorable 
performance characteristics that are difficult to match in the API we actually 
declare to Spark users. So a design that we can't migrate file sources to 
without a side channel would be worrying; won't we end up regressing to the 
same situation?

On Mon, Apr 30, 2018 at 11:59 AM, Ryan Blue <rb...@netflix.com> wrote:
Should we really plan the API for a source with state that grows indefinitely? 
It sounds like we're letting a bad case influence the design, when we probably 
shouldn't.

On Mon, Apr 30, 2018 at 11:05 AM, Joseph Torres <joseph.tor...@databricks.com> wrote:
Offset is just a type alias for arbitrary JSON-serializable state. Most 
implementations should (and do) just toss the blob at Spark and let Spark 
handle recovery on its own.

In the case of file streams, the obstacle is that the conceptual offset is very 
large: a list of every file which the stream has ever read. In order to parse 
this efficiently, the stream connector needs detailed control over how it's 
stored; the current implementation even has complex compactification and 
retention logic.


On Mon, Apr 30, 2018 at 10:48 AM, Ryan Blue <rb...@netflix.com> wrote:
Why don't we just have the source retur

Re: Sorting on a streaming dataframe

2018-05-01 Thread Hemant Bhanawat
Opened an issue. https://issues.apache.org/jira/browse/SPARK-24144

Since it is a major issue for us, I have marked it as Major. Feel
free to change that if it is not the case from Spark's perspective.

On Tue, May 1, 2018 at 4:34 AM, Michael Armbrust 
wrote:

> Please open a JIRA then!
>
> On Fri, Apr 27, 2018 at 3:59 AM Hemant Bhanawat 
> wrote:
>
>> I see.
>>
>> monotonically_increasing_id on streaming dataFrames will be really
>> helpful to me and I believe to many more users. Adding this functionality
>> in Spark would be efficient in terms of performance as compared to
>> implementing this functionality inside the applications.
>>
>> Hemant
>>
>> On Thu, Apr 26, 2018 at 11:59 PM, Michael Armbrust <
>> mich...@databricks.com> wrote:
>>
>>> The basic tenet of structured streaming is that a query should return
>>> the same answer in streaming or batch mode. We support sorting in complete
>>> mode because we have all the data and can sort it correctly and return the
>>> full answer.  In update or append mode, sorting would only return a correct
>>> answer if we could promise that records that sort lower are going to arrive
>>> later (and we can't).  Therefore, it is disallowed.
>>>
>>> If you are just looking for a unique, stable id and you are already
>>> using kafka as the source, you could just combine the partition id and the
>>> offset. The structured streaming connector to Kafka
>>> 
>>> exposes both of these in the schema of the streaming DataFrame. (similarly
>>> for kinesis you can use the shard id and sequence number)
>>>
>>> If you need the IDs to be contiguous, then this is a somewhat
>>> fundamentally hard problem.  I think the best we could do is add support
>>> for monotonically_increasing_id() in streaming dataframes.
>>>
>>> On Tue, Apr 24, 2018 at 1:38 PM, Chayapan Khannabha 
>>> wrote:
>>>
 Perhaps your use case fits to Apache Kafka better.

 More info at:
 https://kafka.apache.org/documentation/streams/

 Everything really comes down to the architecture design and algorithm
 spec. However, from my experience with Spark, there are many good reasons
 why this requirement is not supported ;)

 Best,

 Chayapan (A)


 On Apr 24, 2018, at 2:18 PM, Hemant Bhanawat 
 wrote:

 Thanks Chris. There are many ways in which I can solve this problem but
 they are cumbersome. The easiest way would have been to sort the streaming
 dataframe. The reason I asked this question is because I could not find a
 reason why sorting on streaming dataframe is disallowed.

 Hemant

 On Mon, Apr 16, 2018 at 6:09 PM, Bowden, Chris <
 chris.bow...@microfocus.com> wrote:

> You can happily sort the underlying RDD of InternalRow(s) inside a
> sink, assuming you are willing to implement and maintain your own sink(s).
> That is, just grabbing the parquet sink, etc. isn’t going to work out of
> the box. Alternatively map/flatMapGroupsWithState is probably sufficient
> and requires less working knowledge to make effective reuse of internals.
> Just group by foo and then sort accordingly and assign ids. The id counter
> can be stateful per group. Sometimes this problem may not need to be 
> solved
> at all. For example, if you are using kafka, a proper partitioning scheme
> and message offsets may be “good enough”.
> --
> *From:* Hemant Bhanawat 
> *Sent:* Thursday, April 12, 2018 11:42:59 PM
> *To:* Reynold Xin
> *Cc:* dev
> *Subject:* Re: Sorting on a streaming dataframe
>
> Well, we want to assign snapshot ids (incrementing counters) to the
> incoming records. For that, we are zipping the streaming rdds with that
> counter using a modified version of ZippedWithIndexRDD. We are ok if the
> records in the streaming dataframe gets counters in random order but the
> counter should always be incrementing.
>
> This is working fine until we have a failure. When we have a failure,
> we re-assign the records to snapshot ids  and this time same snapshot id
> can get assigned to a different record. This is a problem because the
> primary key in our storage engine is . So we want to
> sort the dataframe so that the records always get the same snapshot id.
>
>
>
> On Fri, Apr 13, 2018 at 11:43 AM, Reynold Xin 
> wrote:
>
> Can you describe your use case more?
>
> On Thu, Apr 12, 2018 at 11:12 PM Hemant Bhanawat 
> wrote:
>
> Hi Guys,
>
> Why is sorting on streaming dataframes not supported(unless it is
> complete mode)? My downstream needs me to sort the streaming dataframe.
>
> Hemant
>
>
>


>>>
>>
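
A small sketch of the partition-plus-offset id suggested in the quoted exchange above, assuming a SparkSession named `spark` and placeholder Kafka settings; the resulting id is unique and stable across retries, though not contiguous:

```scala
// Sketch: derive a stable, unique id from the Kafka connector's partition and
// offset columns (bootstrap servers and topic below are placeholders).
import org.apache.spark.sql.functions.{col, concat_ws}

val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "events")
  .load()

// The same record keeps the same id even if a batch is re-run after a
// failure, because partition and offset are stable for a given record.
val withStableId = kafkaStream.withColumn(
  "recordId",
  concat_ws("-", col("partition").cast("string"), col("offset").cast("string")))
```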