Re: get method guid prefix for file parts for write

2020-09-25 Thread gpongracz
What Nick said was correct.

I should also state that I am using the Python variant of Spark in this
case, not the Scala one.

I am looking to use the GUID prefix of the part-0 file to prevent a race
condition by using an S3 waiter for the part to appear, but to achieve this,
I need to know the GUID value in advance.
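
For illustration, here is a minimal boto3 sketch of the waiter approach I am
describing; the bucket name, output prefix, and especially the UUID portion of
the part file name are placeholders, since knowing that UUID in advance is
exactly what I am missing:

    import boto3

    s3 = boto3.client("s3")

    bucket = "my-bucket"  # hypothetical bucket
    # Hypothetical key: the partition prefix, then the job UUID we would need
    # to know in advance, then the usual format/compression suffix.
    part_key = ("output/part-00000-"
                "123e4567-e89b-12d3-a456-426614174000"
                "-c000.snappy.parquet")

    # Block until the part file is visible in S3 (or the waiter times out).
    waiter = s3.get_waiter("object_exists")
    waiter.wait(Bucket=bucket, Key=part_key)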

Thank you all again for your help.

Regards,

George






Re: get method guid prefix for file parts for write

2020-09-25 Thread gpongracz
I should add that I tried using a waiter on the _SUCCESS file, but that did
not prove successful: because of its small size compared to the part-0 file,
it seems to appear in S3 before the part-0 file, even though it is written
afterwards.






Re: What's the root cause of not supporting multiple aggregations in structured streaming?

2020-09-25 Thread Jungtaek Lim
Thanks Etienne! Yeah, I forgot to say it was nice talking with you again. And
sorry I forgot to send this reply (it was sitting in drafts).

Regarding investment in SS, well, unfortunately I don't know - I'm just one
individual. There are probably various reasons, "priority" most likely among
them. There's not much I can change.

I agree the workaround is sub-optimal, but unless I see sufficient support in
the community I probably can't push it forward. I'll just say there's an
elephant in the room: as the project has been going for more than 10 years,
backward compatibility is a top-priority concern, even across major versions
for features/APIs. That is great for end users, who can migrate between
versions easily, but it also blocks devs from fixing bad designs once they
ship. I'm the one complaining about these issues on the dev list, and I don't
see willingness to correct them.


On Fri, Sep 4, 2020 at 5:55 PM Etienne Chauchot 
wrote:

> Hi Jungtaek Lim,
>
> Nice to hear from you again since the last time we talked :) and congrats on
> becoming a Spark committer in the meantime! (If I'm not mistaken, you were
> not one at the time.)
>
> I totally agree with what you're saying about merging structural parts of
> Spark without having broader consensus. What I don't understand is why there
> is not more investment in SS, especially since in another thread the
> community is discussing deprecating the regular DStream streaming framework.
>
> Is the orientation of Spark now mostly batch?
>
> PS: yeah, I saw your update on the doc when I took a look at the 3.0 preview 2,
> searching for this particular feature. And regarding the workaround, I'm not
> sure it meets my needs, as it will add delays and may also interfere with
> watermarks.
>
> Best
>
> Etienne Chauchot
>
>
> On 04/09/2020 08:06, Jungtaek Lim wrote:
>
> Unfortunately I don't see enough active committers working on Structured
> Streaming; I don't expect major features/improvements to land in this
> situation.
>
> Technically I can review and merge a PR for major improvements in SS, but
> that depends on how much the proposal changes. If the proposal brings
> conceptual change, review by a single committer still wouldn't be enough.
>
> So that's not because we think it's worthless. (That might be only me
> though.) I'd understand it as there not being much investment in SS. There
> is also a known workaround for multiple aggregations (I've documented it in
> the SS guide, in the "Limitation of global watermark" section), though I
> totally agree the workaround is bad.
>
> On Tue, Sep 1, 2020 at 12:28 AM Etienne Chauchot 
> wrote:
>
>> Hi all,
>>
>> I'm also very interested in this feature, but the PR has been open since
>> January 2019 and has not been updated. It raised a design discussion around
>> watermarks, and a design doc was written (
>> https://docs.google.com/document/d/1IAH9UQJPUiUCLd7H6dazRK2k1szDX38SnM6GVNZYvUo/edit#heading=h.npkueh4bbkz1).
>> We also commented on this design, but no matter what, the subject seems to
>> remain stale.
>>
>> Is there any interest in the community in delivering this feature, or is it
>> considered worthless? If the latter, can you explain why?
>>
>> Best
>>
>> Etienne
>> On 22/05/2019 03:38, 张万新 wrote:
>>
>> Thanks, I'll check it out.
>>
>> On Tue, 21 May 2019 at 01:31, Arun Mahadevan wrote:
>>
>>> Here's the proposal for supporting it in "append" mode -
>>> https://github.com/apache/spark/pull/23576. You could see if it addresses
>>> your requirement and post your feedback on the PR.
>>> For "update" mode it's going to be much harder to support this without
>>> first adding support for "retractions"; otherwise we would end up with
>>> wrong results.
>>>
>>> - Arun
>>>
>>>
>>> On Mon, 20 May 2019 at 01:34, Gabor Somogyi 
>>> wrote:
>>>
 There is a PR for this, but it has not yet been merged.

 On Mon, May 20, 2019 at 10:13 AM 张万新  wrote:

> Hi there,
>
> I'd like to know the root reason why multiple aggregations on a streaming
> dataframe are not allowed, since it's a very useful feature and Flink has
> supported it for a long time.
>
> Thanks.
>



Re: get method guid prefix for file parts for write

2020-09-25 Thread Nicholas Chammas
I think what George is looking for is a way to determine ahead of time the
partition IDs that Spark will use when writing output.

George,

I believe this is an example of what you're looking for:
https://github.com/databricks/spark-redshift/blob/184b4428c1505dff7b4365963dc344197a92baa9/src/main/scala/com/databricks/spark/redshift/RedshiftWriter.scala#L240-L257

Specifically, the part that says "TaskContext.get.partitionId()".

I don't know how much of that is part of Spark's public API, but there it
is.
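
For reference, here is a rough PySpark sketch of that pattern (George
mentioned using the Python API); the DataFrame and column names are made up,
and this only shows reading the partition ID from inside a task - it does not
predict the full file names Spark's committer will produce:

    from pyspark import TaskContext
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(100).repartition(4)  # toy DataFrame

    def tag_with_partition_id(rows):
        # TaskContext.get() is only valid inside a task, i.e. while this
        # function runs on an executor.
        pid = TaskContext.get().partitionId()
        for row in rows:
            yield (pid, row.id)

    tagged = df.rdd.mapPartitions(tag_with_partition_id) \
                   .toDF(["partition_id", "id"])
    tagged.show()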

It would be useful if Spark offered a way to get a manifest of output files
for any given write operation, similar to Redshift's MANIFEST option. This
would help when, for example, you need to pass a list of files output by
Spark to some other system (like Redshift) and don't want to have to worry
about the consistency guarantees of your object store's list operations.

Nick

On Fri, Sep 25, 2020 at 2:00 PM EveLiao  wrote:

> If I understand your problem correctly, the prefix you provided is actually
> "-" + UUID. You can generate one with a UUID generator such as
> https://docs.python.org/3/library/uuid.html#uuid.uuid4.


Re: get method guid prefix for file parts for write

2020-09-25 Thread EveLiao
If I understand your problem correctly, the prefix you provided is actually
"-" + UUID. You can generate one with a UUID generator such as
https://docs.python.org/3/library/uuid.html#uuid.uuid4.
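
For example (purely to illustrate the generator being referenced - this
produces a random UUID, not the specific one Spark will embed in your part
file names):

    import uuid

    # uuid4() returns a random (version 4) UUID.
    print("-" + str(uuid.uuid4()))
    # e.g. -1b9d6bcd-bbfd-4b2d-9b5d-2b0d7b3dcb6d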


