Re: Asking for reviewing PRs regarding structured streaming

Jungtaek Lim Wed, 25 Jul 2018 23:54:01 -0700

I'd like to bump this again, since only one of 6 pull requests is merged (5
remaining), and others are not reviewed (non code style) from committers.


https://github.com/apache/spark/pulls/HeartSaVioR

All pull requests are related to Structured Streaming, and most of all are
already reviewed by couple of contributors. They're open for 17 days at
least and more than 2 months at most. I'm not persuading committers to
merge them in 2.4, but hope to see any reactions / reviews so that I can
hopefully reflect and take them forward to be ready to merge.

- Jungtaek Lim (HeartSaVioR)

2018년 7월 16일 (월) 오후 1:04, Jungtaek Lim <[email protected]>님이 작성:

> Bump. I got couple of review comments from contributors including soft
> LGTM, but still haven't got any (non code style) review from committers, so
> technically haven't have any progress to be merged.
>
> I'm planning to work on adding new feature as well, but it's not easy for
> me to concentrate on something with also concerning to maintain 6 existing
> pull requests. Merge conflicts would be matter on maintaining, especially
> other pull requests (submitted later than my pull requests) are getting
> reviewed and merged.
>
> I'd like to ask any structured streaming related committer to take a look
> at pull requests.
>
> - Jungtaek Lim (HeartSaVioR)
>
> 2018년 7월 12일 (목) 오후 10:41, Jungtaek Lim <[email protected]>님이 작성:
>
>> I recently added more test results to SPARK-24763 [1] which shows that
>> the proposal reduces state size according to the ratio of key-value size,
>> whereas there's no performance hit and sometimes even slight boost.
>>
>> Please refer the latest comment in JIRA issue [2] to see the numbers from
>> perf. tests.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> 1. https://issues.apache.org/jira/browse/SPARK-24763
>> 2.
>> https://issues.apache.org/jira/browse/SPARK-24763?focusedCommentId=16541367&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16541367
>>
>>
>> 2018년 7월 9일 (월) 오후 5:28, Jungtaek Lim <[email protected]>님이 작성:
>>
>>> Now I'm adding one more issue (SPARK-24763 [1]), which proposes a new
>>> option to enable optimization of state size in streaming aggregation
>>> without hurting performance.
>>>
>>> The idea is to remove data for key fields from value which is duplicated
>>> between key and value in state row. This requires additional operations
>>> like projection and join, but smaller state row would also give performance
>>> benefit, which can offset each other.
>>>
>>> Please refer the comment in JIRA issue [2] to see the numbers from
>>> simple perf. test.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> 1. https://issues.apache.org/jira/browse/SPARK-24763
>>>
>>>
>>> 2018년 7월 6일 (금) 오후 1:54, Jungtaek Lim <[email protected]>님이 작성:
>>>
>>>> Ted Yu suggested posting the improved numbers to this thread and I
>>>> think it's good idea, so also posting here, but I also think explaining
>>>> rationalization of my issues would help understanding why I'm submitting
>>>> couple of patches, so I'll explain it first. (Sorry to post a wall of 
>>>> text).
>>>>
>>>> tl;dr. SPARK-24717 [1] can reduce the overall memory usage of HDFS
>>>> state store provider from 10x~80x of size of state for a batch according to
>>>> various stateful workloads to less than or around 2x. The new option is
>>>> flexible so it can be even around 1x or even effectively disable cache.
>>>> Please refer the comment in the PR [2] to get more details. (hard to post
>>>> detailed numbers in mail format so link a Github comment instead)
>>>>
>>>> I have interest on stateful streaming processing on Structured
>>>> Streaming, and have been learning from codebase as well as analyzing memory
>>>> usage as well as latency (while I admit it is hard to measure latency
>>>> correctly...).
>>>>
>>>>
>>>> https://community.hortonworks.com/content/kbentry/199257/memory-usage-of-state-in-structured-streaming.html
>>>>
>>>> While took a look at HDFSBackedStateStoreProvider I indicated a kind of
>>>> excessive caching. As I described in section "The impact of storing
>>>> multiple versions from HDFSBackedStateStoreProvider" in above link, while
>>>> multiple versions share the same UnsafeRow unless there's a change on the
>>>> value which lessen the impact of caching multiple versions (credit to Jose
>>>> Torres since I realized it from his comment). But in some workloads which
>>>> lots of writes to state incurs in a batch, the overall memory usage of
>>>> state is going to be out of expectation.
>>>>
>>>> Related patch [3] is also submitted from other contributor (so I'm not
>>>> the one to notice this behavior), whereas the patch might not look enough
>>>> generalized to apply.
>>>>
>>>> First I decided to track the overall memory size of state provider
>>>> cache and expose to UI as well as query status (SPARK-24441 [4]). The
>>>> metric looked like critical and worth to monitor, so I thought it is better
>>>> to expose it (and watermark) to Dropwizard (SPARK-24637 [5]).
>>>>
>>>> Based on adoption of SPARK-24441, I could find more flexible way to
>>>> resolve the issue (SPARK-24717) what I've mentioned in tl;dr.
>>>>
>>>> So 3 of 5 issues are coupled so far to track and resolve one issue.
>>>> Hope that it helps explaining worth of reviews for these patches.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> 1. https://issues.apache.org/jira/browse/SPARK-24717
>>>> 2. https://github.com/apache/spark/pull/21700#issuecomment-402902576
>>>> 3. https://github.com/apache/spark/pull/21500
>>>> 4. https://issues.apache.org/jira/browse/SPARK-24441
>>>> 5. https://issues.apache.org/jira/browse/SPARK-24637
>>>>
>>>> ps. Before all mentioned issues I also submitted some other issues
>>>> regarding feature addition/refactor (2 of 5 issues).
>>>>
>>>>
>>>> 2018년 7월 6일 (금) 오전 10:09, Jungtaek Lim <[email protected]>님이 작성:
>>>>
>>>>> Bump. I have been having hard time working on making additional PRs
>>>>> since some of these rely on non-merged PRs, so spending additional time to
>>>>> decouple these things if possible.
>>>>>
>>>>> https://github.com/apache/spark/pulls/HeartSaVioR
>>>>>
>>>>> Pending 5 PRs so far, and may add more sooner or later.
>>>>>
>>>>> Thanks,
>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>
>>>>> 2018년 7월 1일 (일) 오전 6:21, Jungtaek Lim <[email protected]>님이 작성:
>>>>>
>>>>>> Kindly reminder since around 2 weeks passed. I've added more PR
>>>>>> during 2 weeks and even planning to do more.
>>>>>>
>>>>>> 2018년 6월 19일 (화) 오후 6:34, Jungtaek Lim <[email protected]>님이 작성:
>>>>>>
>>>>>>> Hi Spark devs,
>>>>>>>
>>>>>>> I have couple of pull requests for structured streaming which are
>>>>>>> getting older and fading out from earlier pages in PR pages.
>>>>>>>
>>>>>>> https://github.com/apache/spark/pull/21469
>>>>>>> https://github.com/apache/spark/pull/21357
>>>>>>> https://github.com/apache/spark/pull/21222
>>>>>>>
>>>>>>> Two of them are in a kind of approval by couple of folks, but no
>>>>>>> approval from committers yet.
>>>>>>> One of them needs rebase and I would be happy to do it after
>>>>>>> reviewing or in progress of reviewing.
>>>>>>>
>>>>>>> Getting reviewed in time would be critical for contributors to be
>>>>>>> honest, so I'd like to ask dev mailing list to review my PRs.
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>
>>>>>>

Re: Asking for reviewing PRs regarding structured streaming

Reply via email to