I'd like to bump this again: only one of my 6 pull requests has been merged (5 remain), and the others have not received any review from committers beyond code style comments.

https://github.com/apache/spark/pulls/HeartSaVioR

All of the pull requests are related to Structured Streaming, and most of them have already been reviewed by a couple of contributors. They have been open for at least 17 days, and the oldest for more than 2 months.

I'm not pushing committers to merge them into 2.4, but I hope to see some reaction or review so that I can address the feedback and get them ready to merge.

- Jungtaek Lim (HeartSaVioR)

On Mon, Jul 16, 2018 at 1:04 PM, Jungtaek Lim <kabh...@gmail.com> wrote:

> Bump. I got a couple of review comments from contributors, including a
> soft LGTM, but I still haven't received any (non code style) review from
> committers, so technically there has been no progress toward merging.
>
> I'm planning to work on adding a new feature as well, but it's not easy
> to concentrate on it while also maintaining 6 existing pull requests.
> Merge conflicts make maintenance harder, especially as other pull
> requests (submitted later than mine) are getting reviewed and merged.
>
> I'd like to ask any committer working on Structured Streaming to take a
> look at these pull requests.
>
> - Jungtaek Lim (HeartSaVioR)
>
> On Thu, Jul 12, 2018 at 10:41 PM, Jungtaek Lim <kabh...@gmail.com> wrote:
>
>> I recently added more test results to SPARK-24763 [1], which show that
>> the proposal reduces state size according to the ratio of key size to
>> value size, with no performance hit and sometimes even a slight boost.
>>
>> Please refer to the latest comment in the JIRA issue [2] for the numbers
>> from the perf. tests.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> 1. https://issues.apache.org/jira/browse/SPARK-24763
>> 2. https://issues.apache.org/jira/browse/SPARK-24763?focusedCommentId=16541367&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16541367
>>
>> On Mon, Jul 9, 2018 at 5:28 PM, Jungtaek Lim <kabh...@gmail.com> wrote:
>>
>>> Now I'm adding one more issue (SPARK-24763 [1]), which proposes a new
>>> option to enable optimization of state size in streaming aggregation
>>> without hurting performance.
>>>
>>> The idea is to remove the key fields from the value, since they are
>>> duplicated between the key and the value in a state row. This requires
>>> additional operations such as projection and join, but the smaller
>>> state row also gives a performance benefit, so the two can offset each
>>> other.
>>>
>>> Please refer to the comment in the JIRA issue [2] for the numbers from
>>> a simple perf. test.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> 1. https://issues.apache.org/jira/browse/SPARK-24763
>>>
>>> On Fri, Jul 6, 2018 at 1:54 PM, Jungtaek Lim <kabh...@gmail.com> wrote:
>>>
>>>> Ted Yu suggested posting the improved numbers to this thread, and I
>>>> think it's a good idea, so I'm posting them here as well. I also think
>>>> explaining the rationale behind my issues would help in understanding
>>>> why I'm submitting a couple of patches, so I'll explain that first.
>>>> (Sorry for the wall of text.)
>>>>
>>>> tl;dr: SPARK-24717 [1] can reduce the overall memory usage of the HDFS
>>>> state store provider from 10x~80x the size of the state for a batch
>>>> (depending on the stateful workload) to around 2x or less. The new
>>>> option is flexible, so usage can go down to around 1x, or the cache
>>>> can even be effectively disabled. Please refer to the comment in the
>>>> PR [2] for more details. (It is hard to post detailed numbers in mail
>>>> format, so I'm linking a GitHub comment instead.)
>>>>
>>>> I'm interested in stateful stream processing on Structured Streaming,
>>>> and have been learning from the codebase as well as analyzing memory
>>>> usage and latency (though I admit it is hard to measure latency
>>>> correctly...).
>>>>
>>>> https://community.hortonworks.com/content/kbentry/199257/memory-usage-of-state-in-structured-streaming.html
>>>>
>>>> While looking at HDFSBackedStateStoreProvider I noticed a kind of
>>>> excessive caching. As I described in the section "The impact of
>>>> storing multiple versions from HDFSBackedStateStoreProvider" in the
>>>> link above, multiple versions share the same UnsafeRow unless the
>>>> value changes, which lessens the impact of caching multiple versions
>>>> (credit to Jose Torres, since I realized this from his comment). But
>>>> in workloads where a batch incurs lots of writes to state, the overall
>>>> memory usage of state can far exceed expectations.
>>>>
>>>> A related patch [3] has also been submitted by another contributor (so
>>>> I'm not the only one to notice this behavior), though that patch might
>>>> not be general enough to apply.
>>>>
>>>> First, I decided to track the overall memory size of the state
>>>> provider cache and expose it in the UI as well as the query status
>>>> (SPARK-24441 [4]). The metric looked critical and worth monitoring, so
>>>> I thought it better to also expose it (and the watermark) to
>>>> Dropwizard (SPARK-24637 [5]).
>>>>
>>>> Building on SPARK-24441, I found a more flexible way to resolve the
>>>> issue (SPARK-24717), as mentioned in the tl;dr.
>>>>
>>>> So 3 of the 5 issues so far are coupled, in order to track and resolve
>>>> one problem. I hope this helps explain why these patches are worth
>>>> reviewing.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> 1. https://issues.apache.org/jira/browse/SPARK-24717
>>>> 2. https://github.com/apache/spark/pull/21700#issuecomment-402902576
>>>> 3. https://github.com/apache/spark/pull/21500
>>>> 4. https://issues.apache.org/jira/browse/SPARK-24441
>>>> 5. https://issues.apache.org/jira/browse/SPARK-24637
>>>>
>>>> P.S. Before all of the issues mentioned above, I also submitted some
>>>> other issues regarding feature addition/refactoring (2 of the 5
>>>> issues).
>>>>
>>>> On Fri, Jul 6, 2018 at 10:09 AM, Jungtaek Lim <kabh...@gmail.com> wrote:
>>>>
>>>>> Bump. I have been having a hard time making additional PRs since
>>>>> some of them rely on unmerged PRs, so I'm spending additional time
>>>>> decoupling them where possible.
>>>>>
>>>>> https://github.com/apache/spark/pulls/HeartSaVioR
>>>>>
>>>>> There are 5 pending PRs so far, and I may add more sooner or later.
>>>>>
>>>>> Thanks,
>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>
>>>>> On Sun, Jul 1, 2018 at 6:21 AM, Jungtaek Lim <kabh...@gmail.com> wrote:
>>>>>
>>>>>> A kind reminder, since around 2 weeks have passed. I've added more
>>>>>> PRs during those 2 weeks and am planning to add even more.
>>>>>>
>>>>>> On Tue, Jun 19, 2018 at 6:34 PM, Jungtaek Lim <kabh...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Spark devs,
>>>>>>>
>>>>>>> I have a couple of pull requests for Structured Streaming which
>>>>>>> are getting older and dropping off the first pages of the PR list.
>>>>>>>
>>>>>>> https://github.com/apache/spark/pull/21469
>>>>>>> https://github.com/apache/spark/pull/21357
>>>>>>> https://github.com/apache/spark/pull/21222
>>>>>>>
>>>>>>> Two of them have a kind of approval from a couple of folks, but no
>>>>>>> approval from committers yet.
>>>>>>> One of them needs a rebase, and I would be happy to do it after or
>>>>>>> during the review.
>>>>>>>
>>>>>>> To be honest, getting reviewed in a timely manner is critical for
>>>>>>> contributors, so I'd like to ask the dev mailing list to review my
>>>>>>> PRs.
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>> Jungtaek Lim (HeartSaVioR)