Thanks a lot, Holden. @Liang-Chi Hsieh did you try to run https://gist.github.com/geoHeil/6a23d18ccec085d486165089f9f430f2 for me that is crashing in either line 51 or 58. Holden described the problem pretty well. Ist it clear for you now?
Cheers, Georg Holden Karau [via Apache Spark Developers List] < ml-node+s1001551n20516...@n3.nabble.com> schrieb am Mo., 9. Jan. 2017 um 06:40 Uhr: > Hi Georg, > > Thanks for the question along with the code (as well as posting to stack > overflow). In general if a question is well suited for stackoverflow its > probably better suited to the user@ list instead of the dev@ list so I've > cc'd the user@ list for you. > > As far as handling empty partitions when working mapPartitions (and > similar), the general approach is to return an empty iterator of the > correct type when you have an empty input iterator. > > It looks like your code is doing this, however it seems like you likely > have a bug in your application logic (namely it assumes that if a partition > has a record missing a value it will either have had a previous row in the > same partition which is good OR that the previous partition is not empty > and has a good row - which need not necessarily be the case). You've > partially fixed this problem by going through and for each partition > collecting the last previous good value, and then if you don't have a good > value at the start of a partition look up the value in the collected array. > > However, if this also happens at the same time the previous partition is > empty, you will need to go and lookup the previous previous partition value > until you find the one you are looking for. (Note this assumes that the > first record in your dataset is valid, if it isn't your code will still > fail). > > Your solution is really close to working but just has some minor > assumptions which don't always necessarily hold. > > Cheers, > > Holden :) > On Sun, Jan 8, 2017 at 8:30 PM, Liang-Chi Hsieh <[hidden email] > <http:///user/SendEmail.jtp?type=node&node=20516&i=0>> wrote: > > > Hi Georg, > > Can you describe your question more clear? > > Actually, the example codes you posted in stackoverflow doesn't crash as > you > said in the post. > > > geoHeil wrote > > I am working on building a custom ML pipeline-model / estimator to impute > > missing values, e.g. I want to fill with last good known value. > > Using a window function is slow / will put the data into a single > > partition. > > I built some sample code to use the RDD API however, it some None / null > > problems with empty partitions. > > > > How should this be implemented properly to handle such empty partitions? > > > http://stackoverflow.com/questions/41474175/spark-mappartitionswithindex-handling-empty-partitions > > > > Kind regards, > > Georg > > > > > > ----- > > > Liang-Chi Hsieh | @viirya > Spark Technology Center > http://www.spark.tc/ > > -- > View this message in context: > http://apache-spark-developers-list.1001551.n3.nabble.com/handling-of-empty-partitions-tp20496p20515.html > > Sent from the Apache Spark Developers List mailing list archive at > Nabble.com. > > --------------------------------------------------------------------- > > To unsubscribe e-mail: [hidden email] > <http:///user/SendEmail.jtp?type=node&node=20516&i=1> > > > > > -- > Cell : 425-233-8271 <(425)%20233-8271> > Twitter: https://twitter.com/holdenkarau > If you reply to this email, your message will be added to the discussion > below: > > http://apache-spark-developers-list.1001551.n3.nabble.com/handling-of-empty-partitions-tp20496p20516.html > To unsubscribe from handling of empty partitions, click here > <http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=20496&code=Z2Vvcmcua2YuaGVpbGVyQGdtYWlsLmNvbXwyMDQ5NnwtMTgzMzc4NTU4MQ==> > . > NAML > <http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml> > -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/handling-of-empty-partitions-tp20496p20518.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.