Both Terry and I will be at the upcoming Hyperspace talk at Spark+AI Europe Summit 2020<https://databricks.com/dataaisummit/europe-2020/agenda> (in less than 7 hrs @ 3:35 AM PST!). Please say hi if you happen to drop by and/or ask us anything! 😊
Thank you!
Rahul Potharaju

From: Terry Kim <yumin...@gmail.com>
Sent: Tuesday, November 17, 2020 4:46 PM
To: User <user@spark.apache.org>
Subject: [EXTERNAL] Announcing Hyperspace v0.3.0 - an indexing subsystem for Apache Spark™

Hi,

We are happy to announce that Hyperspace v0.3.0 - an indexing subsystem for Apache Spark™ - has just been released<https://github.com/microsoft/hyperspace/releases/tag/v0.3.0>!

Here are some of the highlights:

* Mutable dataset support: Hyperspace v0.3.0 supports mutable datasets, where users can append or delete source data.
* Hybrid scan: Prior to v0.3.0, any change to the original dataset content required a full refresh to make the index usable again, which could be a costly operation. With Hybrid scan, the existing index can be used along with newly appended and/or deleted source files, without an explicit refresh operation. Please check out the Hybrid Scan doc<https://microsoft.github.io/hyperspace/docs/ug-mutable-dataset/#hybrid-scan> for more detail.
* Incremental refresh: v0.3.0 introduces an "incremental" mode to refresh indexes. In this mode, index files are created only for the newly appended source files; deleted source files are also handled by removing them from the existing index files. Please check out the Incremental Refresh doc<https://microsoft.github.io/hyperspace/docs/ug-mutable-dataset/#refresh-index---incremental-mode> for more detail.
* Optimize index: The number of index files can increase due to incremental refreshes, possibly degrading performance. The new "optimizeIndex" API optimizes existing indexes by merging index files to create an optimal number of files. Please check out the Optimize Index doc<https://microsoft.github.io/hyperspace/docs/ug-optimize-index/> for more detail.

We would like to thank the community for the great feedback and all those who contributed to this release.

Thanks,
Terry Kim on behalf of the Hyperspace team