Re: Benchmarks for Many-to-Many Joins

2021-04-21 Thread waltercai
Hi Dhruv, One option is the Join Order Benchmark; it has become very popular in DB research over the past couple of years and features many-to-many joins. Another option is crafting many-to-many queries from graph datasets like social media or travel

modifying spark's optimizer for research

2021-04-21 Thread Walter Cai
Hi, I'm Walter, a PhD student at the University of Washington. My goal is to implement a prototype modification to Spark's optimizer to demonstrate and experiment with some of my PhD work. I was hoping to set up a chat with somebody who is familiar with Catalyst and knows the best place to start modifying it. Thanks,

Re: mvn auto-downloading on fresh clone

2021-04-21 Thread Sean Owen
I agree, it looks like the automatic redirector has changed behavior. It still sends you to an HTML page for the mirror, but previously that link would cause it to redirect straight to the download. While the script can fall back to archive.apache.org, it doesn't because the HTML downloads

mvn auto-downloading on fresh clone

2021-04-21 Thread Bruce Robbins
Is it just me, or does the auto-download of Maven on a fresh Spark clone no longer work? It looks like https://www.apache.org/dyn/closer.lua?action=download= is not functioning anymore (or for the moment) for any piece of Apache software. I noted this in

Benchmarks for Many-to-Many Joins

2021-04-21 Thread Dhruv Kumar
Hi, I wanted to ask if anyone knows of any datasets or benchmarks that I can use for evaluating many-to-many joins (as depicted in the attached snapshot). I looked at the TPC-H and TPC-DS benchmarks but, surprisingly, they mostly have one-to-many

Re: [DISCUSS] Add error IDs

2021-04-21 Thread Wenchen Fan
I think severity makes sense for logs, but not sure about errors. +1 to the proposal to improve the error message further.

On Fri, Apr 16, 2021 at 6:01 PM Yuming Wang wrote:
> +1 for this proposal.
>
> On Fri, Apr 16, 2021 at 5:15 AM Karen wrote:
>> We could leave space in the numbering