Re: [discuss][data source v2] remove type parameter in DataReader/WriterFactory

2018-04-18 Thread Wenchen Fan
First of all, I think we all agree that the data source v2 API should at least support InternalRow and ColumnarBatch. With this assumption, the current API has two problems: *First problem*: We use mixin traits to add support for different data formats. The mixin traits define API to return

Re: Possible SPIP to improve matrix and vector column type support

2018-04-18 Thread Leif Walsh
I agree we should reuse as much as possible. For PySpark, I think the obvious choices already made, Breeze and NumPy arrays, make a lot of sense; I’m not sure about the other language bindings and would defer to others. I was under the impression that UDTs were gone and (probably?) not coming

Re: GLM Poisson Model - Deviance calculations

2018-04-18 Thread svattig
Yes, I’m referring to that deviance method. It fails whenever y is 0. I think R's deviance calculation checks whether y is 0 and assigns 1 to y in such cases. The GLM regression summary object has a few deviances, such as null deviance, residual deviance, and deviance. You might want to check

Re: GLM Poisson Model - Deviance calculations

2018-04-18 Thread Joseph PENG
Are you referring to this? override def deviance(y: Double, mu: Double, weight: Double): Double = { 2.0 * weight * (y * math.log(y / mu) - (y - mu)) } I'm not sure how R handles this, but my guess is that it may add a small number, e.g. 0.5, to the numerator and denominator. If you can
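
For illustration only: when y is 0, the term y * log(y / mu) in the method quoted above evaluates to 0 * log(0), which is NaN in floating point. One common remedy is to treat that term as 0, its limit as y approaches 0 from above. The sketch below assumes that convention; `PoissonDevianceSketch` and its local `ylogy` helper are hypothetical names defined here, not Spark's code or a confirmed fix.

```scala
object PoissonDevianceSketch {
  // Local helper (assumption): define y * log(y / mu) as 0 when y == 0,
  // which is the limit of the expression as y approaches 0 from above.
  def ylogy(y: Double, mu: Double): Double =
    if (y == 0.0) 0.0 else y * math.log(y / mu)

  // Poisson unit deviance with the same parameters as the quoted method,
  // but routed through the guarded helper so y == 0 no longer yields NaN.
  def deviance(y: Double, mu: Double, weight: Double): Double =
    2.0 * weight * (ylogy(y, mu) - (y - mu))
}
```

Under this convention, a zero count contributes 2 * weight * mu to the deviance, so `PoissonDevianceSketch.deviance(0.0, 2.0, 1.0)` is 4.0 rather than NaN.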

Re: [discuss][data source v2] remove type parameter in DataReader/WriterFactory

2018-04-18 Thread Joseph Torres
The fundamental difficulty seems to be that there's a spurious "round-trip" in the API. Spark inspects the source to determine what type it's going to provide, picks an appropriate method according to that type, and then calls that method on the source to finally get what it wants. Pushing this

Re: Possible SPIP to improve matrix and vector column type support

2018-04-18 Thread Joseph Bradley
Thanks for the thoughts! We've gone back and forth quite a bit about local linear algebra support in Spark. For reference, there have been some discussions here: https://issues.apache.org/jira/browse/SPARK-6442 https://issues.apache.org/jira/browse/SPARK-16365

Re: Sort-merge join improvement

2018-04-18 Thread Petar Zecevic
As instructed offline, I opened a JIRA for this: https://issues.apache.org/jira/browse/SPARK-24020 I will create a pull request soon. On 4/17/2018 at 6:21 PM, Petar Zecevic wrote: Hello everybody We (at University of Zagreb and University of Washington) have implemented an optimization of

Re: GLM Poisson Model - Deviance calculations

2018-04-18 Thread Sean Owen
GeneralizedLinearRegression.ylogy seems to handle this case; can you be more specific about where the log(0) happens? That's what should be fixed, right? If so, then a JIRA and PR are the right way to proceed. On Wed, Apr 18, 2018 at 2:37 PM svattig wrote: > In

Re: [discuss][data source v2] remove type parameter in DataReader/WriterFactory

2018-04-18 Thread Ryan Blue
Wenchen, can you explain a bit more clearly why this is necessary? The pseudo-code you used doesn’t clearly demonstrate why. Why couldn’t this be handled with inheritance from an abstract Factory class? Why define all of the createXDataReader methods, but make the DataFormat a field in the

GLM Poisson Model - Deviance calculations

2018-04-18 Thread svattig
In Spark 2.3, when a Poisson model is fit (with labelCol having a few counts of 0), the deviance calculations are broken as a result of log(0). I think this is the same as in Spark 2.2, but the new toString method in Spark 2.3's GeneralizedLinearRegressionTrainingSummary class is throwing an error