Re: [discuss][data source v2] remove type parameter in DataReader/WriterFactory

2018-04-18 Thread Wenchen Fan
First of all, I think we all agree that the data source v2 API should at least
support InternalRow and ColumnarBatch. With this assumption, the current
API has two problems:

*First problem*: We use mixin traits to add support for different data
formats.

The mixin traits define APIs to return a DataReader/WriterFactory for
different formats. This brings a lot of trouble to streaming, as streaming
has its own factory interface, which we don't want to extend the batch
factory. This means we need to duplicate the mixin traits for batch and
streaming. Keep in mind that duplicating the traits is also a possible
solution, if there is no better way.

Another possible solution is to remove the mixin traits and put all the
"createFactory" methods in DataSourceReader/Writer, with a new method to
indicate which "createFactory" method Spark should call. Then the API looks
like:

interface DataSourceReader {
  DataFormat dataFormat();

  default List<DataReaderFactory<InternalRow>> createDataReaderFactories() {
    throw new IllegalStateException();
  }

  default List<DataReaderFactory<ColumnarBatch>> createColumnarBatchDataReaderFactories() {
    throw new IllegalStateException();
  }
}

or, to be friendlier to people who don't care about the columnar format:

interface DataSourceReader {
  default DataFormat dataFormat() { return DataFormat.INTERNAL_ROW; }

  List<DataReaderFactory<InternalRow>> createDataReaderFactories();

  default List<DataReaderFactory<ColumnarBatch>> createColumnarBatchDataReaderFactories() {
    throw new IllegalStateException();
  }
}
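
For reference, the Spark-side dispatch this enables could look roughly like the
following (in Scala for brevity; illustration only, not actual planner code; the
helper name, the Either wrapper and the COLUMNAR_BATCH constant are made up):

// Spark checks the declared format first, then calls only the matching
// "createFactory" method, so the other default implementation can safely throw.
def chooseFactories(reader: DataSourceReader): Either[
    java.util.List[DataReaderFactory[InternalRow]],
    java.util.List[DataReaderFactory[ColumnarBatch]]] =
  reader.dataFormat() match {
    case DataFormat.INTERNAL_ROW   => Left(reader.createDataReaderFactories())
    case DataFormat.COLUMNAR_BATCH => Right(reader.createColumnarBatchDataReaderFactories())
  }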

This solution still brings some trouble to streaming, as the streaming
specific DataSourceReader needs to re-define all these "createFactory"
methods, but it's much better than duplicating the mixin traits.

*Second problem*: The DataReader/WriterFactory may have a lot of
constructor parameters, and it's painful to define different factories with the
same, very long parameter list.
After a closer look, I think this is the major source of the duplicated code.
This is not a strong reason on its own, so it's OK if people don't think it's a
problem. In the meantime, I think it might be better to shift the data
format information to the factory, so that we can support hybrid-storage data
sources in the future, as I mentioned before.
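
To illustrate, a rough sketch of a single factory that carries its data format
(all names here are hypothetical, not the real v2 API; it just shows that the
long parameter list is written only once):

// Stand-ins for the real row-based and columnar reader types.
trait RowReader
trait ColumnarBatchReader

class OrcReaderFactory(
    filesToRead: Seq[String],
    requiredColumns: Seq[String],
    // ... the rest of the long, shared parameter list, written only once
    val dataFormat: DataFormat) extends Serializable {

  def createRowReader(): RowReader =
    ??? // only called by Spark when dataFormat == DataFormat.INTERNAL_ROW

  def createColumnarBatchReader(): ColumnarBatchReader =
    ??? // only called by Spark when dataFormat == DataFormat.COLUMNAR_BATCH
}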


Finally, we can also consider Joseph's proposal, to remove the type
parameter entirely and get rid of this problem.



On Thu, Apr 19, 2018 at 8:54 AM, Joseph Torres  wrote:

> The fundamental difficulty seems to be that there's a spurious
> "round-trip" in the API. Spark inspects the source to determine what type
> it's going to provide, picks an appropriate method according to that type,
> and then calls that method on the source to finally get what it wants.
> Pushing this out of the DataSourceReader doesn't eliminate this problem; it
> just shifts it. We still need an InternalRow method and a ColumnarBatch
> method and possibly Row and UnsafeRow methods too.
>
> I'd propose it would be better to just accept a bit less type safety here,
> and push the problem all the way down to the DataReader. Make
> DataReader.get() return Object, and document that the runtime type had
> better match the type declared in the reader's DataFormat. Then we can get
> rid of the special Row/UnsafeRow/ColumnarBatch methods cluttering up the
> API, and figure out whether to support Row and UnsafeRow independently of
> all our other API decisions. (I didn't think about this until now, but the
> fact that some orthogonal API decisions have to be conditioned on which set
> of row formats we support seems like a code smell.)
>
> On Wed, Apr 18, 2018 at 3:53 PM, Ryan Blue 
> wrote:
>
>> Wenchen, can you explain a bit more clearly why this is necessary? The
>> pseudo-code you used doesn’t clearly demonstrate why. Why couldn’t this be
>> handled this with inheritance from an abstract Factory class? Why define
>> all of the createXDataReader methods, but make the DataFormat a field in
>> the factory?
>>
>> A related issue is that I think there’s a strong case that the v2 sources
>> should produce only InternalRow and that Row and UnsafeRow shouldn’t be
>> exposed; see SPARK-23325
>> . The basic arguments
>> are:
>>
>>- UnsafeRow is really difficult to produce without using Spark’s
>>projection methods. If implementations can produce UnsafeRow, then
>>they can still pass them as InternalRow and the projection Spark adds
>>would be a no-op. When implementations can’t produce UnsafeRow, then
>>it is better for Spark to insert the projection to unsafe. An example of a
>>data format that doesn’t produce unsafe is the built-in Parquet source,
>>which produces InternalRow and projects before returning the row.
>>- For Row, I see no good reason to support it in a new interface when
>>it will just introduce an extra transformation. The argument that Row
>>is the “public” API doesn’t apply because UnsafeRow is already

Re: Possible SPIP to improve matrix and vector column type support

2018-04-18 Thread Leif Walsh
I agree we should reuse as much as possible. For PySpark, I think the
obvious choices already made (Breeze and numpy arrays) make a lot of
sense; I’m not sure about the other language bindings and would defer to
others.

I was under the impression that UDTs were gone and (probably?) not coming
back. Did I miss something and they’re actually going to be better
supported in the future? I think your second point (about separating
expanding the primitives from expanding SQL support) is only really true if
we’re getting UDTs back.

You’ve obviously seen more of the history here than I have. Do you have a sense
of why the efforts you mentioned never went anywhere? I don’t think this is
strictly about “mllib local”; it’s more about generic linalg, so SPARK-19653
feels like the closest to what I’m after, but it looks to me like that one
just fizzled out rather than ending in a real back and forth.

Does this just need something like a persistent product manager to scope
out the effort, champion it, and push it forward?
On Wed, Apr 18, 2018 at 20:02 Joseph Bradley  wrote:

> Thanks for the thoughts!  We've gone back and forth quite a bit about
> local linear algebra support in Spark.  For reference, there have been some
> discussions here:
> https://issues.apache.org/jira/browse/SPARK-6442
> https://issues.apache.org/jira/browse/SPARK-16365
> https://issues.apache.org/jira/browse/SPARK-19653
>
> Overall, I like the idea of improving linear algebra support, especially
> given the rise of Python numerical processing & deep learning.  But some
> considerations I'd list include:
> * There are great linear algebra libraries out there, and it would be
> ideal to reuse those as much as possible.
> * SQL support for linear algebra can be a separate effort from expanding
> linear algebra primitives.
> * It would be valuable to discuss external types as UDTs (which can be
> hacked with numpy and scipy types now) vs. adding linear algebra types to
> native Spark SQL.
>
>
> On Wed, Apr 11, 2018 at 7:53 PM, Leif Walsh  wrote:
>
>> Hi all,
>>
>> I’ve been playing around with the Vector and Matrix UDTs in pyspark.ml and
>> I’ve found myself wanting more.
>>
>> There is a minor issue in that with the arrow serialization enabled,
>> these types don’t serialize properly in python UDF calls or in toPandas.
>> There’s a natural representation for them in numpy.ndarray, and I’ve
>> started a conversation with the arrow community about supporting
>> tensor-valued columns, but that might be a ways out. In the meantime, I
>> think we can fix this by using the FixedSizeBinary column type in arrow,
>> together with some metadata describing the tensor shape (list of dimension
>> sizes).
>>
>> The larger issue, for which I intend to submit an SPIP soon, is that
>> these types could be better supported at the API layer, regardless of
>> serialization. In the limit, we could consider the entire numpy ndarray
>> surface area as a target. At the minimum, what I’m thinking is that these
>> types should support column operations like matrix multiply, transpose,
>> inner and outer product, etc., and maybe have a more ergonomic construction
>> API like df.withColumn(‘feature’, Vectors.of(‘list’, ‘of’, ‘cols’)), the
>> VectorAssembler API is kind of clunky.
>>
>> One possibility here is to restrict the tensor column types such that
>> every value must have the same shape, e.g. a 2x2 matrix. This would allow
>> for operations to check validity before execution, for example, a matrix
>> multiply could check dimension match and fail fast. However, there might be
>> use cases for a column to contain variable shape tensors, I’m open to
>> discussion here.
>>
>> What do you all think?
>> --
>> --
>> Cheers,
>> Leif
>>
>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> [image: http://databricks.com] 
>
-- 
-- 
Cheers,
Leif


Re: GLM Poisson Model - Deviance calculations

2018-04-18 Thread svattig
Yes, I’m referring to that deviance method. It fails whenever y is 0. I think
R’s deviance calculation logic checks if y is 0 and assigns 1 to y in such
cases.

There are a few deviance-related values, like the null deviance, the residual
deviance, and the deviance itself, that the GLM regression summary object has.
You might want to check those as well so the toString method doesn’t fail.

Thank you,
Srikar.V



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: GLM Poisson Model - Deviance calculations

2018-04-18 Thread Joseph PENG
Are you referring to this?

  override def deviance(y: Double, mu: Double, weight: Double): Double = {
    2.0 * weight * (y * math.log(y / mu) - (y - mu))
  }

Not sure how R handles this, but my guess is they may add a small
number, e.g. 0.5, to the numerator and denominator. If you can confirm
that's the issue, I will look into it.
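
For reference, a minimal sketch of the kind of guard that avoids the log(0),
using the usual 0 * log(0) = 0 convention (illustration only, not the actual
Spark code):

// Clamp the y * log(y / mu) term to 0 when y == 0, so a zero count no longer
// produces log(0) = -Infinity inside the deviance.
def poissonDeviance(y: Double, mu: Double, weight: Double): Double = {
  val yLogY = if (y == 0.0) 0.0 else y * math.log(y / mu)
  2.0 * weight * (yLogY - (y - mu))
}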

On Wed, Apr 18, 2018 at 6:46 PM, Sean Owen  wrote:

> GeneralizedLinearRegression.ylogy seems to handle this case; can you be
> more specific about where the log(0) happens? that's what should be fixed,
> right? if so, then a JIRA and PR are the right way to proceed.
>
> On Wed, Apr 18, 2018 at 2:37 PM svattig 
> wrote:
>
>> In Spark 2.3, When Poisson Model(with labelCol having few counts as 0's)
>> is
>> fit, the Deviance calculations are broken as result of log(0). I think
>> this
>> is the same case as in spark 2.2.
>> But the new toString method in Spark 2.3's
>> GeneralizedLinearRegressionTrainingSummary class is throwing error at
>> line
>> 1551 with NumberFormatException. Due to this exception, we are not able to
>> get the summary object from Model fit.
>>
>> Can the toString method be fixed including Deviance calculations for
>> example
>> taking log(1) when ever the count is 0 instead of having log(0) ?
>>
>> Thanks,
>> Srikar.V
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [discuss][data source v2] remove type parameter in DataReader/WriterFactory

2018-04-18 Thread Joseph Torres
The fundamental difficulty seems to be that there's a spurious "round-trip"
in the API. Spark inspects the source to determine what type it's going to
provide, picks an appropriate method according to that type, and then calls
that method on the source to finally get what it wants. Pushing this out of
the DataSourceReader doesn't eliminate this problem; it just shifts it. We
still need an InternalRow method and a ColumnarBatch method and possibly
Row and UnsafeRow methods too.

I'd propose it would be better to just accept a bit less type safety here,
and push the problem all the way down to the DataReader. Make
DataReader.get() return Object, and document that the runtime type had
better match the type declared in the reader's DataFormat. Then we can get
rid of the special Row/UnsafeRow/ColumnarBatch methods cluttering up the
API, and figure out whether to support Row and UnsafeRow independently of
all our other API decisions. (I didn't think about this until now, but the
fact that some orthogonal API decisions have to be conditioned on which set
of row formats we support seems like a code smell.)
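
A rough sketch of what that shape could be, in Scala for brevity (hypothetical
names, just to illustrate; not a concrete API proposal):

// The reader declares its DataFormat once, and get() returns AnyRef (Object in
// the Java API). The contract that the runtime type matches the declared format
// is documented rather than encoded in a type parameter.
trait UntypedDataReader extends java.io.Closeable {
  // The DataFormat indicator discussed in this thread, e.g. INTERNAL_ROW or COLUMNAR_BATCH.
  def dataFormat: DataFormat

  def next(): Boolean

  // Runtime type must be InternalRow or ColumnarBatch, matching dataFormat.
  def get(): AnyRef
}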

On Wed, Apr 18, 2018 at 3:53 PM, Ryan Blue 
wrote:

> Wenchen, can you explain a bit more clearly why this is necessary? The
> pseudo-code you used doesn’t clearly demonstrate why. Why couldn’t this be
> handled this with inheritance from an abstract Factory class? Why define
> all of the createXDataReader methods, but make the DataFormat a field in
> the factory?
>
> A related issue is that I think there’s a strong case that the v2 sources
> should produce only InternalRow and that Row and UnsafeRow shouldn’t be
> exposed; see SPARK-23325
> . The basic arguments
> are:
>
>- UnsafeRow is really difficult to produce without using Spark’s
>projection methods. If implementations can produce UnsafeRow, then
>they can still pass them as InternalRow and the projection Spark adds
>would be a no-op. When implementations can’t produce UnsafeRow, then
>it is better for Spark to insert the projection to unsafe. An example of a
>data format that doesn’t produce unsafe is the built-in Parquet source,
>which produces InternalRow and projects before returning the row.
>- For Row, I see no good reason to support it in a new interface when
>it will just introduce an extra transformation. The argument that Row
>is the “public” API doesn’t apply because UnsafeRow is already exposed
>through the v2 API.
>- Standardizing on InternalRow would remove the need for these
>interfaces entirely and simplify what implementers must provide and would
>reduce confusion over what to do.
>
> Using InternalRow doesn’t cover the case where we want to produce
> ColumnarBatch instead, so what you’re proposing might still be a good
> idea. I just think that we can simplify either path.
> ​
>
> On Mon, Apr 16, 2018 at 11:17 PM, Wenchen Fan  wrote:
>
>> Yea definitely not. The only requirement is, the DataReader/WriterFactory
>> must support at least one DataFormat.
>>
>> >  how are we going to express capability of the given reader of its
>> supported format(s), or specific support for each of “real-time data in row
>> format, and history data in columnar format”?
>>
>> When DataSourceReader/Writer create factories, the factory must contain
>> enough information to decide the data format. Let's take ORC as an example.
>> In OrcReaderFactory, it knows which files to read, and which columns to
>> output. Since now Spark only support columnar scan for simple types,
>> OrcReaderFactory will only output ColumnarBatch if the columns to scan
>> are all simple types.
>>
>> On Tue, Apr 17, 2018 at 11:38 AM, Felix Cheung wrote:
>>
>>> Is it required for DataReader to support all known DataFormat?
>>>
>>> Hopefully, not, as assumed by the ‘throw’ in the interface. Then
>>> specifically how are we going to express capability of the given reader of
>>> its supported format(s), or specific support for each of “real-time data in
>>> row format, and history data in columnar format”?
>>>
>>>
>>> --
>>> *From:* Wenchen Fan 
>>> *Sent:* Sunday, April 15, 2018 7:45:01 PM
>>> *To:* Spark dev list
>>> *Subject:* [discuss][data source v2] remove type parameter in
>>> DataReader/WriterFactory
>>>
>>> Hi all,
>>>
>>> I'd like to propose an API change to the data source v2.
>>>
>>> One design goal of data source v2 is API type safety. The FileFormat API
>>> is a bad example, it asks the implementation to return InternalRow even
>>> it's actually ColumnarBatch. In data source v2 we add a type parameter
>>> to DataReader/WriterFactoty and DataReader/Writer, so that data source
>>> supporting columnar scan returns ColumnarBatch at API level.
>>>
>>> However, we met some problems when migrating streaming and file-based
>>> data source to data 

Re: Possible SPIP to improve matrix and vector column type support

2018-04-18 Thread Joseph Bradley
Thanks for the thoughts!  We've gone back and forth quite a bit about local
linear algebra support in Spark.  For reference, there have been some
discussions here:
https://issues.apache.org/jira/browse/SPARK-6442
https://issues.apache.org/jira/browse/SPARK-16365
https://issues.apache.org/jira/browse/SPARK-19653

Overall, I like the idea of improving linear algebra support, especially
given the rise of Python numerical processing & deep learning.  But some
considerations I'd list include:
* There are great linear algebra libraries out there, and it would be ideal
to reuse those as much as possible.
* SQL support for linear algebra can be a separate effort from expanding
linear algebra primitives.
* It would be valuable to discuss external types as UDTs (which can be
hacked with numpy and scipy types now) vs. adding linear algebra types to
native Spark SQL.


On Wed, Apr 11, 2018 at 7:53 PM, Leif Walsh  wrote:

> Hi all,
>
> I’ve been playing around with the Vector and Matrix UDTs in pyspark.ml and
> I’ve found myself wanting more.
>
> There is a minor issue in that with the arrow serialization enabled, these
> types don’t serialize properly in python UDF calls or in toPandas. There’s
> a natural representation for them in numpy.ndarray, and I’ve started a
> conversation with the arrow community about supporting tensor-valued
> columns, but that might be a ways out. In the meantime, I think we can fix
> this by using the FixedSizeBinary column type in arrow, together with some
> metadata describing the tensor shape (list of dimension sizes).
>
> The larger issue, for which I intend to submit an SPIP soon, is that these
> types could be better supported at the API layer, regardless of
> serialization. In the limit, we could consider the entire numpy ndarray
> surface area as a target. At the minimum, what I’m thinking is that these
> types should support column operations like matrix multiply, transpose,
> inner and outer product, etc., and maybe have a more ergonomic construction
> API like df.withColumn(‘feature’, Vectors.of(‘list’, ‘of’, ‘cols’)), the
> VectorAssembler API is kind of clunky.
>
> One possibility here is to restrict the tensor column types such that
> every value must have the same shape, e.g. a 2x2 matrix. This would allow
> for operations to check validity before execution, for example, a matrix
> multiply could check dimension match and fail fast. However, there might be
> use cases for a column to contain variable shape tensors, I’m open to
> discussion here.
>
> What do you all think?
> --
> --
> Cheers,
> Leif
>



-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] 


Re: Sort-merge join improvement

2018-04-18 Thread Petar Zecevic

As instructed offline, I opened a JIRA for this:

https://issues.apache.org/jira/browse/SPARK-24020

I will create a pull request soon.


On 4/17/2018 at 6:21 PM, Petar Zecevic wrote:

Hello everybody

We (at the University of Zagreb and the University of Washington) have
implemented an optimization of Spark's sort-merge join (SMJ) which has
improved the performance of our jobs considerably, and we would like to know
if the Spark community thinks it would be useful to include it in the main
distribution.

The problem we are solving is the case where you have two big tables
partitioned by X column, but also sorted by Y column (within partitions)
and you need to calculate an expensive function on the joined rows.
During a sort-merge join, Spark will do cross-joins of all rows that
have the same X values and calculate the function's value on all of
them. If the two tables have a large number of rows per X, this can
result in a huge number of calculations.

Our optimization allows you to reduce the number of matching rows per X
using a range condition on Y columns of the two tables. Something like:

... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d

The way SMJ is currently implemented, these extra conditions have no
influence on the number of rows (per X) being checked because these
extra conditions are put in the same block with the function being
calculated.

Our optimization changes the sort-merge join so that, when these extra
conditions are specified, a queue is used instead of the
ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a
moving window across the values from the right relation as the left row
changes. You could call this a combination of an equi-join and a theta
join (we call it "sort-merge inner range join").
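
To sketch the idea, here is a toy, single-machine model of the moving-window
logic (hypothetical names and simplified types; the real change lives inside
Spark's sort-merge join operator and works on sorted InternalRow iterators):

import scala.collection.mutable

// Toy model of the "sort-merge inner range join": both inputs are sorted by
// (x, y); for each left row we keep a queue (the moving window) of right rows
// with the same x whose y lies in [left.y - d, left.y + d].
case class R(x: Int, y: Double)

def innerRangeJoin(left: Seq[R], right: Seq[R], d: Double): Seq[(R, R)] = {
  val out = mutable.ArrayBuffer.empty[(R, R)]
  val window = mutable.Queue.empty[R]
  var i = 0
  for (l <- left) {
    // Evict rows that can no longer match: different key, or y below the window.
    while (window.nonEmpty && (window.head.x != l.x || window.head.y < l.y - d))
      window.dequeue()
    // Advance the right side up to the end of the current window.
    while (i < right.length && (right(i).x < l.x ||
           (right(i).x == l.x && right(i).y <= l.y + d))) {
      if (right(i).x == l.x && right(i).y >= l.y - d) window.enqueue(right(i))
      i += 1
    }
    // Only rows inside the window are joined, instead of all rows with the same x.
    window.foreach(r => out += ((l, r)))
  }
  out.toSeq
}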

Potential use-cases for this are joins based on spatial or temporal
distance calculations.

The optimization is triggered automatically when an equi-join expression
is present AND lower and upper range conditions on a secondary column
are specified. If the tables aren't sorted by both columns, appropriate
sorts will be added.


We have several questions:

1. Do you see any other way to optimize queries like these (eliminate
unnecessary calculations) without changing the sort-merge join algorithm?

2. We believe there is a more general pattern here and that this could
help in other similar situations where secondary sorting is available.
Would you agree?

3. Would you like us to open a JIRA ticket and create a pull request?

Thanks,

Petar Zecevic



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org







Re: GLM Poisson Model - Deviance calculations

2018-04-18 Thread Sean Owen
GeneralizedLinearRegression.ylogy seems to handle this case; can you be
more specific about where the log(0) happens? that's what should be fixed,
right? if so, then a JIRA and PR are the right way to proceed.

On Wed, Apr 18, 2018 at 2:37 PM svattig  wrote:

> In Spark 2.3, When Poisson Model(with labelCol having few counts as 0's) is
> fit, the Deviance calculations are broken as result of log(0). I think this
> is the same case as in spark 2.2.
> But the new toString method in Spark 2.3's
> GeneralizedLinearRegressionTrainingSummary class is throwing error at line
> 1551 with NumberFormatException. Due to this exception, we are not able to
> get the summary object from Model fit.
>
> Can the toString method be fixed including Deviance calculations for
> example
> taking log(1) when ever the count is 0 instead of having log(0) ?
>
> Thanks,
> Srikar.V
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [discuss][data source v2] remove type parameter in DataReader/WriterFactory

2018-04-18 Thread Ryan Blue
Wenchen, can you explain a bit more clearly why this is necessary? The
pseudo-code you used doesn’t clearly demonstrate why. Why couldn’t this be
handled with inheritance from an abstract Factory class? Why define
all of the createXDataReader methods, but make the DataFormat a field in
the factory?

A related issue is that I think there’s a strong case that the v2 sources
should produce only InternalRow and that Row and UnsafeRow shouldn’t be
exposed; see SPARK-23325 .
The basic arguments are:

   - UnsafeRow is really difficult to produce without using Spark’s
   projection methods. If implementations can produce UnsafeRow, then they
   can still pass them as InternalRow and the projection Spark adds would
   be a no-op. When implementations can’t produce UnsafeRow, then it is
   better for Spark to insert the projection to unsafe. An example of a data
   format that doesn’t produce unsafe is the built-in Parquet source, which
   produces InternalRow and projects before returning the row.
   - For Row, I see no good reason to support it in a new interface when it
   will just introduce an extra transformation. The argument that Row is
   the “public” API doesn’t apply because UnsafeRow is already exposed
   through the v2 API.
   - Standardizing on InternalRow would remove the need for these
   interfaces entirely and simplify what implementers must provide and would
   reduce confusion over what to do.

Using InternalRow doesn’t cover the case where we want to produce
ColumnarBatch instead, so what you’re proposing might still be a good idea.
I just think that we can simplify either path.
​

On Mon, Apr 16, 2018 at 11:17 PM, Wenchen Fan  wrote:

> Yea definitely not. The only requirement is, the DataReader/WriterFactory
> must support at least one DataFormat.
>
> >  how are we going to express capability of the given reader of its
> supported format(s), or specific support for each of “real-time data in row
> format, and history data in columnar format”?
>
> When DataSourceReader/Writer create factories, the factory must contain
> enough information to decide the data format. Let's take ORC as an example.
> In OrcReaderFactory, it knows which files to read, and which columns to
> output. Since now Spark only support columnar scan for simple types,
> OrcReaderFactory will only output ColumnarBatch if the columns to scan
> are all simple types.
>
> On Tue, Apr 17, 2018 at 11:38 AM, Felix Cheung 
> wrote:
>
>> Is it required for DataReader to support all known DataFormat?
>>
>> Hopefully, not, as assumed by the ‘throw’ in the interface. Then
>> specifically how are we going to express capability of the given reader of
>> its supported format(s), or specific support for each of “real-time data in
>> row format, and history data in columnar format”?
>>
>>
>> --
>> *From:* Wenchen Fan 
>> *Sent:* Sunday, April 15, 2018 7:45:01 PM
>> *To:* Spark dev list
>> *Subject:* [discuss][data source v2] remove type parameter in
>> DataReader/WriterFactory
>>
>> Hi all,
>>
>> I'd like to propose an API change to the data source v2.
>>
>> One design goal of data source v2 is API type safety. The FileFormat API
>> is a bad example, it asks the implementation to return InternalRow even
>> it's actually ColumnarBatch. In data source v2 we add a type parameter
>> to DataReader/WriterFactoty and DataReader/Writer, so that data source
>> supporting columnar scan returns ColumnarBatch at API level.
>>
>> However, we met some problems when migrating streaming and file-based
>> data source to data source v2.
>>
>> For the streaming side, we need a variant of DataReader/WriterFactory to
>> add streaming specific concept like epoch id and offset. For details please
>> see ContinuousDataReaderFactory and https://docs.google.com/do
>> cument/d/1PJYfb68s2AG7joRWbhrgpEWhrsPqbhyRwUVl9V1wPOE/edit#
>>
>> But this conflicts with the special format mixin traits like
>> SupportsScanColumnarBatch. We have to make the streaming variant of
>> DataReader/WriterFactory to extend the original DataReader/WriterFactory,
>> and do type cast at runtime, which is unnecessary and violate the type
>> safety.
>>
>> For the file-based data source side, we have a problem with code
>> duplication. Let's take ORC data source as an example. To support both
>> unsafe row and columnar batch scan, we need something like
>>
>> // A lot of parameters to carry to the executor side
>> class OrcUnsafeRowFactory(...) extends DataReaderFactory[UnsafeRow] {
>>   def createDataReader ...
>> }
>>
>> class OrcColumnarBatchFactory(...) extends DataReaderFactory[ColumnarBatch]
>> {
>>   def createDataReader ...
>> }
>>
>> class OrcDataSourceReader extends DataSourceReader {
>>   def createUnsafeRowFactories = ... // logic to prepare the parameters
>> and create factories
>>
>>   def createColumnarBatchFactories = ... // logic 

GLM Poisson Model - Deviance calculations

2018-04-18 Thread svattig
In Spark 2.3, when a Poisson model (with a labelCol that has a few counts of 0) is
fit, the deviance calculations are broken as a result of log(0). I think this
is the same case as in Spark 2.2.
But the new toString method in Spark 2.3's
GeneralizedLinearRegressionTrainingSummary class is throwing an error at line
1551 with a NumberFormatException. Due to this exception, we are not able to
get the summary object from the model fit.

Can the toString method be fixed, including the deviance calculations, for example
by taking log(1) whenever the count is 0 instead of log(0)?
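
For reference, a rough sketch of the kind of code that hits this (dataset and
column names are made up):

import org.apache.spark.ml.regression.GeneralizedLinearRegression

// Fit a Poisson GLM on data whose label column contains zero counts, then
// touch the training summary, which is where the failure shows up.
val glr = new GeneralizedLinearRegression()
  .setFamily("poisson")
  .setLink("log")
  .setLabelCol("count")             // contains some 0 values
  .setFeaturesCol("features")

val model = glr.fit(trainingData)   // trainingData: a DataFrame with "count" and "features"
println(model.summary)              // summary.toString is reported to throw in Spark 2.3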

Thanks,
Srikar.V



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org