Re: Outstanding Spark 2.1.1 issues

2017-03-20 Thread Daniel Siegmann
Any chance of back-porting

SPARK-14536 - NPE in
JDBCRDD when array column contains nulls (postgresql)?

It just adds a null check - a simple bug fix - so it really belongs in
Spark 2.1.x.
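
A hypothetical Scala sketch of the null-guard shape involved - not the actual
patch - for anyone who wants to see how small the change is:

  // Converting a JDBC array column must tolerate null elements instead of
  // dereferencing them; the guard is the entire idea of the fix.
  def convertArrayElements(elements: Array[AnyRef]): Array[String] =
    elements.map { e =>
      if (e == null) null // previously a method call on e here threw the NPE
      else e.toString
    }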

On Mon, Mar 20, 2017 at 6:12 PM, Holden Karau wrote:

> Hi Spark Developers!
>
> As we start working on the Spark 2.1.1 release I've been looking at our
> outstanding issues still targeted for it. I've tried to break it down by
> component so that people in charge of each component can take a quick look
> and see if any of these things can/should be re-targeted to 2.2 or 2.1.2 &
> the overall list is pretty short (only 9 items - 5 if we only look at
> explicitly tagged) :)
>
> If you're working on something for Spark 2.1.1 and it doesn't show up in
> this list please speak up now :) We have a lot of issues (including "in
> progress") that are listed as impacting 2.1.0, but they aren't targeted for
> 2.1.1 - if there is something you are working on there that should be
> targeted for 2.1.1 please let us know so it doesn't slip through the cracks.
>
> The query string I used for looking at the 2.1.1 open issues is:
>
> ((affectedVersion = 2.1.1 AND cf[12310320] is Empty) OR fixVersion = 2.1.1
> OR cf[12310320] = "2.1.1") AND project = spark AND resolution = Unresolved
> ORDER BY priority DESC
>
> None of the open issues appear to be a regression from 2.1.0, but those
> seem more likely to show up during the RC process (thanks in advance to
> everyone testing their workloads :)).
>
> (Note: cf[12310320] is the Target Version/s custom field)
>
> Critical Issues:
>  SQL:
>   SPARK-19690 - Join
> a streaming DataFrame with a batch DataFrame may not work - PR
> https://github.com/apache/spark/pull/17052 (review in progress by
> zsxwing, currently failing Jenkins)*
>
> Major Issues:
>  SQL:
>   SPARK-19035 - rand()
> function in case when cause failed - no outstanding PR (consensus on JIRA
> seems to be leaning towards it being a real issue but not necessarily
> everyone agrees just yet - maybe we should slip this?)*
>  Deploy:
>   SPARK-19522 - --executor-memory flag doesn't work in local-cluster mode -
> https://github.com/apache/spark/pull/16975 (review in progress by vanzin,
> but PR currently stalled waiting on response) *
>  Core:
>   SPARK-20025 - Driver
> fail over will not work, if SPARK_LOCAL* env is set. -
> https://github.com/apache/spark/pull/17357 (waiting on review) *
>  PySpark:
>  SPARK-19955 - Update
> run-tests to support conda [part of dropping Python 2.6 support -- which we
> shouldn't do in a minor release -- but it also fixes pip installability tests
> to run in Jenkins] - PR failing Jenkins (I need to poke this some more,
> but it seems like 2.7 support works with some other issues remaining. Maybe
> slip to 2.2?)
>
> Minor issues:
>  Tests:
>   SPARK-19612 - Tests
> failing with timeout - No PR per se, but it seems unrelated to the 2.1.1
> release. It's not targeted for 2.1.1 but listed as affecting 2.1.1 - I'd
> consider explicitly targeting this for 2.2?
>  PySpark:
>   SPARK-19570 - Allow
> to disable hive in pyspark shell - https://github.com/apache/spark/pull/16906
> PR exists but it's difficult to add automated tests for
> this (although if SPARK-19955 gets in, it would make
> testing this easier) - no reviewers yet. Possible re-target?*
>  Structured Streaming:
>   SPARK-19613 - Flaky
> test: StateStoreRDDSuite.versioning and immutability - It's not targeted
> for 2.1.1 but listed as affecting 2.1.1 - I'd consider explicitly targeting
> this for 2.2?
>  ML:
>   SPARK-19759 - ALSModel.predict on Dataframes : potential optimization by
> not using blas - No PR; consider re-targeting unless someone has a PR
> waiting in the wings?
>
> Explicitly targeted issues are marked with a *; the remaining issues are
> listed as impacting 2.1.1 and don't have a specific target version set.
>
> Since 2.1.1 continues the 2.1.0 branch, looking at 2.1.0 shows 1 open
> blocker in SQL (SPARK-19983).
>
> Query string is:
>
> affectedVersion = 2.1.0 AND cf[12310320] is EMPTY AND project = spark AND
> resolution = Unresolved AND priority = targetPriority
>
> Continuing on for unresolved 2.1.0 issues: in Major there are 163 (76 of
> them in progress), 65 in Minor (26 in progress), and 9 in Trivial (6 in
> progress).
>
> I'll be going through the 2.1.0 major issues with open PRs that 

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-25 Thread Daniel Siegmann
After support is dropped for Java 7, can we have encoders for java.time
classes (e.g. LocalDate)? If so, then please drop support for Java 7 ASAP.
:-)
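
Until such encoders exist, one Spark 2.x workaround is a Kryo-based encoder.
A minimal Scala sketch, assuming the standard Encoders.kryo API (the trade-off
is that the value is stored as opaque binary rather than a proper DATE column):

  import java.time.LocalDate
  import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

  object LocalDateEncoderExample {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .master("local[*]").appName("local-date-encoder").getOrCreate()

      // No native encoder for java.time types here, so fall back to Kryo
      // serialization; the Dataset column becomes a single binary field.
      implicit val localDateEncoder: Encoder[LocalDate] = Encoders.kryo[LocalDate]

      val ds = spark.createDataset(Seq(LocalDate.of(2016, 10, 25), LocalDate.now()))
      ds.show()

      spark.stop()
    }
  }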


Re: Spark ML - Scaling logistic regression for many features

2016-04-28 Thread Daniel Siegmann
FYI: https://issues.apache.org/jira/browse/SPARK-14464

I have submitted a PR as well.

On Fri, Mar 18, 2016 at 7:15 AM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> No, I didn't yet - feel free to create a JIRA.
>
>
>
> On Thu, 17 Mar 2016 at 22:55 Daniel Siegmann <daniel.siegm...@teamaol.com>
> wrote:
>
>> Hi Nick,
>>
>> Thanks again for your help with this. Did you create a ticket in JIRA for
>> investigating sparse models in LR and / or multivariate summariser? If so,
>> can you give me the issue key(s)? If not, would you like me to create these
>> tickets?
>>
>> I'm going to look into this some more and see if I can figure out how to
>> implement these fixes.
>>
>> ~Daniel Siegmann
>>
>> On Sat, Mar 12, 2016 at 5:53 AM, Nick Pentreath <nick.pentre...@gmail.com>
>> wrote:
>>
>>> Also adding dev list in case anyone else has ideas / views.
>>>
>>> On Sat, 12 Mar 2016 at 12:52, Nick Pentreath <nick.pentre...@gmail.com>
>>> wrote:
>>>
>>>> Thanks for the feedback.
>>>>
>>>> I think Spark can certainly meet your use case when your data size
>>>> scales up, as the actual model dimension is very small - you will need to
>>>> use those indexers or some other mapping mechanism.
>>>>
>>>> There is ongoing work for Spark 2.0 to make it easier to use models
>>>> outside of Spark - also see PMML export (I think mllib logistic regression
>>>> is supported but I have to check that). That will help use spark models in
>>>> serving environments.
>>>>
>>>> Finally, I will add a JIRA to investigate sparse models for LR - maybe
>>>> also a ticket for multivariate summariser (though I don't think in practice
>>>> there will be much to gain).
>>>>
>>>>
>>>> On Fri, 11 Mar 2016 at 21:35, Daniel Siegmann <
>>>> daniel.siegm...@teamaol.com> wrote:
>>>>
>>>>> Thanks for the pointer to those indexers, those are some good
>>>>> examples. A good way to go for the trainer and any scoring done in Spark.
>>>>> I will definitely have to deal with scoring in non-Spark systems though.
>>>>>
>>>>> I think I will need to scale up beyond what single-node liblinear can
>>>>> practically provide. The system will need to handle much larger
>>>>> sub-samples of this data (and other projects might be larger still).
>>>>> Additionally, the system needs to train many models in parallel
>>>>> (hyper-parameter optimization with n-fold cross-validation, multiple
>>>>> algorithms, different sets of features).
>>>>>
>>>>> Still, I suppose we'll have to consider whether Spark is the best
>>>>> system for this. For now though, my job is to see what can be achieved
>>>>> with Spark.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Mar 11, 2016 at 12:45 PM, Nick Pentreath <
>>>>> nick.pentre...@gmail.com> wrote:
>>>>>
>>>>>> Ok, I think I understand things better now.
>>>>>>
>>>>>> For Spark's current implementation, you would need to map those
>>>>>> features as you mention. You could also use say StringIndexer ->
>>>>>> OneHotEncoder or VectorIndexer. You could create a Pipeline to deal with
>>>>>> the mapping and training (e.g.
>>>>>> http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline).
>>>>>> Pipeline supports persistence.
>>>>>>
>>>>>> But it depends on your scoring use case too - a Spark pipeline can be
>>>>>> saved and then reloaded, but you need all of Spark dependencies in your
>>>>>> serving app which is often not ideal. If you're doing bulk scoring
>>>>>> offline, then it may suit.
>>>>>>
>>>>>> Honestly though, for that data size I'd certainly go with something
>>>>>> like Liblinear :) Spark will ultimately scale better with # training
>>>>>> examples for very large scale problems. However there are definitely
>>>>>> limitations on model dimension and sparse weight vectors currently. There
>>>>>> are potential solutions to these but they haven't been implemented as
>>>>>> yet.
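
For reference, a minimal Scala sketch of the StringIndexer -> OneHotEncoder
pipeline Nick describes above, ending with the persistence he mentions. Column
names and data are hypothetical, and this assumes the Spark 2.x-style
SparkSession API:

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
  import org.apache.spark.sql.SparkSession

  object IndexerPipelineExample {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .master("local[*]").appName("indexer-pipeline").getOrCreate()

      // Toy training data: one categorical feature, one numeric, plus a label.
      val training = spark.createDataFrame(Seq(
        ("US", 1.0, 0.0),
        ("DE", 0.5, 1.0),
        ("US", 0.3, 1.0)
      )).toDF("country", "score", "label")

      // Index the categorical column, one-hot encode it, assemble everything
      // into a single feature vector, then train logistic regression.
      val indexer = new StringIndexer().setInputCol("country").setOutputCol("countryIndex")
      val encoder = new OneHotEncoder().setInputCol("countryIndex").setOutputCol("countryVec")
      val assembler = new VectorAssembler()
        .setInputCols(Array("countryVec", "score")).setOutputCol("features")
      val lr = new LogisticRegression().setMaxIter(10)

      val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, lr))
      val model = pipeline.fit(training)

      // Pipeline persistence: the fitted model can be saved and reloaded later.
      model.write.overwrite().save("/tmp/lr-pipeline-model")

      spark.stop()
    }
  }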

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-04-11 Thread Daniel Siegmann
On Wed, Apr 6, 2016 at 2:57 PM, Mark Hamstra <m...@clearstorydata.com>
wrote:

... My concern is that either of those options will take more resources
> than some Spark users will have available in the ~3 months remaining before
> Spark 2.0.0, which will cause fragmentation into Spark 1.x and Spark 2.x
> user communities. ...
>

It's not as if everyone is going to switch over to Spark 2.0.0 on release
day anyway. It's not that unusual to see posts on the user list from people
who are a version or two behind. I think a few extra months lag time will
be OK for a major version.

Besides, in my experience if you give people more time to upgrade, they're
just going to kick the can down the road a ways and you'll eventually end
up with the same problem. I don't see a good reason to *not* drop Java 7
and Scala 2.10 support with Spark 2.0.0. Time to bite the bullet. If
companies stick with Spark 1.x and find themselves missing the new features
in the 2.x line, that will be a good motivation for them to upgrade.

~Daniel Siegmann


Re: Spark ML - Scaling logistic regression for many features

2016-03-19 Thread Daniel Siegmann
Hi Nick,

Thanks again for your help with this. Did you create a ticket in JIRA for
investigating sparse models in LR and / or multivariate summariser? If so,
can you give me the issue key(s)? If not, would you like me to create these
tickets?

I'm going to look into this some more and see if I can figure out how to
implement these fixes.

~Daniel Siegmann

On Sat, Mar 12, 2016 at 5:53 AM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> Also adding dev list in case anyone else has ideas / views.
>
> On Sat, 12 Mar 2016 at 12:52, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>> Thanks for the feedback.
>>
>> I think Spark can certainly meet your use case when your data size scales
>> up, as the actual model dimension is very small - you will need to use
>> those indexers or some other mapping mechanism.
>>
>> There is ongoing work for Spark 2.0 to make it easier to use models
>> outside of Spark - also see PMML export (I think mllib logistic regression
>> is supported but I have to check that). That will help use spark models in
>> serving environments.
>>
>> Finally, I will add a JIRA to investigate sparse models for LR - maybe
>> also a ticket for multivariate summariser (though I don't think in practice
>> there will be much to gain).
>>
>>
>> On Fri, 11 Mar 2016 at 21:35, Daniel Siegmann <
>> daniel.siegm...@teamaol.com> wrote:
>>
>>> Thanks for the pointer to those indexers, those are some good examples.
>>> A good way to go for the trainer and any scoring done in Spark. I will
>>> definitely have to deal with scoring in non-Spark systems though.
>>>
>>> I think I will need to scale up beyond what single-node liblinear can
>>> practically provide. The system will need to handle much larger sub-samples
>>> of this data (and other projects might be larger still). Additionally, the
>>> system needs to train many models in parallel (hyper-parameter optimization
>>> with n-fold cross-validation, multiple algorithms, different sets of
>>> features).
>>>
>>> Still, I suppose we'll have to consider whether Spark is the best system
>>> for this. For now though, my job is to see what can be achieved with Spark.
>>>
>>>
>>>
>>> On Fri, Mar 11, 2016 at 12:45 PM, Nick Pentreath <
>>> nick.pentre...@gmail.com> wrote:
>>>
>>>> Ok, I think I understand things better now.
>>>>
>>>> For Spark's current implementation, you would need to map those
>>>> features as you mention. You could also use say StringIndexer ->
>>>> OneHotEncoder or VectorIndexer. You could create a Pipeline to deal with
>>>> the mapping and training (e.g.
>>>> http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline).
>>>> Pipeline supports persistence.
>>>>
>>>> But it depends on your scoring use case too - a Spark pipeline can be
>>>> saved and then reloaded, but you need all of Spark dependencies in your
>>>> serving app which is often not ideal. If you're doing bulk scoring offline,
>>>> then it may suit.
>>>>
>>>> Honestly though, for that data size I'd certainly go with something
>>>> like Liblinear :) Spark will ultimately scale better with # training
>>>> examples for very large scale problems. However there are definitely
>>>> limitations on model dimension and sparse weight vectors currently. There
>>>> are potential solutions to these but they haven't been implemented as yet.
>>>>
>>>> On Fri, 11 Mar 2016 at 18:35 Daniel Siegmann <
>>>> daniel.siegm...@teamaol.com> wrote:
>>>>
>>>>> On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath <
>>>>> nick.pentre...@gmail.com> wrote:
>>>>>
>>>>>> Would you mind letting us know the # training examples in the
>>>>>> datasets? Also, what do your features look like? Are they text,
>>>>>> categorical, etc? You mention that most rows only have a few features,
>>>>>> and all rows
>>>>>> together have a few 10,000s features, yet your max feature value is 20
>>>>>> million. How are you constructing your feature vectors to get a
>>>>>> 20 million size? The only realistic way I can see this situation
>>>>>> occurring in practice is with feature hashing (HashingTF).
>>>>>>
>>>>>
>>>>> T
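
Since the archive cuts the message off here: for context, a minimal Scala
sketch of the feature-hashing setup mentioned above, showing how HashingTF can
produce feature vectors far wider than the number of distinct features present.
Column names, data, and the 2^24 dimension are hypothetical, and this assumes
the Spark 2.x-style SparkSession API:

  import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
  import org.apache.spark.sql.SparkSession

  object HashingTFExample {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .master("local[*]").appName("hashing-tf").getOrCreate()

      val df = spark.createDataFrame(Seq(
        (0.0, "checking_account=none purpose=education"),
        (1.0, "checking_account=low purpose=car")
      )).toDF("label", "raw")

      // Tokenize, then hash each token into a fixed-width sparse vector.
      // With numFeatures near 2^24 (~16.7M), the nominal model dimension is
      // huge even though each row has only a handful of active features.
      val tokenizer = new Tokenizer().setInputCol("raw").setOutputCol("words")
      val hashingTF = new HashingTF()
        .setInputCol("words").setOutputCol("features")
        .setNumFeatures(1 << 24)

      val featurized = hashingTF.transform(tokenizer.transform(df))
      featurized.select("features").show(truncate = false)

      spark.stop()
    }
  }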