Re: how to retain part of the features in LogisticRegressionModel (spark2.0)

2017-03-20 Thread Yan Facai
Hi, jinhong.
Do you use `setRegParam`, which is 0.0 by default?


Both elasticNetParam and regParam must be set if regularization is needed:

val regParamL1 = $(elasticNetParam) * $(regParam)
val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
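
Those two lines show how Spark combines the two params internally. A minimal
sketch of enabling L1 regularization from the user side (assuming a training
DataFrame named `training` with the usual `label`/`features` columns):

    import org.apache.spark.ml.classification.LogisticRegression

    // elasticNetParam = 1.0 selects pure L1 (lasso); regParam sets the overall strength.
    // With regParam left at its 0.0 default, no regularization is applied at all.
    val lr = new LogisticRegression()
      .setRegParam(0.1)
      .setElasticNetParam(1.0)
    val model = lr.fit(training)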




On Mon, Mar 20, 2017 at 6:31 PM, Yanbo Liang  wrote:

> Do you want to get a sparse model in which most of the coefficients are zeros?
> If yes, using L1 regularization leads to sparsity. But the
> LogisticRegressionModel coefficients vector's size is still equal to the
> number of features; you can get the non-zero elements manually. Actually,
> it would be a sparse vector (or matrix for the multinomial case) if it's
> sparse enough.
>
> Thanks
> Yanbo
>
> On Sun, Mar 19, 2017 at 5:02 AM, Dhanesh Padmanabhan <
> dhanesh12...@gmail.com> wrote:
>
>> It shouldn't be difficult to convert the coefficients to a sparse vector.
>> Not sure if that is what you are looking for
>>
>> -Dhanesh
>>
>> On Sun, Mar 19, 2017 at 5:02 PM jinhong lu  wrote:
>>
>> Thanks Dhanesh,  and how about the features question?
>>
>> On Mar 19, 2017, at 19:08, Dhanesh Padmanabhan wrote:
>>
>> Dhanesh
>>
>>
>> Thanks,
>> lujinhong
>>
>> --
>> Dhanesh
>> +91-9741125245
>>
>
>


Re: Outstanding Spark 2.1.1 issues

2017-03-20 Thread Holden Karau
I'm not super sure it should be a blocker for 2.1.1 -- is it a regression?
Maybe we can get TD's input on it?

On Mon, Mar 20, 2017 at 8:48 PM Nan Zhu  wrote:

> I think https://issues.apache.org/jira/browse/SPARK-19280 should be a
> blocker
>
> Best,
>
> Nan
>
> On Mon, Mar 20, 2017 at 8:18 PM, Felix Cheung 
> wrote:
>
> I've been scrubbing R and think we are tracking 2 issues
>
> https://issues.apache.org/jira/browse/SPARK-19237
>
> https://issues.apache.org/jira/browse/SPARK-19925
>
>
>
>
> --
> *From:* holden.ka...@gmail.com  on behalf of
> Holden Karau 
> *Sent:* Monday, March 20, 2017 3:12:35 PM
> *To:* dev@spark.apache.org
> *Subject:* Outstanding Spark 2.1.1 issues
>
> [Holden's original message quoted in full; trimmed here. See the complete text later in this digest.]

Re: Outstanding Spark 2.1.1 issues

2017-03-20 Thread Nan Zhu
I think https://issues.apache.org/jira/browse/SPARK-19280 should be a
blocker

Best,

Nan

On Mon, Mar 20, 2017 at 8:18 PM, Felix Cheung 
wrote:

> I've been scrubbing R and think we are tracking 2 issues
>
> https://issues.apache.org/jira/browse/SPARK-19237
>
> https://issues.apache.org/jira/browse/SPARK-19925
>
>
>
>
> --
> *From:* holden.ka...@gmail.com  on behalf of
> Holden Karau 
> *Sent:* Monday, March 20, 2017 3:12:35 PM
> *To:* dev@spark.apache.org
> *Subject:* Outstanding Spark 2.1.1 issues
>
> [Holden's original message quoted in full; trimmed here. See the complete text later in this digest.]

Re: Outstanding Spark 2.1.1 issues

2017-03-20 Thread Felix Cheung
I've been scrubbing R and think we are tracking 2 issues


https://issues.apache.org/jira/browse/SPARK-19237


https://issues.apache.org/jira/browse/SPARK-19925




From: holden.ka...@gmail.com  on behalf of Holden Karau 

Sent: Monday, March 20, 2017 3:12:35 PM
To: dev@spark.apache.org
Subject: Outstanding Spark 2.1.1 issues

[Holden's original message quoted in full; trimmed here. See the complete text later in this digest.]

Re: Why are DataFrames always read with nullable=True?

2017-03-20 Thread Kazuaki Ishizaki
Hi,
Regarding the read path for nullability, adding a data-cleaning step is being
considered, as Xiao said at
https://www.mail-archive.com/user@spark.apache.org/msg39233.html.

Here is a PR, https://github.com/apache/spark/pull/17293, which adds a data
cleaning step that throws an exception if a null exists in a non-nullable column.
Any comments are appreciated.
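
As a rough illustration of what such a validation amounts to (this is not the
actual PR code; `df` and the column name are placeholders):

    import org.apache.spark.sql.functions.col

    // Fail fast if a supposedly non-nullable column actually contains nulls.
    val nullCount = df.filter(col("id").isNull).count()
    require(nullCount == 0,
      s"Column 'id' is declared non-nullable but contains $nullCount null rows")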

Kazuaki Ishizaki



From:   Jason White 
To: dev@spark.apache.org
Date:   2017/03/21 06:31
Subject: Why are DataFrames always read with nullable=True?



If I create a dataframe in Spark with non-nullable columns, and then save
that to disk as a Parquet file, the columns are properly marked as
non-nullable. I confirmed this using parquet-tools. Then, when loading it
back, Spark forces the nullable back to True.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L378


If I remove the `.asNullable` part, Spark performs exactly as I'd like by
default, picking up the data using the schema either in the Parquet file or
provided by me.

This particular LoC goes back a year now, and I've seen a variety of
discussions about this issue. In particular with Michael here:
https://www.mail-archive.com/user@spark.apache.org/msg39230.html. Those
seemed to be discussing writing, not reading, though, and writing is already
supported now.

Is this functionality still desirable? Is it potentially not applicable for
all file formats and situations (e.g. HDFS/Parquet)? Would it be suitable to
pass an option to the DataFrameReader to disable this functionality?









Re: Why are DataFrames always read with nullable=True?

2017-03-20 Thread Takeshi Yamamuro
Hi,

Have you checked the related JIRA? e.g.,
https://issues.apache.org/jira/browse/SPARK-19950
If you have any asks or requests, it would be better to raise them there.

Thanks!

// maropu


On Tue, Mar 21, 2017 at 6:30 AM, Jason White 
wrote:

> [Jason's original message quoted in full; trimmed here. See the complete text later in this digest.]
>


-- 
---
Takeshi Yamamuro


Re: Outstanding Spark 2.1.1 issues

2017-03-20 Thread Daniel Siegmann
Any chance of back-porting

SPARK-14536  - NPE in
JDBCRDD when array column contains nulls (postgresql)

The fix just adds a null check - a simple bug fix - so it really belongs in
Spark 2.1.x.
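
For reference, the general shape of such a fix is a per-element null guard when
materializing the JDBC array. This is an illustration only, with assumed element
types, not the actual patch:

    // Convert a java.sql.Array column to Scala, tolerating null elements.
    def toScalaArray(sqlArray: java.sql.Array): Array[Any] = {
      if (sqlArray == null) null
      else sqlArray.getArray.asInstanceOf[Array[AnyRef]].map {
        case null => null                        // previously an element like this caused the NPE
        case d: java.math.BigDecimal => BigDecimal(d)
        case other => other
      }
    }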

On Mon, Mar 20, 2017 at 6:12 PM, Holden Karau  wrote:

> [Holden's original message quoted in full; trimmed here. See the complete text later in this digest.]

Outstanding Spark 2.1.1 issues

2017-03-20 Thread Holden Karau
Hi Spark Developers!

As we start working on the Spark 2.1.1 release I've been looking at our
outstanding issues still targeted for it. I've tried to break it down by
component so that people in charge of each component can take a quick look
and see if any of these things can/should be re-targeted to 2.2 or 2.1.2 &
the overall list is pretty short (only 9 items - 5 if we only look at
explicitly tagged) :)

If you're working on something for Spark 2.1.1 and it doesn't show up in this
list please speak up now :) We have a lot of issues (including "in
progress") that are listed as impacting 2.1.0, but they aren't targeted for
2.1.1 - if there is something you are working on there which should be
targeted for 2.1.1 please let us know so it doesn't slip through the cracks.

The query string I used for looking at the 2.1.1 open issues is:

((affectedVersion = 2.1.1 AND cf[12310320] is Empty) OR fixVersion = 2.1.1
OR cf[12310320] = "2.1.1") AND project = spark AND resolution = Unresolved
ORDER BY priority DESC

None of the open issues appear to be a regression from 2.1.0, but those
seem more likely to show up during the RC process (thanks in advance to
everyone testing their workloads :)) & generally none of them seem to be

(Note: the cfs are for Target Version/s field)

Critical Issues:
 SQL:
  SPARK-19690  - Join a
streaming DataFrame with a batch DataFrame may not work - PR
https://github.com/apache/spark/pull/17052 (review in progress by zsxwing,
currently failing Jenkins)*

Major Issues:
 SQL:
  SPARK-19035  - rand()
function in case when cause failed - no outstanding PR (consensus on JIRA
seems to be leaning towards it being a real issue but not necessarily
everyone agrees just yet - maybe we should slip this?)*
 Deploy:
  SPARK-19522 
 - --executor-memory flag doesn't work in local-cluster mode -
https://github.com/apache/spark/pull/16975 (review in progress by vanzin,
but PR currently stalled waiting on response) *
 Core:
  SPARK-20025  - Driver
fail over will not work, if SPARK_LOCAL* env is set. -
https://github.com/apache/spark/pull/17357 (waiting on review) *
 PySpark:
 SPARK-19955  - Update
run-tests to support conda [ Part of Dropping 2.6 support -- which we
shouldn't do in a minor release -- but also fixes pip installability tests
to run in Jenkins ] - PR failing Jenkins (I need to poke this some more,
but it seems like 2.7 support works with some other issues remaining. Maybe slip to 2.2?)

Minor issues:
 Tests:
  SPARK-19612  - Tests
failing with timeout - No PR per se, but it seems unrelated to the 2.1.1
release. It's not targeted for 2.1.1 but listed as affecting 2.1.1 - I'd
consider explicitly targeting this for 2.2?
 PySpark:
  SPARK-19570  - Allow
to disable hive in pyspark shell - https://github.com/apache/spark/pull/16906
PR exists but it's difficult to add automated tests for
this (although if SPARK-19955 gets in, it would make
testing this easier) - no reviewers yet. Possible re-target?*
 Structured Streaming:
  SPARK-19613  - Flaky
test: StateStoreRDDSuite.versioning and immutability - It's not targeted
for 2.1.1 but listed as affecting 2.1.1 - I'd consider explicitly targeting
this for 2.2?
 ML:
  SPARK-19759 
 - ALSModel.predict on Dataframes : potential optimization by not using
blas - No PR; consider re-targeting unless someone has a PR waiting in the
wings?

Explicitly targeted issues are marked with a *, the remaining issues are
listed as impacting 2.1.1 and don't have a specific target version set.

Since 2.1.1 continues the 2.1.0 branch, looking at 2.1.0 shows 1 open
blocker in SQL (SPARK-19983),

Query string is:

affectedVersion = 2.1.0 AND cf[12310320] is EMPTY AND project = spark AND
resolution = Unresolved AND priority = targetPriority

Continuing on for unresolved 2.1.0 issues in Major there are 163 (76 of
them in progress), 65 Minor (26 in progress), and 9 trivial (6 in progress).

I'll be going through the 2.1.0 major issues with open PRs that impact the
PySpark component and seeing if any of them should be targeted for 2.1.1;
if anyone from the other components wants to take a look through, we might
find some easy wins to be merged.

Cheers,

Holden :)

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Why are DataFrames always read with nullable=True?

2017-03-20 Thread Jason White
If I create a dataframe in Spark with non-nullable columns, and then save
that to disk as a Parquet file, the columns are properly marked as
non-nullable. I confirmed this using parquet-tools. Then, when loading it
back, Spark forces the nullable back to True.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L378

If I remove the `.asNullable` part, Spark performs exactly as I'd like by
default, picking up the data using the schema either in the Parquet file or
provided by me.

This particular LoC goes back a year now, and I've seen a variety of
discussions about this issue. In particular with Michael here:
https://www.mail-archive.com/user@spark.apache.org/msg39230.html. Those
seemed to be discussing writing, not reading, though, and writing is already
supported now.

Is this functionality still desirable? Is it potentially not applicable for
all file formats and situations (e.g. HDFS/Parquet)? Would it be suitable to
pass an option to the DataFrameReader to disable this functionality?
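
A minimal sketch reproducing the behavior described above, run in a spark-shell
session (the path and column name are just placeholders):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    val schema = StructType(Seq(StructField("id", LongType, nullable = false)))
    val df = spark.createDataFrame(
      spark.sparkContext.parallelize(Seq(Row(1L), Row(2L))), schema)

    df.write.mode("overwrite").parquet("/tmp/non_nullable_example")

    // The Parquet footer marks `id` as required, but on read Spark reports it
    // as nullable = true because the loaded schema goes through `.asNullable`.
    spark.read.parquet("/tmp/non_nullable_example").printSchema()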






Re: Should we consider a Spark 2.1.1 release?

2017-03-20 Thread Ted Yu
Timur:
Mind starting a new thread?

I have the same question as you.

> On Mar 20, 2017, at 11:34 AM, Timur Shenkao  wrote:
> 
> Hello guys,
> 
> Spark benefits from stable versions, not frequent ones.
> A lot of people still have 1.6.x in production. Those who want the freshest
> (like me) can always deploy nightly builds.
> My question is: how long will version 1.6 be supported?
> 
> On Sunday, March 19, 2017, Holden Karau  wrote:
>> This discussion seems like it might benefit from its own thread, as we've
>> previously decided to lengthen release cycles, but if there are different
>> opinions about this it seems unrelated to the specific 2.1.1 release.
>> 
>>> On Sun, Mar 19, 2017 at 2:57 PM Jacek Laskowski  wrote:
>>> Hi Mark,
>>> 
>>> I appreciate your comment.
>>> 
>>> My thinking is that the more frequent the minor and patch releases, the
>>> more often end users can give them a shot and be part of the bigger
>>> release cycle for major releases. Spark's an OSS project and we all
>>> can make mistakes, and my thinking is that the more eyeballs, the
>>> fewer the mistakes. If we make very fine/minor releases
>>> often, we should be able to attract more people who spend their time on
>>> testing/verification, which eventually contributes to a higher quality of
>>> Spark.
>>> 
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> https://medium.com/@jaceklaskowski/
>>> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
>>> Follow me at https://twitter.com/jaceklaskowski
>>> 
>>> 
>>> On Sun, Mar 19, 2017 at 10:50 PM, Mark Hamstra  
>>> wrote:
>>> > That doesn't necessarily follow, Jacek. There is a point where too 
>>> > frequent
>>> > releases decrease quality. That is because releases don't come for free --
>>> > each one demands a considerable amount of time from release managers,
>>> > testers, etc. -- time that would otherwise typically be devoted to 
>>> > improving
>>> > (or at least adding to) the code. And that doesn't even begin to consider
>>> > the time that needs to be spent putting a new version into a larger 
>>> > software
>>> > distribution or that users need to put in to deploy and use a new version.
>>> > If you have an extremely lightweight deployment cycle, then small, quick
>>> > releases can make sense; but "lightweight" doesn't really describe a Spark
>>> > release. The concern for excessive overhead is a large part of the 
>>> > thinking
>>> > behind why we stretched out the roadmap to allow longer intervals between
>>> > scheduled releases. A similar concern does come into play for unscheduled
>>> > maintenance releases -- but I don't think that that is the forcing 
>>> > function
>>> > at this point: A 2.1.1 release is a good idea.
>>> >
>>> > On Sun, Mar 19, 2017 at 6:24 AM, Jacek Laskowski  wrote:
>>> >>
>>> >> +1
>>> >>
>>> >> More smaller and more frequent releases (so major releases get even more
>>> >> quality).
>>> >>
>>> >> Jacek
>>> >>
>>> >> On 13 Mar 2017 8:07 p.m., "Holden Karau"  wrote:
>>> >>>
>>> >>> Hi Spark Devs,
>>> >>>
>>> >>> Spark 2.1 has been out since end of December and we've got quite a few
>>> >>> fixes merged for 2.1.1.
>>> >>>
>>> >>> On the Python side one of the things I'd like to see us get out into a
>>> >>> patch release is a packaging fix (now merged) before we upload to PyPI &
>>> >>> Conda, and we also have the normal batch of fixes like toLocalIterator 
>>> >>> for
>>> >>> large DataFrames in PySpark.
>>> >>>
>>> >>> I've chatted with Felix & Shivaram who seem to think the R side is
>>> >>> looking close to being in good shape for a 2.1.1 release to submit to CRAN
>>> >>> (if I've misspoken, my apologies). The two outstanding issues being
>>> >>> tracked for R are SPARK-18817, SPARK-19237.
>>> >>>
>>> >>> Looking at the other components quickly it seems like structured
>>> >>> streaming could also benefit from a patch release.
>>> >>>
>>> >>> What do others think - are there any issues people are actively 
>>> >>> targeting
>>> >>> for 2.1.1? Is this too early to be considering a patch release?
>>> >>>
>>> >>> Cheers,
>>> >>>
>>> >>> Holden
>>> >>> --
>>> >>> Cell : 425-233-8271
>>> >>> Twitter: https://twitter.com/holdenkarau
>>> >
>>> >
>> 
>> -- 
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau


Re: Should we consider a Spark 2.1.1 release?

2017-03-20 Thread Holden Karau
I think questions around how long the 1.6 series will be supported are
really important, but probably belong in a different thread than the 2.1.1
release discussion.

On Mon, Mar 20, 2017 at 11:34 AM Timur Shenkao  wrote:

> Hello guys,
>
> Spark benefits from stable versions, not frequent ones.
> A lot of people still have 1.6.x in production. Those who want the
> freshest (like me) can always deploy nightly builds.
> My question is: how long will version 1.6 be supported?
>
>
> [rest of the quoted thread trimmed; the earlier messages are quoted in full in Ted Yu's reply above]

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Should we consider a Spark 2.1.1 release?

2017-03-20 Thread Timur Shenkao
Hello guys,

Spark benefits from stable versions, not frequent ones.
A lot of people still have 1.6.x in production. Those who want the
freshest (like me) can always deploy nightly builds.
My question is: how long will version 1.6 be supported?

On Sunday, March 19, 2017, Holden Karau  wrote:

> This discussion seems like it might benefit from its own thread, as we've
> previously decided to lengthen release cycles, but if there are different
> opinions about this it seems unrelated to the specific 2.1.1 release.
>
> [rest of the quoted thread trimmed; the earlier messages are quoted in full in Ted Yu's reply above]


Re: how to retain part of the features in LogisticRegressionModel (spark2.0)

2017-03-20 Thread Yanbo Liang
Do you want to get a sparse model in which most of the coefficients are zeros?
If yes, using L1 regularization leads to sparsity. But the
LogisticRegressionModel coefficients vector's size is still equal to the
number of features; you can get the non-zero elements manually. Actually,
it would be a sparse vector (or matrix for the multinomial case) if it's sparse
enough.
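
For example, a minimal sketch of pulling out the non-zero coefficients
(assuming `model` is a fitted binary LogisticRegressionModel):

    // model.coefficients has one entry per feature; toSparse drops the explicit
    // zeros, leaving only the indices that survived L1 regularization.
    val sparseCoefs = model.coefficients.toSparse
    val selected: Array[(Int, Double)] = sparseCoefs.indices.zip(sparseCoefs.values)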

Thanks
Yanbo

On Sun, Mar 19, 2017 at 5:02 AM, Dhanesh Padmanabhan  wrote:

> It shouldn't be difficult to convert the coefficients to a sparse vector.
> Not sure if that is what you are looking for
>
> -Dhanesh
>
> On Sun, Mar 19, 2017 at 5:02 PM jinhong lu  wrote:
>
> Thanks Dhanesh,  and how about the features question?
>
> On Mar 19, 2017, at 19:08, Dhanesh Padmanabhan wrote:
>
> Dhanesh
>
>
> Thanks,
> lujinhong
>
> --
> Dhanesh
> +91-9741125245
>


Re: Issues: Generate JSON with null values in Spark 2.0.x

2017-03-20 Thread Chetan Khatri
Exactly.

On Sat, Mar 11, 2017 at 1:35 PM, Dongjin Lee  wrote:

> Hello Chetan,
>
> Could you post some code? If I understood correctly, you are trying to
> save JSON like:
>
> {
>   "first_name": "Dongjin",
>   "last_name": null
> }
>
> not in omitted form, like:
>
> {
>   "first_name": "Dongjin"
> }
>
> right?
>
> - Dongjin
>
> On Wed, Mar 8, 2017 at 5:58 AM, Chetan Khatri  > wrote:
>
>> Hello Dev / Users,
>>
>> I am working on migrating PySpark code to Scala. In Python, iterating over a
>> dictionary and generating JSON with nulls is possible with json.dumps(),
>> which is then converted to a SparkSQL Row. In Scala, how can we generate
>> JSON with null values as a DataFrame?
>>
>> Thanks.
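
A minimal Scala sketch of one way to do this (names, schema, and the output path
are placeholders): build Rows that contain nulls against an explicit schema, then
write them out. Note that Spark's JSON writer omits null fields, which is exactly
the omitted form shown above.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("first_name", StringType, nullable = true),
      StructField("last_name", StringType, nullable = true)))

    val rows = Seq(Row("Dongjin", null), Row("Chetan", "Khatri"))
    val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

    // The JSON datasource drops null fields, so the first record is written as
    // {"first_name":"Dongjin"} rather than with an explicit "last_name": null.
    df.write.mode("overwrite").json("/tmp/people_json")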
>>
>
>
>
> --
> *Dongjin Lee*
>
>
> *Software developer in Line+. So interested in massive-scale machine learning.*
> facebook: www.facebook.com/dongjin.lee.kr
> linkedin: kr.linkedin.com/in/dongjinleekr
> github: github.com/dongjinleekr
> twitter: www.twitter.com/dongjinleekr
>