Re: On convenience methods

2016-10-14 Thread Reynold Xin
It is very difficult to give a general answer. We would need to discuss each case. In general, for things that are trivially doable with existing APIs, it is not a good idea to provide them, unless it is for compatibility with other frameworks (e.g. Pandas). On Fri, Oct 14, 2016 at 5:38 PM, roehst

On convenience methods

2016-10-14 Thread roehst
Hi, I sometimes write convenience methods for pre-processing data frames, and I wonder if it makes sense to make a contribution -- should this be included in Spark or supplied as Spark Packages/3rd party libraries? Example: Get all fields in a DataFrame schema of a certain type. I end up
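
A minimal sketch of the kind of helper described above, written in PySpark (the function name is hypothetical, not an existing Spark API):

    from pyspark.sql import DataFrame
    from pyspark.sql.types import StringType

    def fields_of_type(df: DataFrame, data_type):
        # Return the names of all top-level schema fields with the given data type.
        # Hypothetical convenience method, not part of the Spark API.
        return [f.name for f in df.schema.fields if f.dataType == data_type]

    # Usage: keep only the string columns of a DataFrame.
    # string_cols = fields_of_type(df, StringType())
    # df.select(*string_cols)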

Re: source for org.spark-project.hive:1.2.1.spark2

2016-10-14 Thread Ryan Blue
The Spark 2 branch is based on this one: https://github.com/JoshRosen/hive/commits/release-1.2.1-spark2 rb On Fri, Oct 14, 2016 at 4:33 PM, Ethan Aubin wrote: > In an email thread [1] from Aug 2015, it was mentioned that the source > to org.spark-project.hive was at >

Re: cutting 1.6.3 release candidate

2016-10-14 Thread Reynold Xin
I took a look at the pull request for memory management and I actually agree with the existing assessment that the patch is too big and risky to port into an existing maintenance branch. Things that are backported are low-risk patches that won't break existing applications on 1.6.x. This patch is

source for org.spark-project.hive:1.2.1.spark2

2016-10-14 Thread Ethan Aubin
In an email thread [1] from Aug 2015, it was mentioned that the source to org.spark-project.hive was at https://github.com/pwendell/hive/commits/release-1.2.1-spark . That branch has a 1.2.1.spark version, but Spark 2.0.1 uses 1.2.1.spark2. Could anyone point me to the repo for 1.2.1.spark2? Thanks

Re: cutting 1.6.3 release candidate

2016-10-14 Thread Alexander Pivovarov
Also, can you include the MaxPermSize fix in Spark 1.6.3? https://issues.apache.org/jira/browse/SPARK-15067 Literally just one word needs to be replaced: https://github.com/apache/spark/pull/12985/files On Fri, Oct 14, 2016 at 1:57 PM, Alexander Pivovarov wrote: > Hi Reynold > >

Re: cutting 1.6.3 release candidate

2016-10-14 Thread Alexander Pivovarov
Hi Reynold Spark 1.6.x has a serious bug related to shuffle functionality https://issues.apache.org/jira/browse/SPARK-14560 https://issues.apache.org/jira/browse/SPARK-4452 Shuffle throws an OOM under heavy load. I've seen this error several times on my heavy jobs: java.lang.OutOfMemoryError: Unable

cutting 1.6.3 release candidate

2016-10-14 Thread Reynold Xin
It's been a while and we have fixed a few bugs in branch-1.6. I plan to cut rc1 for 1.6.3 next week (just in time for Spark Summit Europe). Let me know if there are specific issues that should be addressed before that. Thanks.

Re: DataFrameReader Schema Supersedes Schema Provided by Encoder, Renders Fields Nullable

2016-10-14 Thread Michael Armbrust
> Additionally, shall I go ahead and open a ticket pointing out the missing call to .asNullable in the streaming reader? Yes please! This probably affects correctness.

Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-14 Thread Fred Reiss
I think the way I phrased things earlier may be leading to some confusion here. When I said "don't bring down my application", I was referring to the application not meeting its end-to-end SLA, not to the app server crashing. The groups I've talked to already isolate their front-end systems from

Re: DataFrameReader Schema Supersedes Schema Provided by Encoder, Renders Fields Nullable

2016-10-14 Thread Aleksander Eskilson
I've opened a Jira for the issue you requested earlier: https://issues.apache.org/jira/browse/SPARK-17939 On Fri, Oct 14, 2016 at 9:24 AM Aleksander Eskilson wrote: > Interesting. I'm quite glad to read your explanation, it makes some of our work quite a bit more

Re: DataFrameReader Schema Supersedes Schema Provided by Encoder, Renders Fields Nullable

2016-10-14 Thread Aleksander Eskilson
Interesting. I'm quite glad to read your explanation; it makes some of our work quite a bit more clear. I'll open a ticket in a similar vein to this discussion: https://github.com/apache/spark/pull/11785, contrasting nullability implementation as optimization versus enforcement. Additionally,
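
For reference, a small PySpark sketch of the behavior under discussion (the file path is illustrative, and the exact output may vary by Spark version): a schema declared with nullable=False can come back nullable after a read, because the reader treats nullability as an optimization hint rather than enforcing it.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    spark = SparkSession.builder.getOrCreate()

    declared = StructType([
        StructField("id", LongType(), nullable=False),
        StructField("name", StringType(), nullable=False),
    ])

    # Read with an explicit schema; in the versions discussed here the reader
    # may mark these fields nullable regardless of the declaration.
    df = spark.read.schema(declared).json("people.json")
    df.printSchema()  # fields may show nullable = true despite the schema above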

Re: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-10-14 Thread mariusvniekerk
So, for the Jupyter integration pieces: I've made a simple library ( https://github.com/MaxPoint/spylon ) which provides a simpler way of creating a SparkContext (with all the parameters available to spark-submit) as well as some usability enhancements, progress
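
For comparison, the kind of hand-rolled setup such a helper can hide looks roughly like this in plain PySpark (standard PySpark API, not the spylon API; the values are illustrative):

    from pyspark import SparkConf, SparkContext

    # Manually assembled configuration, the boilerplate a notebook helper can take care of.
    conf = (SparkConf()
            .setAppName("notebook-session")
            .setMaster("local[*]")  # or a YARN / standalone master URL
            .set("spark.executor.memory", "2g"))

    sc = SparkContext(conf=conf)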

Re: Regularized Logistic regression

2016-10-14 Thread aditya1702
I used the cross-validator for tuning the parameters. My code is here: from pyspark.ml.classification import LogisticRegression from pyspark.ml.tuning import ParamGridBuilder, CrossValidator from pyspark.ml.evaluation import BinaryClassificationEvaluator reg=100.0
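
For readers following along, a runnable sketch of that tuning setup (the column names, grid values, and the 'train' DataFrame are illustrative assumptions, not taken from the original message):

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    # 'train' is assumed to be a DataFrame with 'features' and 'label' columns.
    lr = LogisticRegression(featuresCol="features", labelCol="label")

    # Grid over the regularization parameter; 100.0 matches the reg value in the message.
    grid = (ParamGridBuilder()
            .addGrid(lr.regParam, [0.01, 1.0, 100.0])
            .build())

    cv = CrossValidator(estimator=lr,
                        estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(),
                        numFolds=3)

    model = cv.fit(train)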