Re: Crossvalidator after fit

2017-05-05 Thread Bryan Cutler
It looks like there might be a problem with the way you specified your parameter values; you probably have an integer value where a floating-point value is expected. Double-check that, and if there is still a problem, please share the rest of your code so we can see how you defined "gridS". On Fri, May 5,
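
The integer-versus-float mix-up described above can be avoided by converting every candidate value before the grid is built. A minimal sketch, assuming the grid values feed a parameter that expects doubles; the helper name `as_float_grid` and the sample values are hypothetical and do not come from the thread:

```python
def as_float_grid(values):
    """Return the candidate values as floats.

    Hypothetical helper: a value written as 1 (int) becomes 1.0, so a
    parameter that expects a double never receives an integer.
    """
    return [float(v) for v in values]


# Example candidates, mixing ints and floats as is easy to do by accident.
candidates = as_float_grid([0, 0.1, 1])

# With PySpark installed, the converted list could then be handed to
# ParamGridBuilder().addGrid(<some double-valued param>, candidates).
print(candidates)
```

Converting in one place keeps the grid definition readable and makes the intended type explicit.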

Re: Structured Streaming + initialState

2017-05-05 Thread Tathagata Das
Can you explain how your initial state is stored? Is it in a file, or is it in a database? If it's in a database, then when you initialize the GroupState, you can fetch it from the database. On Fri, May 5, 2017 at 7:35 AM, Patrick McGloin wrote: > Hi all, > > With Spark
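
The suggestion above amounts to seeding the per-key state from the external store the first time a key is seen. The following is a plain-Python illustration of that pattern only, not the Spark GroupState API; the dict standing in for state, the `load_initial` callback, and the sample data are all assumptions made for the sketch:

```python
def update_running_total(key, new_values, state, load_initial):
    """Add new_values to the running total kept for `key`.

    `state` stands in for per-key streaming state (a plain dict here);
    `load_initial` stands in for a database lookup and is called only the
    first time a key is seen.
    """
    if key not in state:
        # First batch for this key: seed the state from the external store.
        state[key] = load_initial(key)
    state[key] += sum(new_values)
    return state[key]


# Stand-in for the database holding the initial state (hypothetical data).
initial_totals = {"user-a": 100}
state = {}


def lookup(key):
    """Pretend database read; unknown keys start from zero."""
    return initial_totals.get(key, 0)


first = update_running_total("user-a", [1, 2], state, lookup)
second = update_running_total("user-a", [5], state, lookup)
print(first, second)
```

The lookup happens once per key, so later batches for the same key only touch the in-memory state.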

is Spark Application code dependent on which mode we run?

2017-05-05 Thread kant kodali
Hi All, Does an rdd.collect() call work in client mode but not in cluster mode? If so, is there a way for the application to know which mode it is running in? It looks like in cluster mode we don't need to call rdd.collect(); instead we can just call rdd.first() or whatever. Thanks!
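
One way an application might find out which mode it was launched in is to read the `spark.submit.deployMode` property from its configuration. A minimal sketch, assuming that property is set by the launcher; the function name `deploy_mode` and the fallback to "client" are choices made for this illustration:

```python
def deploy_mode(conf):
    """Return the deploy mode string from a conf-like object.

    `conf` only needs a get(key, default) method; both a plain dict and
    pyspark.SparkConf provide one. Falling back to "client" when the key
    is absent is an assumption of this sketch.
    """
    return conf.get("spark.submit.deployMode", "client")


# Plain dicts stand in for a real SparkConf so the sketch runs anywhere.
print(deploy_mode({"spark.submit.deployMode": "cluster"}))
print(deploy_mode({}))
```

In a PySpark application the same function could presumably be called with the configuration object of the running context instead of a dict.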

Re: Spark books

2017-05-05 Thread Jacek Laskowski
Thanks Stephen! I appreciate it very much. And yeah...Stephen is right on this. Go and read the notes and let me know where you're missing things :-) p.s. Holden has just announced that her book is complete, and I think Matei is also quite far along with his writing. Jacek On 4 May 2017 2:52 a.m.,

Re: [Spark Streaming] Dynamic Broadcast Variable Update

2017-05-05 Thread Pierce Lamb
Hi Nipun, To expand a bit, you might find this Stack Overflow answer useful: http://stackoverflow.com/a/39753976/3723346 Most Spark + database combinations can handle a use case like this. Hope this helps, Pierce On Thu, May 4, 2017 at 9:18 AM, Gene Pang wrote: > As Tim

Re: Where is release 2.1.1?

2017-05-05 Thread darren
Thanks. It looks like they posted the release just now, because it wasn't showing before. On Fri, May 5, 2017 at 11:04 AM -0400, "Jules Damji" wrote: Go to this link: http://spark.apache.org/downloads.html Cheers, Jules Sent from

how to get assertDataFrameEquals ignore nullable

2017-05-05 Thread A Shaikh
As part of TDD, I am using com.holdenkarau.spark.testing.DatasetSuiteBase to assert that the values of two DataFrames are equal, using assertDataFrameEquals(dataframe1, dataframe2). Although the values are the same, the assertion fails because the nullable property does not match for some columns. Is there a way

Where is release 2.1.1?

2017-05-05 Thread darren
Hi, The website says it is released. Where can it be downloaded? Thanks

Crossvalidator after fit

2017-05-05 Thread issues solution
Hi, I get the following error after trying to perform grid search and cross-validation on a random forest estimator for classification: rf = RandomForestClassifier(labelCol="Labeld",featuresCol="features") evaluator = BinaryClassificationEvaluator(metricName="F1 Score") rf_cv =

Structured Streaming + initialState

2017-05-05 Thread Patrick McGloin
Hi all, With Spark Structured Streaming, is there a possibility to set an "initial state" for a query? Using a join between a streaming Dataset and a static Dataset does not support full joins. Using mapGroupsWithState to create a GroupState does not support an initialState (as the Spark

Reading ORC file - fine on 1.6; GC timeout on 2+

2017-05-05 Thread Nick Chammas
I have this ORC file that was generated by a Spark 1.6 program. It opens fine in Spark 1.6 with 6GB of driver memory, and probably less. However, when I try to open the same file in Spark 2.0 or 2.1, I get GC timeout exceptions. And this is with 6, 8, and even 10GB of driver memory. This is

hbase + spark + hdfs

2017-05-05 Thread mathieu ferlay
Hi everybody. I'm totally new to Spark, and there is one thing I want to know that I have not managed to find. I have a full Ambari install with HBase, Hadoop and Spark. My code reads and writes to HDFS via HBase. Thus, as I understand it, all stored data is in byte format in HDFS. Now, I know that it's possible

Re: imbalance classe inside RANDOMFOREST CLASSIFIER

2017-05-05 Thread DB Tsai
We have the weighting algorithms implemented in the linear models, but unfortunately they are not implemented in the tree models. It's an important feature, and a PR would be welcome! Thanks. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID:

org.apache.hadoop.fs.FileSystem: Provider tachyon.hadoop.TFS could not be instantiated

2017-05-05 Thread Jone Zhang
*When I use Spark SQL, I get the following error:* 17/05/05 15:58:44 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 20.0 (TID 4080, 10.196.143.233): java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider tachyon.hadoop.TFS could not be instantiated at

imbalance classe inside RANDOMFOREST CLASSIFIER

2017-05-05 Thread issues solution
Hi, in scikit-learn we have the sample_weight option, which lets us pass an array to balance the class categories, by calling something like rf.fit(X,Y,sample_weight=[10 10 10 ...1 1 10 ]). I am wondering whether an equivalent exists in the ml or mllib classes? If yes, can I ask for a reference or an example? Thx for
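
A per-row weight array like the one described above can be derived from the labels themselves. A minimal sketch in plain Python, assuming the common "balanced" heuristic of n_samples / (n_classes * class_count); the function name `balanced_sample_weights` and the toy labels are hypothetical:

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Return one weight per row, larger for rarer classes.

    Each row gets n_samples / (n_classes * count_of_its_class), so every
    class contributes the same total weight.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return [n_samples / (n_classes * counts[y]) for y in labels]


# Toy labels: class 0 appears three times, class 1 once.
weights = balanced_sample_weights([0, 0, 0, 1])
print(weights)
```

The resulting list could be passed as per-sample weights to any estimator that accepts them, or stored as a weight column alongside the data.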

Re: map/foreachRDD equivalent for pyspark Structured Streaming

2017-05-05 Thread peay
Hello, So, I assume there is nothing in structured streaming to apply/transform based on a function that takes a dataframe as input and outputs a dataframe? UDAFs are kind of low level and require you to implement merge and process individual rows, AFAIK (and are not available in