Re: ALS block settings

2018-10-23 Thread evanzamir
I have the same question. I'm trying to figure out how to get ALS to complete with a larger dataset. It seems to get stuck on "Count" from what I can tell. I'm running 8 r4.4xlarge instances on Amazon EMR, and the dataset is 80 GB (just to give some idea of size). I assumed Spark could handle this, but ...
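
For context, the relevant knobs live on the ALS estimator itself: the number of user/item blocks controls how the factor matrices are partitioned, and checkpointing keeps the iterative lineage from growing unbounded. A minimal sketch, assuming a ratings DataFrame ratings_df with hypothetical userId/itemId/rating columns and a hypothetical S3 checkpoint directory:

from pyspark.ml.recommendation import ALS

# Checkpoint dir keeps ALS's iterative lineage short (path is a placeholder)
spark.sparkContext.setCheckpointDir("s3://my-bucket/als-checkpoints")

als = ALS(
    userCol="userId", itemCol="itemId", ratingCol="rating",
    rank=10, maxIter=10, regParam=0.1,
    numUserBlocks=64, numItemBlocks=64,   # more, smaller blocks for a large matrix
    checkpointInterval=5,                 # checkpoint every 5 iterations
    intermediateStorageLevel="MEMORY_AND_DISK",
    finalStorageLevel="MEMORY_AND_DISK",
)
model = als.fit(ratings_df)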

Is there a way to get the AUC metric for CrossValidator?

2016-09-29 Thread evanzamir
I'm using CrossValidator (in PySpark) to create a logistic regression model. There is "areaUnderROC", which I assume gives the AUC for the bestModel chosen by CV. But how do I get the areaUnderROC for the test data during cross-validation?
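
For reference, the AUC that CrossValidator optimizes comes from the evaluator you hand it, and the same evaluator can score the fitted model on held-out data. A sketch, assuming hypothetical train_df/test_df DataFrames with "features" and "label" columns:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(featuresCol="features", labelCol="label")
evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()

cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)
cv_model = cv.fit(train_df)

# Mean cross-validation AUC for each parameter combination
print(cv_model.avgMetrics)

# AUC of the best model on a separate held-out set
print(evaluator.evaluate(cv_model.transform(test_df)))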

weightCol doesn't seem to be handled properly in PySpark

2016-09-07 Thread evanzamir
When I try to use LinearRegression, it seems that unless a weight column is specified, it raises a py4j error. This seems odd because supposedly the default is weightCol=None, but when I explicitly pass weightCol=None to LinearRegression, I get this error.
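
One workaround is simply to leave the parameter unset instead of passing None explicitly. A sketch, assuming a hypothetical train_df with "features" and "label" columns:

from pyspark.ml.regression import LinearRegression

# Works: omit weightCol entirely so the default (no weights) applies
lr = LinearRegression(featuresCol="features", labelCol="label")
model = lr.fit(train_df)

# In the affected versions, passing the default explicitly triggered the py4j error:
# lr = LinearRegression(featuresCol="features", labelCol="label", weightCol=None)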

I noticed LinearRegression sometimes produces negative R^2 values

2016-09-06 Thread evanzamir
Am I misinterpreting what r2() in the LinearRegression model summary means? By definition, R^2 should never be a negative number!
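
For reference, the summary's r2 follows the usual definition R^2 = 1 - SS_res / SS_tot, and that quantity is negative whenever the residual sum of squares exceeds the total sum of squares, e.g. for a model fit with fitIntercept=False, or one evaluated on data it predicts worse than a constant mean predictor.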

How to convert an ArrayType to DenseVector within DataFrame?

2016-08-30 Thread evanzamir
I have a DataFrame with a column containing a list of numeric features to be used for a regression. When I run the regression, I get the following error:

pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 ...
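
Spark ML estimators expect an ml Vector column rather than an array column, so the usual fix is a small conversion step. A sketch, assuming a hypothetical array<double> column named features_array (newer Spark releases also ship pyspark.ml.functions.array_to_vector for the same purpose):

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

# Convert an array<double> column into an ml DenseVector column
to_vector = udf(lambda xs: Vectors.dense(xs), VectorUDT())
df = df.withColumn("features", to_vector("features_array"))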

DataFrameWriter bug after RandomSplit?

2016-08-22 Thread evanzamir
Trying to build an ML model using LogisticRegression, I ran into the following unexplainable issue. Here's a snippet of the code:

training, testing = data.randomSplit([0.8, 0.2], seed=42)
print("number of rows in testing = {}".format(testing.count()))
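
One thing worth noting (a guess, since the message is truncated): if the source DataFrame is recomputed between actions and its row order is not deterministic, the two halves of randomSplit can overlap or drop rows. Persisting the parent before splitting is a common workaround; a sketch:

# Materialize the parent once so both splits see the same underlying rows
data = data.persist()
training, testing = data.randomSplit([0.8, 0.2], seed=42)
print("number of rows in testing = {}".format(testing.count()))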

How to add custom steps to Pipeline models?

2016-08-12 Thread evanzamir
I'm building an LDA Pipeline, currently with four steps: Tokenizer, StopWordsRemover, CountVectorizer, and LDA. I would like to add more steps, for example stemming and lemmatization, and also 1-grams and 2-grams (which I believe is not supported by the default NGram class). Is there a way to add ...
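
For what it's worth, any Transformer subclass can be dropped into a Pipeline alongside the built-in stages. A minimal sketch of a custom step (the Stemmer class, its column names, and the tokenizer/remover/count_vectorizer/lda stage variables are all hypothetical; the stemming logic is a placeholder):

from pyspark.ml import Pipeline, Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

class Stemmer(Transformer, HasInputCol, HasOutputCol):
    # Hypothetical custom stage; swap the lambda for a real stemmer (e.g. NLTK)
    def __init__(self, inputCol=None, outputCol=None):
        super(Stemmer, self).__init__()
        self._set(inputCol=inputCol, outputCol=outputCol)

    def _transform(self, df):
        stem = udf(lambda words: [w.rstrip("s") for w in words],
                   ArrayType(StringType()))
        return df.withColumn(self.getOutputCol(), stem(self.getInputCol()))

pipeline = Pipeline(stages=[tokenizer, remover,
                            Stemmer(inputCol="filtered", outputCol="stemmed"),
                            count_vectorizer, lda])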