I have the same question. I'm trying to figure out how to get ALS to complete
with a larger dataset; from what I can tell, it gets stuck on "Count". I'm
running 8 r4.4xlarge instances on Amazon EMR, and the dataset is 80 GB (just
to give some idea of size). I assumed Spark could handle this, but maybe
--
I'm using CrossValidator (in PySpark) to create a logistic regression model.
There is "areaUnderROC", which I assume gives the AUC for the bestModel
chosen by CV. But how do I get the areaUnderROC for the test data during
cross-validation?
--
When I try to use LinearRegression, it seems that unless a weights column is
specified, it raises a py4j error. This seems odd because supposedly the
default is weightCol=None, yet when I explicitly pass weightCol=None to
LinearRegression, I get this error.
--
Am I misinterpreting what r2() in the LinearRegressionModel summary means?
By definition, R^2 should never be a negative number!
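You're not misinterpreting it, but the definition has a caveat: R^2 = 1 - SS_res/SS_tot is only guaranteed non-negative for an ordinary least-squares fit *with an intercept*, evaluated on its own training data. With regularization, fitIntercept=False, or evaluation on other data, SS_res can exceed SS_tot and R^2 goes negative; it just means the model predicts worse than a constant at the mean. A plain-Python illustration:

```python
def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, per the usual definition."""
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0]
print(r2(y, y))                 # perfect fit -> 1.0
print(r2(y, [2.0, 2.0, 2.0]))   # predicting the mean -> 0.0
print(r2(y, [3.0, 3.0, 0.0]))   # worse than the mean -> -6.0
```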
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/I-noticed-LinearRegression-sometimes-produces-negative-R-2-values-tp27667.html
--
I have a DataFrame with a column containing a list of numeric features to be
used for a regression. When I run the regression, I get the following error:
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Column
features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7
--
Trying to build an ML model using LogisticRegression, I ran into the
following unexplainable issue. Here's a snippet of the code:

training, testing = data.randomSplit([0.8, 0.2], seed=42)
print("number of rows in testing = {}".format(testing.count()))
print("number of rows in training = {}".format(training.count()))
--
I'm building an LDA Pipeline, currently with 4 stages: Tokenizer,
StopWordsRemover, CountVectorizer, and LDA. I would like to add more stages,
for example stemming and lemmatization, and also combined 1-grams and
2-grams (which I believe the default NGram class does not support). Is there
a way to add these stages?