Hi,
There is a parameter in HashingTF called "numFeatures". I was wondering
what the best way is to set this parameter. In a use case like
text categorization, do you need to know in advance the number of words in
your vocabulary, or do you set it to a large value, greater than the
vocabulary size?
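A common rule of thumb is to set numFeatures well above the expected vocabulary size so hash collisions stay rare (the spark.mllib HashingTF in the 1.x line defaults to 2^20). Below is a minimal, self-contained sketch of the hashing trick it relies on, assuming the feature index is the non-negative hash of the term modulo numFeatures; no Spark is required to run it:

```scala
// Sketch of the hashing trick behind HashingTF (assumption: index =
// non-negative term.hashCode modulo numFeatures, as spark.mllib does).
object HashingTrickSketch {
  // Map a term to a feature index in [0, numFeatures).
  def termIndex(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  def main(args: Array[String]): Unit = {
    val terms = Seq("spark", "mllib", "hashing", "tf")
    // Every index stays in range regardless of vocabulary size;
    // a small numFeatures just makes collisions more likely.
    for (t <- terms; n <- Seq(16, 1 << 20)) {
      val i = termIndex(t, n)
      assert(i >= 0 && i < n)
    }
    println(terms.map(t => t -> termIndex(t, 1 << 20)))
  }
}
```

So you do not need the exact vocabulary size in advance; you only pay for collisions when numFeatures is too small relative to it.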
From: Jianguo Li
Date: Monday, June 22, 2015 at 6:21 PM
To: Silvio Fiorito
Cc: user@spark.apache.org
Subject: Re: workaround for groupByKey
Thanks for your suggestion. I guess aggregateByKey is similar to
combineByKey. I read in the Learning Spark book:
*We can disable map-side aggregation
Hi,
I am processing an RDD of key-value pairs. The key is a user_id, and the
value is a website URL the user has visited.
Since I need to know all the URLs each user has visited, I am tempted to
call groupByKey on this RDD. However, since there could be millions of
users and urls,
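For context, one workaround that avoids shuffling every value per key is aggregateByKey with a set accumulator, e.g. rdd.aggregateByKey(Set.empty[String])(_ + _, _ ++ _), which combines map-side and deduplicates as it goes. A plain-Scala sketch of those semantics on local data (illustration only, no Spark required):

```scala
// Local sketch of per-key set aggregation: the semantics of
// rdd.aggregateByKey(Set.empty[String])(_ + _, _ ++ _) on an RDD.
object UrlsPerUserSketch {
  // Fold each (user, url) pair into a per-user set, deduplicating as we
  // go rather than materializing every value per key as groupByKey does.
  def urlsPerUser(visits: Seq[(String, String)]): Map[String, Set[String]] =
    visits.foldLeft(Map.empty[String, Set[String]]) {
      case (acc, (user, url)) =>
        acc.updated(user, acc.getOrElse(user, Set.empty[String]) + url)
    }

  def main(args: Array[String]): Unit = {
    val visits = Seq(("u1", "a.com"), ("u2", "b.com"),
                     ("u1", "c.com"), ("u1", "a.com"))
    val perUser = urlsPerUser(visits)
    assert(perUser("u1") == Set("a.com", "c.com"))
    println(perUser)
  }
}
```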
Hi,
I am training a model using the logistic regression algorithm in ML. I was
wondering if there is any API to access the weight vector (i.e., the
coefficients for each feature). I need those coefficients for real-time
predictions.
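In the 1.x spark.mllib API, the trained LogisticRegressionModel exposes weights and intercept fields, so the coefficients can be pulled out and used in a lightweight scorer outside Spark. A self-contained sketch of such a scorer, assuming binary logistic regression (the coefficient values below are made up for illustration):

```scala
// Sketch: real-time scoring with coefficients extracted from a model
// (assumption: weights/intercept were read off a trained spark.mllib
// LogisticRegressionModel via model.weights / model.intercept).
object RealTimeScorer {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Probability of the positive class for one feature vector.
  def predictProb(weights: Array[Double], intercept: Double,
                  features: Array[Double]): Double = {
    require(weights.length == features.length)
    val margin = weights.zip(features).map { case (w, x) => w * x }.sum + intercept
    sigmoid(margin)
  }

  def main(args: Array[String]): Unit = {
    val w = Array(0.5, -1.0)                       // made-up coefficients
    val p = predictProb(w, 0.1, Array(2.0, 1.0))   // margin = 0.1 > 0
    assert(p > 0.5 && p < 1.0)
    println(p)
  }
}
```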
Thanks,
Jianguo
Hi,
In GeneralizedLinearAlgorithm, which Logistic Regression relies on, it
says that if useFeatureScaling is enabled, we will standardize the training
features and train the model in the scaled space. Then we transform
the coefficients from the scaled space back to the original space.
My
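Concretely, when the scaling divides each feature by its standard deviation, a scaled-space coefficient maps back to the original space by dividing it by that same standard deviation, and the margin is unchanged. A small self-contained sketch with made-up numbers, assuming std-only scaling:

```scala
// Sketch: coefficients fit in the scaled space (x / std) give the same
// margin as back-transformed coefficients applied to the original data.
// (Assumption: std-only standardization; all numbers are made up.)
object CoefficientRescaleSketch {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def main(args: Array[String]): Unit = {
    val std = Array(2.0, 0.5)
    val wScaled = Array(1.2, -0.4)   // coefficients fit on scaled features
    val x = Array(3.0, 8.0)
    // Back-transform: w_orig(i) = w_scaled(i) / std(i)
    val wOrig = wScaled.zip(std).map { case (w, s) => w / s }
    val xScaled = x.zip(std).map { case (v, s) => v / s }
    // Same margin in either space, so predictions agree.
    assert(math.abs(dot(wScaled, xScaled) - dot(wOrig, x)) < 1e-12)
    println(dot(wOrig, x))
  }
}
```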
Hi,
I am using the utility function kFold provided in Spark for doing k-fold
cross validation using logistic regression. However, each time I run the
experiment, I get a different result. Since everything else stays
constant, I was wondering if this is due to the kFold function I used.
at 4:12 PM, Jianguo Li flyingfromch...@gmail.com
wrote:
Hi,
I am using the utility function kFold provided in Spark for doing k-fold
cross validation using logistic regression. However, each time I run the
experiment, I get a different result. Since everything else stays
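One thing to check is the seed: MLUtils.kFold takes an explicit seed argument, and the SGD-based optimizers also sample randomly, so results vary across runs unless the seeds are held fixed. A plain-Scala sketch (hypothetical splitting logic, not MLlib's implementation) showing how a fixed seed makes the fold assignment reproducible:

```scala
import scala.util.Random

// Sketch: seeded fold assignment is reproducible run to run.
// (Hypothetical splitting logic for illustration; MLUtils.kFold in
// Spark accepts a seed parameter for the same reason.)
object SeededKFoldSketch {
  // Assign each of n examples to one of numFolds folds, deterministically
  // for a given seed.
  def foldOf(n: Int, numFolds: Int, seed: Long): Array[Int] = {
    val rng = new Random(seed)
    Array.fill(n)(rng.nextInt(numFolds))
  }

  def main(args: Array[String]): Unit = {
    val a = foldOf(100, 5, seed = 42L)
    val b = foldOf(100, 5, seed = 42L)
    assert(a.sameElements(b))   // same seed => identical fold assignment
    println(a.take(10).mkString(","))
  }
}
```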
Hi,
I created some unit tests for some of the functions in my project that
use Spark. However, when I built it with sbt and then ran
sbt test, I ran into java.io.IOException: Could not create FileClient:
2015-01-19 08:50:38,1894 ERROR Client
Hi,
I am using the sbt tool to build and run the Scala tests related to Spark.
In my /src/test/scala directory, there are two test classes (TestA, TestB),
both of which use the class in Spark for creating a SparkContext, something
like
trait LocalTestSparkContext extends BeforeAndAfterAll { self:
I solved the issue. In case anyone else is looking for an answer: by
default, ScalaTest executes all the tests in parallel. To disable this,
just put the following line in your build.sbt:
parallelExecution in Test := false
Thanks
On Wed, Jan 14, 2015 at 2:30 PM, Jianguo Li flyingfromch
I am using Spark-1.1.1. When I used sbt test, I ran into the
following exceptions. Any idea how to solve it? Thanks! I think
somebody posted this question before, but no one seemed to have
answered it. Could it be the version of io.netty I put in my
build.sbt? I included a dependency
Hi,
I am trying to build my own Scala project using sbt. The project is
dependent on both spark-core and spark-mllib. I included the following two
dependencies in my build.sbt file:
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.1.1"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.1"
Hi,
A while ago, somebody asked about getting a confidence value for a
prediction with MLlib's implementation of Naive Bayes classification.
I was wondering if there is any plan in the near future for the predict
function to return both a label and a confidence/probability? Or could the
private
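Until such an API lands, the class probabilities can be reconstructed by hand from the model's log class priors and log conditional probabilities (pi and theta on the 1.x spark.mllib NaiveBayesModel). A self-contained sketch of the scoring-and-normalization step, with made-up numbers:

```scala
// Sketch: turning multinomial naive Bayes log scores into probabilities.
// (Assumption: logPrior/logTheta mirror NaiveBayesModel.pi / .theta;
// the numbers in main are made up for illustration.)
object NaiveBayesConfidenceSketch {
  // counts: term counts for one document; returns P(class | document).
  def posteriors(logPrior: Array[Double], logTheta: Array[Array[Double]],
                 counts: Array[Double]): Array[Double] = {
    // Unnormalized log score per class: log p(c) + sum_i x_i * log p(w_i|c)
    val logScores = logPrior.indices.map { c =>
      logPrior(c) + counts.zip(logTheta(c)).map { case (x, lt) => x * lt }.sum
    }
    // Normalize in log space for numerical stability, then exponentiate.
    val m = logScores.max
    val exp = logScores.map(s => math.exp(s - m))
    val z = exp.sum
    exp.map(_ / z).toArray
  }

  def main(args: Array[String]): Unit = {
    val logPrior = Array(math.log(0.5), math.log(0.5))
    val logTheta = Array(
      Array(math.log(0.8), math.log(0.2)),
      Array(math.log(0.3), math.log(0.7)))
    val p = posteriors(logPrior, logTheta, Array(2.0, 1.0))
    assert(math.abs(p.sum - 1.0) < 1e-9)
    assert(p(0) > p(1))   // the dominant first feature favors class 0
    println(p.mkString(", "))
  }
}
```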