Re: Calculate sum of values in 2nd element of tuple

2016-01-03 Thread robert_dodier
jimitkr wrote
> I've tried fold, reduce, and foldLeft, but with no success in my code
> below to calculate the total:
>
> val valuesForDEF=input.lookup("def")
> val totalForDEF: Int = valuesForDEF.toList.reduce((x: Int, y: Int) => x + y)
> println("THE TOTAL FOR DEF IS" + totalForDEF)

Hmm, what exactly is the error message you get? From what I can tell, that
should work as expected.
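
In case it's useful, here is a minimal sketch of how I would compute the
sum, assuming the RDD is keyed by String with List[Int] values as in your
second snippet below (the SparkContext setup is just for illustration). If
the values are plain Ints rather than Lists, the flatten step isn't needed:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("sum-example").setMaster("local[*]"))

    // key -> list of values
    val input = sc.parallelize(List(("abc", List(1, 2, 3, 4)), ("def", List(5, 6, 7, 8))))

    // lookup returns Seq[List[Int]], so flatten before summing
    val valuesForDEF = input.lookup("def")            // Seq(List(5, 6, 7, 8))
    val totalForDEF: Int = valuesForDEF.flatten.sum   // 26
    println("THE TOTAL FOR DEF IS " + totalForDEF)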


> Another query. What will be the difference between the following tuples
> when created:
>
>   val input = sc.parallelize(List(("abc", List(1,2,3,4)), ("def", List(5,6,7,8))))
>
>   val input = sc.parallelize(List(("abc", (1,2,3,4)), ("def", (5,6,7,8))))
>
> Is there a difference in how (1,2,3,4) and List(1,2,3,4) are handled?

Well, the difference is that (1, 2, 3, 4) is a Tuple4 instead of a List. In
Scala, tuples have a fixed number of elements and their elements can have
different types, but tuples are not collections, so the usual collection
methods (map, reduce, and so on) are not available on them. You can find
more discussion about that via a web search.

Depending on what you're trying to do, you'll prefer one or the other. I
believe in the example you gave before, you want List, since reduce is not
defined for tuples.
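
To make the distinction concrete, a quick sketch (values made up):

    val asList  = List(1, 2, 3, 4)   // a collection: has map, reduce, sum, etc.
    val asTuple = (1, 2, 3, 4)       // a Tuple4[Int, Int, Int, Int]: fixed arity, positional access

    asList.reduce(_ + _)             // 10
    // asTuple.reduce(_ + _)         // does not compile; Tuple4 has no reduce method
    asTuple._1 + asTuple._2 + asTuple._3 + asTuple._4   // 10, but each field must be named explicitly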

Hope this helps,

Robert Dodier







Re: translate algorithm in spark

2016-01-03 Thread robert_dodier
domibd wrote
> find(v, collection) : boolean
> begin
>   item = collection.first   // assuming collection has at least one item
>
>   while (item != v and collection has next item)
>     item = collection.nextItem
>
>   return item == v
> end

I'm not an expert, so take my advice with a grain of salt. Anyway, one idea
you can try is to write a search function that works on the values in one
partition -- that part is sequential, not parallel. Then call mapPartitions
to map that function over all partitions in an RDD. Presumably you will then
need to reduce the output of mapPartitions (which, I guess, will be a
collection of Boolean values, one per partition) by taking the logical
disjunction (i.e., a or b) of those values.
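
Roughly along these lines, just a sketch (not tested, function name made up):

    import org.apache.spark.rdd.RDD

    // Search each partition sequentially, then OR the one-Boolean-per-partition results.
    def findInRDD[T](rdd: RDD[T], v: T): Boolean =
      rdd.mapPartitions { iter =>
        Iterator(iter.contains(v))   // sequential scan within this partition
      }.reduce(_ || _)

One nice property is that the scan inside each partition can stop as soon as
it finds v, while the partitions themselves are searched in parallel.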

Hope this helps you figure out a solution.

Robert Dodier






Re: How to handle categorical variables in Spark MLlib?

2015-12-26 Thread robert_dodier
hokam chauhan wrote
> So how can the string value of a categorical variable be converted into
> double values to form the features vector?

Well, the key characteristic of these variables is that their values are not
ordered, so the representation you choose has to honor that. If the model
does some arithmetic on the inputs (e.g. a logistic regression model
computes a weighted sum of the inputs) or otherwise assumes an ordering of
values, then simply coding the categories as 1, 2, ..., n would impose a
spurious ordering. The appropriate representation in that case is the
so-called "one hot" representation, in which a categorical variable with n
possible values is represented as a vector of length n in which exactly one
element is 1 and the rest are 0.

Depending on the models you are using, other representations might be
possible. But a one-hot representation is widely applicable.
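
As a plain illustration of the idea (not any particular Spark API, names
made up):

    // One-hot encode a categorical value, given the full set of categories.
    // The ordering of `categories` just fixes which slot each value maps to.
    def oneHot(value: String, categories: Seq[String]): Array[Double] = {
      val vec = Array.fill(categories.length)(0.0)
      vec(categories.indexOf(value)) = 1.0
      vec
    }

    oneHot("Female", Seq("Male", "Female"))   // Array(0.0, 1.0)

I believe the ml.feature package also has StringIndexer and OneHotEncoder
transformers that do essentially this for DataFrames, if that fits your
pipeline.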


> Also, how can the weight for individual categories be calculated for
> models? For example, say we have Gender as a variable with categories Male
> and Female, and we want to give more weight to the Female category. How
> can this be accomplished?

Well, it probably depends on exactly what you mean by "more weight". If you
mean that one category is under-represented in the available data, and you
want to assume, let's say, that each datum in that category ought to count
the same as two data in another category, you could just create a data set
with an extra copy of those data. An equivalent method is to allow for
weighting the log-likelihood or other goodness of fit function. That's more
convenient and flexible (it allows for noninteger weights), but I don't
remember if Spark supports that. 
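
The duplication approach is just, roughly (assuming some case class for your
rows; names made up):

    case class Datum(label: Double, gender: String, features: Array[Double])

    // Count each Female datum twice by emitting an extra copy of the row.
    val reweighted = data.flatMap { d =>
      if (d.gender == "Female") Seq(d, d) else Seq(d)
    }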

If you mean some other kind of weighting, you'll have to explain more about
what you're trying to achieve.


> Also, is there a way to convert string values from raw text into a
> features vector (apart from the HashingTF-IDF transformation)?

I don't know any other method. Maybe someone else can suggest something.

best,

Robert Dodier






Re: Spark LogisticRegression returns scaled coefficients

2015-11-18 Thread robert_dodier
njoshi wrote
> I am testing the LogisticRegression performance on synthetically
> generated data.

Hmm, seems like a good idea. Can you give the code for generating the
training data?

best,

Robert Dodier


