Re: Spark distinct() returns incorrect results for some types?

2015-06-11 Thread Crystal Xing
. On June 11, 2015, at 3:17 PM, Sean Owen so...@cloudera.com wrote: Yep you need to use a transformation of the raw value; use toString for example. On Thu, Jun 11, 2015, 8:54 PM Crystal Xing crystalxin...@gmail.com wrote: That is a little scary. So you mean in general, we shouldn't use

Spark distinct() returns incorrect results for some types?

2015-06-11 Thread Crystal Xing
I load a list of ids from a text file as NLineInputFormat, and when I do distinct(), it returns an incorrect number. JavaRDD<Text> idListData = jvc.hadoopFile(idList, NLineInputFormat.class, LongWritable.class, Text.class).values().distinct(); I should have

Re: Spark distinct() returns incorrect results for some types?

2015-06-11 Thread Crystal Xing
to them since they change. So you may have a bunch of copies of one object at the end that become just one in each partition. On Thu, Jun 11, 2015, 8:36 PM Crystal Xing crystalxin...@gmail.com wrote: I load a list of ids from a text file as NLineInputFormat, and when I do distinct
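
A minimal sketch of the pitfall described in this thread, written in plain Java so it runs without Spark or Hadoop. `MutableText` is a hypothetical stand-in for Hadoop's `Text`, and the loop only imitates a record reader that overwrites one instance per record; it is not Spark's actual code path. Keeping references to the reused object collapses everything to one value, while copying with `toString()` first (the transformation suggested in the replies) keeps the values apart.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class ReuseDemo {
    /** Hypothetical stand-in for a mutable, reusable value such as Hadoop's Text. */
    static final class MutableText {
        private String value = "";
        void set(String v) { value = v; }
        @Override public String toString() { return value; }
        @Override public boolean equals(Object o) {
            return o instanceof MutableText && ((MutableText) o).value.equals(value);
        }
        @Override public int hashCode() { return value.hashCode(); }
    }

    /** Buffers references to the single reused instance, then counts distinct values. */
    static int distinctKeepingReferences(List<String> lines) {
        MutableText reused = new MutableText();
        List<MutableText> buffered = new ArrayList<>();
        for (String line : lines) {
            reused.set(line);      // the "reader" overwrites the same object
            buffered.add(reused);  // every element points at that one object
        }
        return new HashSet<>(buffered).size();
    }

    /** Copies each value to an immutable String before buffering. */
    static int distinctCopyingToString(List<String> lines) {
        MutableText reused = new MutableText();
        List<String> buffered = new ArrayList<>();
        for (String line : lines) {
            reused.set(line);
            buffered.add(reused.toString()); // independent copy of the current value
        }
        return new HashSet<>(buffered).size();
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("a", "b", "a", "c");
        System.out.println(distinctKeepingReferences(ids)); // prints 1
        System.out.println(distinctCopyingToString(ids));   // prints 3
    }
}
```

In the Spark job from the question, the analogous change would presumably be to map each `Text` to its `toString()` before calling `distinct()`.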

Re: how to map and filter in one step?

2015-02-26 Thread Crystal Xing
...@cloudera.com wrote: You can flatMap: rdd.flatMap { in => if (condition(in)) { Some(transformation(in)) } else { None } } On Thu, Feb 26, 2015 at 6:39 PM, Crystal Xing crystalxin...@gmail.com wrote: Hi, I have a text file input and I want to parse line by line and map each
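
The same flatMap idea can be sketched in plain Java with `java.util.stream`, which has the same shape as the Scala snippet in the reply: each input yields either one transformed element or nothing. The `keep` and `transform` helpers are hypothetical placeholders for the poster's own condition and parsing logic, and this uses Java streams rather than a Spark RDD.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MapFilterDemo {
    /** Hypothetical filter condition: drop empty lines and comment lines. */
    static boolean keep(String line) {
        return !line.isEmpty() && !line.startsWith("#");
    }

    /** Hypothetical per-line transformation. */
    static String transform(String line) {
        return line.trim().toUpperCase();
    }

    /** Maps and filters in one pass: zero or one output element per input line. */
    static List<String> parse(List<String> lines) {
        return lines.stream()
                .flatMap(line -> keep(line)
                        ? Stream.of(transform(line))
                        : Stream.<String>empty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(parse(Arrays.asList("a", "# c", "", " b "))); // prints [A, B]
    }
}
```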

how to map and filter in one step?

2015-02-26 Thread Crystal Xing
Hi, I have a text file input and I want to parse it line by line and map each line to another format. But at the same time, I want to filter out some lines I do not need. I wonder if there is a way to filter out those lines in the map function. Do I have to do it in two steps, filter and map? In that

Re: Is there a fast way to do fast top N product recommendations for all users

2015-02-12 Thread Crystal Xing
this is not something to do, if you can avoid it architecturally. For example, consider precomputing recommendations only for users whose probability of needing recommendations soon is not very small. Usually, only a small number of users are active. On Thu, Feb 12, 2015 at 10:26 PM, Crystal Xing crystalxin

Is there a fast way to do fast top N product recommendations for all users

2015-02-12 Thread Crystal Xing
Hi, I wonder if there is a way to do fast top N product recommendations for all users in training using mllib's ALS algorithm. I am currently calling public Rating <http://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/mllib/recommendation/Rating.html>[] recommendProducts(int user,

Re: Question about mllib als's implicit training

2015-02-12 Thread Crystal Xing
no interaction user_product pairs ? On Thu, Feb 12, 2015 at 3:13 PM, Sean Owen so...@cloudera.com wrote: Where there is no user-item interaction, you provide no interaction, not an interaction with strength 0. Otherwise your input is fully dense. On Thu, Feb 12, 2015 at 11:09 PM, Crystal
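
For context, a sketch of the formulation in the paper this thread discusses (Hu, Koren and Volinsky, "Collaborative Filtering for Implicit Feedback Datasets"), written from memory of the paper. The symbols are the paper's notation, not MLlib parameter names: r_{ui} is the observed implicit strength, p_{ui} the binary preference, c_{ui} the confidence, and x_u, y_i the user and item factors.

```latex
p_{ui} =
\begin{cases}
  1 & r_{ui} > 0 \\
  0 & r_{ui} = 0
\end{cases},
\qquad
c_{ui} = 1 + \alpha\, r_{ui}

\min_{x_*,\, y_*} \sum_{u,i} c_{ui} \left( p_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
```

The sum runs over all user-item pairs, but a pair with no interaction has r_{ui} = 0, hence p_{ui} = 0 and c_{ui} = 1. Those terms need no input rows of their own, which fits the advice above to supply no interaction instead of an interaction with strength 0.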

Re: Question about mllib als's implicit training

2015-02-12 Thread Crystal Xing
but it's all taken care of by the implementation. On Thu, Feb 12, 2015 at 11:29 PM, Crystal Xing crystalxin...@gmail.com wrote: Hi Sean, I am reading the paper on implicit training, "Collaborative Filtering for Implicit Feedback Datasets". It mentioned: To this end, let us introduce

Question about mllib als's implicit training

2015-02-12 Thread Crystal Xing
Hi, I have some implicit rating data, such as purchasing data. I read the paper about the implicit training algorithm used in Spark, and it mentioned that for user-product pairs which do not have implicit rating data, such as no purchase, we need to provide the value as 0. This is different