Re: Spark distinct() returns incorrect results for some types?
I see; that makes a lot of sense now. It is not unique to Spark, but it would be great if it were mentioned in the Spark documentation. I have been using Hadoop for a while and I was not aware of it! Zheng zheng

On Thu, Jun 11, 2015 at 7:21 PM, Will Briggs wrbri...@gmail.com wrote: To be fair, this is a long-standing issue due to optimizations for object reuse in the Hadoop API, and isn't necessarily a failing in Spark - see this blog post (https://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/) from 2011 documenting a similar issue.

On June 11, 2015, at 3:17 PM, Sean Owen so...@cloudera.com wrote: Yep, you need to use a transformation of the raw value; use toString, for example.

On Thu, Jun 11, 2015, 8:54 PM Crystal Xing crystalxin...@gmail.com wrote: That is a little scary. So you mean, in general, we shouldn't use Hadoop's Writables as keys in an RDD? Zheng zheng

On Thu, Jun 11, 2015 at 6:44 PM, Sean Owen so...@cloudera.com wrote: Guess: it has something to do with the Text object being reused by Hadoop? You can't in general keep references to them, since they change. So you may end up with a bunch of copies of one object that become just one per partition.

On Thu, Jun 11, 2015, 8:36 PM Crystal Xing crystalxin...@gmail.com wrote: I load a list of ids from a text file as NLineInputFormat, and when I call distinct(), it returns the wrong count:

JavaRDD<Text> idListData = jvc.hadoopFile(idList, NLineInputFormat.class, LongWritable.class, Text.class).values().distinct();

I should have 7000K distinct values; however, it only returns 7000 values, which is the same as the number of tasks. The type I am using is org.apache.hadoop.io.Text. However, if I switch to String instead of Text, it works correctly. I would think the Text class has correct implementations of equals() and hashCode(), since it is a Hadoop class. Does anyone have a clue what is going on? I am using Spark 1.2. Zheng zheng
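Sean's guess can be demonstrated without Spark or Hadoop at all. The sketch below uses a hypothetical `ReusedBuffer` class standing in for `org.apache.hadoop.io.Text`: one mutable object is reused for every record, so a collection of references to it collapses to a single element when deduplicated, which is exactly the "distinct count equals number of tasks" symptom.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

// Stand-in for org.apache.hadoop.io.Text: a mutable holder that the record
// reader reuses for every record. Hypothetical class, for illustration only.
class ReusedBuffer {
    private String value = "";
    void set(String v) { value = v; }
    @Override public boolean equals(Object o) {
        return o instanceof ReusedBuffer && ((ReusedBuffer) o).value.equals(value);
    }
    @Override public int hashCode() { return value.hashCode(); }
}

public class ReusePitfall {
    public static void main(String[] args) {
        ReusedBuffer buf = new ReusedBuffer();     // one object, reused per record
        List<ReusedBuffer> kept = new ArrayList<>();
        for (String id : new String[] {"a", "b", "c"}) {
            buf.set(id);   // the "reader" overwrites the same buffer
            kept.add(buf); // we keep a reference, not a copy
        }
        // All three list entries are the same object (now holding "c"),
        // so deduplicating them yields a single element.
        System.out.println(new HashSet<>(kept).size());  // prints 1
    }
}
```

Note that `equals()` and `hashCode()` are implemented correctly here, just as they are in `Text`; the problem is purely the shared mutable reference.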
Spark distinct() returns incorrect results for some types?
I load a list of ids from a text file as NLineInputFormat, and when I call distinct(), it returns the wrong count:

JavaRDD<Text> idListData = jvc.hadoopFile(idList, NLineInputFormat.class, LongWritable.class, Text.class).values().distinct();

I should have 7000K distinct values; however, it only returns 7000 values, which is the same as the number of tasks. The type I am using is org.apache.hadoop.io.Text. However, if I switch to String instead of Text, it works correctly. I would think the Text class has correct implementations of equals() and hashCode(), since it is a Hadoop class. Does anyone have a clue what is going on? I am using Spark 1.2. Zheng zheng
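As the replies below explain, the fix is to copy each reused buffer to an immutable value before deduplicating - in Spark terms, something like `.values().map(Text::toString).distinct()` instead of calling `distinct()` on the `Text` objects directly. The same idea in plain Java, with a hypothetical `MutableId` class in place of the reused Hadoop `Text`:

```java
import java.util.HashSet;
import java.util.Set;

// Stand-in for the reused Hadoop Text buffer (hypothetical, for illustration).
class MutableId {
    private final StringBuilder sb = new StringBuilder();
    void set(String v) { sb.setLength(0); sb.append(v); }
    @Override public String toString() { return sb.toString(); }
}

public class DistinctFix {
    public static void main(String[] args) {
        MutableId buf = new MutableId();   // one buffer, overwritten per record
        Set<String> distinct = new HashSet<>();
        for (String id : new String[] {"a", "b", "b", "c"}) {
            buf.set(id);
            // toString() produces an immutable copy, so later mutations of
            // the buffer cannot retroactively change what the set holds.
            distinct.add(buf.toString());
        }
        System.out.println(distinct.size());  // prints 3
    }
}
```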
Re: Spark distinct() returns incorrect results for some types?
That is a little scary. So you mean, in general, we shouldn't use Hadoop's Writables as keys in an RDD? Zheng zheng

On Thu, Jun 11, 2015 at 6:44 PM, Sean Owen so...@cloudera.com wrote: Guess: it has something to do with the Text object being reused by Hadoop? You can't in general keep references to them, since they change. So you may end up with a bunch of copies of one object that become just one per partition.

On Thu, Jun 11, 2015, 8:36 PM Crystal Xing crystalxin...@gmail.com wrote: I load a list of ids from a text file as NLineInputFormat, and when I call distinct(), it returns the wrong count:

JavaRDD<Text> idListData = jvc.hadoopFile(idList, NLineInputFormat.class, LongWritable.class, Text.class).values().distinct();

I should have 7000K distinct values; however, it only returns 7000 values, which is the same as the number of tasks. The type I am using is org.apache.hadoop.io.Text. However, if I switch to String instead of Text, it works correctly. I would think the Text class has correct implementations of equals() and hashCode(), since it is a Hadoop class. Does anyone have a clue what is going on? I am using Spark 1.2. Zheng zheng
Re: how to map and filter in one step?
I see. The reason we can use flatMap, but not map, to map a line to nothing is that flatMap supports zero-or-more outputs per input, while map only supports a 1-to-1 mapping? It seems flatMap is the closer equivalent of Hadoop's map. Thanks, Zheng zheng

On Thu, Feb 26, 2015 at 10:44 AM, Sean Owen so...@cloudera.com wrote: You can flatMap:

rdd.flatMap { in =>
  if (condition(in)) {
    Some(transformation(in))
  } else {
    None
  }
}

On Thu, Feb 26, 2015 at 6:39 PM, Crystal Xing crystalxin...@gmail.com wrote: Hi, I have a text file as input, and I want to parse it line by line, mapping each line to another format. At the same time, I want to filter out some lines I do not need. I wonder if there is a way to filter out those lines in the map function, or do I have to do it in two steps, filter and map? That way, I would have to scan and parse the lines twice. If I map the unwanted lines to null and filter out the nulls, will that work? I have never tried it. Thanks, Zheng zheng
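The same map-or-drop pattern exists in plain `java.util.stream`, where returning an empty stream from flatMap plays the role of Scala's `None`. A minimal sketch; `condition` and `transformation` are placeholder names standing in for the user's actual parsing logic:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FlatMapFilter {
    // Placeholder predicate and transformation for illustration.
    static boolean condition(String line) { return !line.isEmpty(); }
    static String transformation(String line) { return line.toUpperCase(); }

    public static void main(String[] args) {
        List<String> lines = List.of("keep", "", "these");
        // One pass: emit a one-element stream to keep a line,
        // an empty stream to drop it.
        List<String> result = lines.stream()
            .flatMap(l -> condition(l) ? Stream.of(transformation(l))
                                       : Stream.empty())
            .collect(Collectors.toList());
        System.out.println(result);  // prints [KEEP, THESE]
    }
}
```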
how to map and filter in one step?
Hi, I have a text file as input, and I want to parse it line by line, mapping each line to another format. At the same time, I want to filter out some lines I do not need. I wonder if there is a way to filter out those lines in the map function, or do I have to do it in two steps, filter and map? That way, I would have to scan and parse the lines twice. If I map the unwanted lines to null and filter out the nulls, will that work? I have never tried it. Thanks, Zheng zheng
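The map-to-null-then-filter idea does work mechanically, and it is worth noting that in Spark a map followed by a filter on the same RDD is pipelined within one stage, so the data is not actually scanned twice. A plain-Java stream sketch of the pattern; `parse` is a made-up placeholder for the user's parsing logic:

```java
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

public class MapNullFilter {
    // Placeholder parse: returns null for lines we want to drop.
    static String parse(String line) { return line.isEmpty() ? null : line.trim(); }

    public static void main(String[] args) {
        List<String> lines = List.of(" a ", "", " b ");
        List<String> result = lines.stream()
            .map(MapNullFilter::parse)   // map unwanted lines to null
            .filter(Objects::nonNull)    // then drop the nulls
            .collect(Collectors.toList());
        System.out.println(result);  // prints [a, b]
    }
}
```

flatMap expresses the same intent in one step and avoids null handling, which is why the replies above prefer it.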
Re: Is there a fast way to do fast top N product recommendations for all users
Thanks, Sean! Glad to know it will be in a future release.

On Thu, Feb 12, 2015 at 2:45 PM, Sean Owen so...@cloudera.com wrote: Not now, but see https://issues.apache.org/jira/browse/SPARK-3066. As an aside, it's quite expensive to make recommendations for all users. IMHO this is not something to do if you can avoid it architecturally. For example, consider precomputing recommendations only for users whose probability of needing recommendations soon is not very small. Usually, only a small number of users are active.

On Thu, Feb 12, 2015 at 10:26 PM, Crystal Xing crystalxin...@gmail.com wrote: Hi, I wonder if there is a fast way to do top-N product recommendations for all users in training using MLlib's ALS algorithm. I am currently calling the public Rating[] recommendProducts(int user, int num) method on MatrixFactorizationModel for users one by one, and it is quite slow, since it does not operate on RDD input. I also tried to generate all possible user-product pairs and use public JavaRDD<Rating> predict(JavaPairRDD<Integer, Integer> usersProducts) to fill out the matrix. Since I have a large number of users and products, the job gets stuck transforming all the pairs. I wonder if there is a better way to do this. Thanks, Crystal.
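Under the hood, batch recommendation from a factored model is just scoring each (user, item) pair with a dot product of the two factor vectors and taking the top N per user. A small plain-Java sketch of that math, with toy factor matrices rather than MLlib's API:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.stream.IntStream;

public class TopNRecs {
    // score(u, i) = dot(userFactors[u], itemFactors[i])
    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int k = 0; k < a.length; k++) s += a[k] * b[k];
        return s;
    }

    // Indices of the top-n items for one user, by predicted score (descending).
    static int[] topN(double[] userVec, double[][] itemFactors, int n) {
        return IntStream.range(0, itemFactors.length).boxed()
            .sorted(Comparator.comparingDouble(
                (Integer i) -> -dot(userVec, itemFactors[i])))
            .limit(n)
            .mapToInt(Integer::intValue)
            .toArray();
    }

    public static void main(String[] args) {
        double[][] userFactors = {{1.0, 0.0}, {0.0, 1.0}};              // 2 users, rank 2
        double[][] itemFactors = {{0.9, 0.1}, {0.1, 0.9}, {0.5, 0.5}};  // 3 items
        for (double[] u : userFactors) {
            System.out.println(Arrays.toString(topN(u, itemFactors, 2)));
        }
        // prints [0, 2] then [1, 2]
    }
}
```

Doing this for all users is a full dense matrix product, which is why Sean's advice is to restrict it to the users who will actually need recommendations soon.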
Is there a fast way to do fast top N product recommendations for all users
Hi, I wonder if there is a fast way to do top-N product recommendations for all users in training using MLlib's ALS algorithm. I am currently calling the public Rating[] recommendProducts(int user, int num) method on MatrixFactorizationModel for users one by one, and it is quite slow, since it does not operate on RDD input. I also tried to generate all possible user-product pairs and use public JavaRDD<Rating> predict(JavaPairRDD<Integer, Integer> usersProducts) to fill out the matrix. Since I have a large number of users and products, the job gets stuck transforming all the pairs. I wonder if there is a better way to do this. Thanks, Crystal.
Re: Question about mllib als's implicit training
Hi Sean, I am reading the paper on implicit training, Collaborative Filtering for Implicit Feedback Datasets (http://labs.yahoo.com/files/HuKorenVolinsky-ICDM08.pdf). It says: "To this end, let us introduce a set of binary variables p_ui, which indicates the preference of user u for item i. The p_ui values are derived by binarizing the r_ui values: p_ui = 1 if r_ui > 0, and p_ui = 0 if r_ui = 0." If, for user-item pairs without interactions, I do not include them in the training data, then all the r_ui will be > 0 and all the p_ui will always be 1? Or does MLlib's implementation automatically take care of those no-interaction user-product pairs?

On Thu, Feb 12, 2015 at 3:13 PM, Sean Owen so...@cloudera.com wrote: Where there is no user-item interaction, you provide no interaction, not an interaction with strength 0. Otherwise your input is fully dense.

On Thu, Feb 12, 2015 at 11:09 PM, Crystal Xing crystalxin...@gmail.com wrote: Hi, I have some implicit rating data, such as purchase data. I read the paper about the implicit training algorithm used in Spark, and it mentioned that for user-product pairs which have no implicit rating data, such as no purchase, we need to provide the value 0. This is different from explicit training, where for a user-product pair without a rating we just leave the pair out of the training data instead of adding it with rating 0. Am I understanding this correctly? Or, for the implicit training implementation in Spark, will the missing data automatically be treated as zero, so we do not need to add it to the training data set? Thanks, Crystal.
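To make the paper's notation concrete: only observed pairs carry r_ui > 0; a pair absent from the input implicitly has r_ui = 0, hence p_ui = 0, and it contributes only the baseline confidence c_ui = 1 + alpha * r_ui = 1. A toy sketch (the alpha value and ratings here are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class ImplicitPrefs {
    // Binarized preference from the paper: p_ui = 1 if r_ui > 0, else 0.
    static int pref(double rui) { return rui > 0 ? 1 : 0; }

    // Confidence weight from the paper: c_ui = 1 + alpha * r_ui.
    static double confidence(double rui, double alpha) { return 1.0 + alpha * rui; }

    public static void main(String[] args) {
        double alpha = 40.0;  // confidence scaling; a made-up choice for this sketch
        // Sparse observed interactions: "user:item" -> r_ui.
        // Pairs absent from the map implicitly have r_ui = 0.
        Map<String, Double> r = new HashMap<>();
        r.put("u1:i1", 3.0);
        r.put("u1:i2", 1.0);

        for (String pair : new String[] {"u1:i1", "u1:i2", "u1:i3"}) {
            double rui = r.getOrDefault(pair, 0.0);  // never-observed pair => 0
            System.out.println(pair + " p=" + pref(rui)
                    + " c=" + confidence(rui, alpha));
        }
        // u1:i3 was never observed: p = 0 and c = 1 (minimum confidence),
        // without the caller ever materializing the pair -- which is Sean's point.
    }
}
```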
Re: Question about mllib als's implicit training
Many thanks! On Thu, Feb 12, 2015 at 3:31 PM, Sean Owen so...@cloudera.com wrote: This all describes how the implementation operates, logically. The matrix P is never formed, for sure, and certainly not by the caller. The implementation actually extends to handle negative values in R too, but it's all taken care of by the implementation.

On Thu, Feb 12, 2015 at 11:29 PM, Crystal Xing crystalxin...@gmail.com wrote: Hi Sean, I am reading the paper on implicit training, Collaborative Filtering for Implicit Feedback Datasets. It says: "To this end, let us introduce a set of binary variables p_ui, which indicates the preference of user u for item i. The p_ui values are derived by binarizing the r_ui values: p_ui = 1 if r_ui > 0, and p_ui = 0 if r_ui = 0." If, for user-item pairs without interactions, I do not include them in the training data, then all the r_ui will be > 0 and all the p_ui will always be 1? Or does MLlib's implementation automatically take care of those no-interaction user-product pairs?

On Thu, Feb 12, 2015 at 3:13 PM, Sean Owen so...@cloudera.com wrote: Where there is no user-item interaction, you provide no interaction, not an interaction with strength 0. Otherwise your input is fully dense.

On Thu, Feb 12, 2015 at 11:09 PM, Crystal Xing crystalxin...@gmail.com wrote: Hi, I have some implicit rating data, such as purchase data. I read the paper about the implicit training algorithm used in Spark, and it mentioned that for user-product pairs which have no implicit rating data, such as no purchase, we need to provide the value 0. This is different from explicit training, where for a user-product pair without a rating we just leave the pair out of the training data instead of adding it with rating 0. Am I understanding this correctly? Or, for the implicit training implementation in Spark, will the missing data automatically be treated as zero, so we do not need to add it to the training data set? Thanks, Crystal.
Question about mllib als's implicit training
Hi, I have some implicit rating data, such as purchase data. I read the paper about the implicit training algorithm used in Spark, and it mentioned that for user-product pairs which have no implicit rating data, such as no purchase, we need to provide the value 0. This is different from explicit training, where for a user-product pair without a rating we just leave the pair out of the training data instead of adding it with rating 0. Am I understanding this correctly? Or, for the implicit training implementation in Spark, will the missing data automatically be treated as zero, so we do not need to add it to the training data set? Thanks, Crystal.
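For reference, the objective that implicit-feedback ALS minimizes, per the Hu, Koren and Volinsky paper discussed in this thread. The sum formally ranges over all user-item pairs, but every unobserved pair shares p_ui = 0 and c_ui = 1, which is why the caller never needs to materialize them:

```latex
\min_{x_*,\, y_*} \; \sum_{u,i} c_{ui}\,\bigl(p_{ui} - x_u^{\top} y_i\bigr)^2
  \;+\; \lambda \Bigl( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \Bigr),
\qquad c_{ui} = 1 + \alpha\, r_{ui}
```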