Re: Spark distinct() returns incorrect results for some types?

2015-06-11 Thread Crystal Xing
I see. It makes a lot of sense now. It is not unique to Spark, but it would
be great if it were mentioned in the Spark documentation.

I have been using Hadoop for a while and I was not aware of it!


Zheng zheng

On Thu, Jun 11, 2015 at 7:21 PM, Will Briggs wrbri...@gmail.com wrote:

 To be fair, this is a long-standing issue due to optimizations for object
 reuse in the Hadoop API, and isn't necessarily a failing in Spark - see
 this blog post (
 https://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/)
 from 2011 documenting a similar issue.


 On June 11, 2015, at 3:17 PM, Sean Owen so...@cloudera.com wrote:


 Yep, you need to use a transformation of the raw value; use toString, for
 example.
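 
 For illustration, a minimal sketch of that workaround (assuming Java 8
 lambdas and the jvc/idList variables from the snippet quoted below; not the
 only way to do it):
 
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapred.lib.NLineInputFormat;
 import org.apache.spark.api.java.JavaRDD;
 
 // Convert each reused Text instance into an immutable String before
 // distinct(), so every record keeps its own value across the shuffle.
 JavaRDD<String> ids = jvc
     .hadoopFile(idList, NLineInputFormat.class,
                 LongWritable.class, Text.class)
     .values()
     .map(t -> t.toString())
     .distinct();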

 On Thu, Jun 11, 2015, 8:54 PM Crystal Xing crystalxin...@gmail.com
 wrote:

 That is a little scary.
 So you mean that, in general, we shouldn't use Hadoop's Writables as keys
 in an RDD?

 Zheng zheng

 On Thu, Jun 11, 2015 at 6:44 PM, Sean Owen so...@cloudera.com wrote:

 Guess: it has something to do with the Text object being reused by
 Hadoop? You can't in general keep around refs to them since they change. So
 you may have a bunch of copies of one object at the end that become just
 one in each partition.
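 
 A tiny standalone sketch of what that reuse looks like (plain Java,
 illustrative names only): the record reader overwrites one object, so
 anything that buffers references ends up with many pointers to the same,
 last-written value.
 
 import java.util.ArrayList;
 import java.util.List;
 import org.apache.hadoop.io.Text;
 
 Text reused = new Text();
 List<Text> buffered = new ArrayList<>();
 for (String line : new String[] {"a", "b", "c"}) {
     reused.set(line);      // the same instance is overwritten for each record
     buffered.add(reused);  // keeping the reference, as buffering would
 }
 // buffered now holds three references to one Text whose value is "c"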

 On Thu, Jun 11, 2015, 8:36 PM Crystal Xing crystalxin...@gmail.com
 wrote:

 I load a list of ids from a text file using NLineInputFormat, and when I
 do distinct(), it returns an incorrect number.
  JavaRDD<Text> idListData = jvc
          .hadoopFile(idList, NLineInputFormat.class,
                      LongWritable.class, Text.class)
          .values()
          .distinct();


 I should have 7000K distinct values; however, it only returns 7000
 values, which is the same as the number of tasks. The type I am using is
 import org.apache.hadoop.io.Text;


 However, if I switch to using String instead of Text, it works correctly.

 I think the Text class should have correct implementations of equals()
 and hashCode() since it is a Hadoop class.

 Does anyone have a clue what is going on?

 I am using Spark 1.2.

 Zheng zheng






Spark distinct() returns incorrect results for some types?

2015-06-11 Thread Crystal Xing
I load a list of ids from a text file using NLineInputFormat, and when I do
distinct(), it returns an incorrect number.
 JavaRDD<Text> idListData = jvc
         .hadoopFile(idList, NLineInputFormat.class,
                     LongWritable.class, Text.class)
         .values()
         .distinct();


I should have 7000K distinct values; however, it only returns 7000 values,
which is the same as the number of tasks. The type I am using is
import org.apache.hadoop.io.Text;


However, if I switch to using String instead of Text, it works correctly.

I think the Text class should have correct implementations of equals() and
hashCode() since it is a Hadoop class.

Does anyone have a clue what is going on?

I am using Spark 1.2.

Zheng zheng


Re: Spark distinct() returns incorrect results for some types?

2015-06-11 Thread Crystal Xing
That is a little scary.
 So you mean that, in general, we shouldn't use Hadoop's Writables as keys in an RDD?

Zheng zheng

On Thu, Jun 11, 2015 at 6:44 PM, Sean Owen so...@cloudera.com wrote:

 Guess: it has something to do with the Text object being reused by Hadoop?
 You can't in general keep around refs to them since they change. So you may
 have a bunch of copies of one object at the end that become just one in
 each partition.

 On Thu, Jun 11, 2015, 8:36 PM Crystal Xing crystalxin...@gmail.com
 wrote:

 I load a list of ids from a text file using NLineInputFormat, and when I
 do distinct(), it returns an incorrect number.
  JavaRDD<Text> idListData = jvc
          .hadoopFile(idList, NLineInputFormat.class,
                      LongWritable.class, Text.class)
          .values()
          .distinct();


 I should have 7000K distinct values; however, it only returns 7000
 values, which is the same as the number of tasks. The type I am using is
 import org.apache.hadoop.io.Text;


 However, if I switch to using String instead of Text, it works correctly.

 I think the Text class should have correct implementations of equals() and
 hashCode() since it is a Hadoop class.

 Does anyone have a clue what is going on?

 I am using Spark 1.2.

 Zheng zheng





Re: how to map and filter in one step?

2015-02-26 Thread Crystal Xing
I see.
The reason we can use flatMap to drop a record, but cannot do the same by
mapping it to null with map, is that flatMap supports mapping to zero or
more elements, while map only supports a 1-to-1 mapping?

It seems flatMap is more equivalent to Hadoop's map.
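
For reference, a rough Java sketch of doing the parse and the filter in one
pass with flatMap on the Spark 1.x Java API (where the flatMap function
returns an Iterable); lines, isWanted, parse, and Record are hypothetical
placeholders:

import java.util.Collections;
import org.apache.spark.api.java.JavaRDD;

// Emit zero elements for lines to drop, one element otherwise.
JavaRDD<Record> parsed = lines.flatMap(line ->
        isWanted(line)
                ? Collections.singletonList(parse(line))
                : Collections.<Record>emptyList());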


Thanks,

Zheng zheng

On Thu, Feb 26, 2015 at 10:44 AM, Sean Owen so...@cloudera.com wrote:

 You can flatMap:

 rdd.flatMap { in =>
   if (condition(in)) {
     Some(transformation(in))
   } else {
     None
   }
 }

 On Thu, Feb 26, 2015 at 6:39 PM, Crystal Xing crystalxin...@gmail.com
 wrote:
  Hi,
  I have a text file input, and I want to parse it line by line and map each
  line to another format. But at the same time, I want to filter out some
  lines I do not need.
 
  I wonder if there is a way to filter out those lines in the map function.
 
  Do I have to do two steps, filter and map? In that way, I have to scan and
  parse the lines twice in order to filter and map.
 
  If I map those unwanted lines to null and filter out the nulls, will that
  work? I have never tried it yet.
 
  Thanks,
 
  Zheng zheng



how to map and filter in one step?

2015-02-26 Thread Crystal Xing
Hi,
I have a text file input, and I want to parse it line by line and map each
line to another format. But at the same time, I want to filter out some lines
I do not need.

I wonder if there is a way to filter out those lines in the map function.

Do I have to do two steps, filter and map? In that way, I have to scan and
parse the lines twice in order to filter and map.

If I map those unwanted lines to null and filter out the nulls, will that
work? I have never tried it yet.

Thanks,

Zheng zheng


Re: Is there a fast way to do fast top N product recommendations for all users

2015-02-12 Thread Crystal Xing
Thanks, Sean! Glad to know it will be in a future release.

On Thu, Feb 12, 2015 at 2:45 PM, Sean Owen so...@cloudera.com wrote:

 Not now, but see https://issues.apache.org/jira/browse/SPARK-3066

 As an aside, it's quite expensive to make recommendations for all
 users. IMHO this is not something to do, if you can avoid it
 architecturally. For example, consider precomputing recommendations
 only for users whose probability of needing recommendations soon is
 not very small. Usually, only a small number of users are active.

 On Thu, Feb 12, 2015 at 10:26 PM, Crystal Xing crystalxin...@gmail.com
 wrote:
  Hi,
 
 
  I wonder if there is a way to do fast top N product recommendations for all
  users in the training set using MLlib's ALS algorithm.
 
  I am currently calling
 
  public Rating[] recommendProducts(int user, int num)

  method in MatrixFactorizationModel for users one by one,
  and it is quite slow since it does not operate on RDD input.
 
  I also tried to generate all possible user-product pairs and use

  public JavaRDD<Rating> predict(JavaPairRDD<Integer, Integer> usersProducts)

  to fill out the matrix. Since I have a large number of users and products,
  the job gets stuck transforming all the pairs.
 
 
  I wonder if there is a better way to do this.
 
  Thanks,
 
  Crystal.



Is there a fast way to do fast top N product recommendations for all users

2015-02-12 Thread Crystal Xing
Hi,


I wonder if there is a way to do fast top N product recommendations for all
users in the training set using MLlib's ALS algorithm.

I am currently calling

public Rating[] recommendProducts(int user, int num)

method in MatrixFactorizationModel for users one by one,
and it is quite slow since it does not operate on RDD input.
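
Roughly what the one-user-at-a-time pattern looks like (a sketch; userIds,
model, and the output handling are placeholders, and the loop runs on the
driver, which is why it gets slow for many users):

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;

for (int userId : userIds) {
    // Top 10 recommendations for this user, computed on the driver.
    Rating[] topN = model.recommendProducts(userId, 10);
    // ... store or emit topN ...
}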

I also tried to generate all possible user-product pairs and use

public JavaRDD<Rating> predict(JavaPairRDD<Integer, Integer> usersProducts)

to fill out the matrix. Since I have a large number of users and products,
the job gets stuck transforming all the pairs.


I wonder if there is a better way to do this.

Thanks,

Crystal.


Re: Question about mllib als's implicit training

2015-02-12 Thread Crystal Xing
Hi Sean,

I am reading the paper on implicit training,
Collaborative Filtering for Implicit Feedback Datasets
(http://labs.yahoo.com/files/HuKorenVolinsky-ICDM08.pdf).

It mentioned:

To this end, let us introduce a set of binary variables p_ui, which
indicates the preference of user u to item i. The p_ui values are derived
by binarizing the r_ui values:
p_ui = 1 if r_ui > 0
and
p_ui = 0 if r_ui = 0




If, for user-item pairs without interactions, I do not include them in the
training data, then all the r_ui will be > 0 and all the p_ui will always be 1?
Or does MLlib's implementation automatically take care of those
no-interaction user-product pairs?


On Thu, Feb 12, 2015 at 3:13 PM, Sean Owen so...@cloudera.com wrote:

 Where there is no user-item interaction, you provide no interaction,
 not an interaction with strength 0. Otherwise your input is fully
 dense.
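 
 For illustration, a minimal sketch of building the input that way in Java
 (purchases, its fields, and the rank/iterations/lambda/alpha values are
 placeholders): pairs with no interaction are simply absent from the RDD.
 
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.mllib.recommendation.ALS;
 import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
 import org.apache.spark.mllib.recommendation.Rating;
 
 // One Rating per observed interaction only (e.g. per purchase).
 JavaRDD<Rating> observed = purchases.map(p ->
         new Rating(p.userId, p.productId, p.count));
 MatrixFactorizationModel model =
         ALS.trainImplicit(observed.rdd(), 10, 10, 0.01, 1.0);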

 On Thu, Feb 12, 2015 at 11:09 PM, Crystal Xing crystalxin...@gmail.com
 wrote:
  Hi,
 
   I have some implicit rating data, such as purchasing data. I read the
   paper about the implicit training algorithm used in Spark, and it
   mentioned that for user-product pairs which do not have implicit rating
   data, such as no purchase, we need to provide the value 0.
 
   This is different from explicit training, where, for a user-product pair
   without a rating, we simply do not include it in the training data instead
   of adding it with a rating of 0.
 
   Am I understanding this correctly?
 
   Or, for the implicit training implementation in Spark, will the missing
   data be automatically filled in as zero, so we do not need to add it to
   the training data set?
 
  Thanks,
 
  Crystal.



Re: Question about mllib als's implicit training

2015-02-12 Thread Crystal Xing
Many thanks!

On Thu, Feb 12, 2015 at 3:31 PM, Sean Owen so...@cloudera.com wrote:

 This all describes how the implementation operates, logically. The
 matrix P is never formed, for sure, certainly not by the caller.

 The implementation actually extends to handle negative values in R too
 but it's all taken care of by the implementation.

 On Thu, Feb 12, 2015 at 11:29 PM, Crystal Xing crystalxin...@gmail.com
 wrote:
  Hi Sean,
 
  I am reading the paper on implicit training,
 
  Collaborative Filtering for Implicit Feedback Datasets
 
  It mentioned
 
  To this end, let us introduce a set of binary variables p_ui, which
  indicates the preference of user u to item i. The p_ui values are derived
  by binarizing the r_ui values:
  p_ui = 1 if r_ui > 0
  and
  p_ui = 0 if r_ui = 0
 
 
  If, for user-item pairs without interactions, I do not include them in the
  training data, then all the r_ui will be > 0 and all the p_ui will always
  be 1?
  Or does MLlib's implementation automatically take care of those
  no-interaction user-product pairs?
 
 
  On Thu, Feb 12, 2015 at 3:13 PM, Sean Owen so...@cloudera.com wrote:
 
  Where there is no user-item interaction, you provide no interaction,
  not an interaction with strength 0. Otherwise your input is fully
  dense.
 
  On Thu, Feb 12, 2015 at 11:09 PM, Crystal Xing crystalxin...@gmail.com
 
  wrote:
   Hi,
  
    I have some implicit rating data, such as purchasing data. I read the
    paper about the implicit training algorithm used in Spark, and it
    mentioned that for user-product pairs which do not have implicit rating
    data, such as no purchase, we need to provide the value 0.
  
    This is different from explicit training, where, for a user-product pair
    without a rating, we simply do not include it in the training data
    instead of adding it with a rating of 0.
  
    Am I understanding this correctly?
  
    Or, for the implicit training implementation in Spark, will the missing
    data be automatically filled in as zero, so we do not need to add it to
    the training data set?
  
   Thanks,
  
   Crystal.
 
 



Question about mllib als's implicit training

2015-02-12 Thread Crystal Xing
Hi,

I have some implicit rating data, such as purchasing data. I read the paper
about the implicit training algorithm used in Spark, and it mentioned that
for user-product pairs which do not have implicit rating data, such as no
purchase, we need to provide the value 0.

This is different from explicit training, where, for a user-product pair
without a rating, we simply do not include it in the training data instead
of adding it with a rating of 0.

Am I understanding this correctly?

Or, for the implicit training implementation in Spark, will the missing data
be automatically filled in as zero, so we do not need to add it to the
training data set?

Thanks,

Crystal.