[jira] [Updated] (SPARK-20839) Incorrect Dynamic PageRank calculation

2017-05-22 Thread BahaaEddin AlAila (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-20839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

BahaaEddin AlAila updated SPARK-20839:
--
Description: 
Correct me if I am wrong, but I think there are three places where the PageRank
calculation is incorrect.

1st) In the vertexProgram function (line 318 of PageRank.scala in Spark 2.1.1):
val newPR = oldPR + (1.0 - resetProb) * msgSum
it should be
val newPR = resetProb + (1.0 - resetProb) * msgSum

2nd) In the message-sending part (line 336 of the same file):
Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
should be
Iterator((edge.dstId, edge.srcAttr._1 * edge.attr))
since we should be sending the source vertex's current PageRank multiplied by
the edge weight, not the vertex's delta.

3rd) The tol check (line 335) should compare against the absolute value of the
delta:
  if (edge.srcAttr._2 > tol) {
should be
  if (Math.abs(edge.srcAttr._2) > tol) {
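
For concreteness, here is a minimal sketch of the two Pregel functions with all
three proposed changes applied. It assumes the (pagerank, delta) vertex-attribute
pair that PageRank.scala's runUntilConvergence uses; the signatures are
paraphrased into a self-contained example, not copied from the file.

import org.apache.spark.graphx._

// Vertex program with the 1st change: the new rank is seeded with resetProb
// instead of being accumulated onto the old rank.
def vertexProgram(resetProb: Double)(
    id: VertexId, attr: (Double, Double), msgSum: Double): (Double, Double) = {
  val (oldPR, _) = attr
  val newPR = resetProb + (1.0 - resetProb) * msgSum
  (newPR, newPR - oldPR)
}

// Message sending with the 2nd and 3rd changes: gate on the magnitude of the
// delta, and send the current pagerank (._1) scaled by the edge weight.
def sendMessage(tol: Double)(
    edge: EdgeTriplet[(Double, Double), Double]): Iterator[(VertexId, Double)] = {
  if (Math.abs(edge.srcAttr._2) > tol) {
    Iterator((edge.dstId, edge.srcAttr._1 * edge.attr))
  } else {
    Iterator.empty
  }
}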
 


  was:
Correct me if I am wrong, but I think there are two places where the PageRank
calculation is incorrect.

1st) In the vertexProgram function (line 318 of PageRank.scala in Spark 2.1.1):
val newPR = oldPR + (1.0 - resetProb) * msgSum
it should be
val newPR = resetProb + (1.0 - resetProb) * msgSum

2nd) In the message-sending part (line 336 of the same file):
Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
should be
Iterator((edge.dstId, edge.srcAttr._1 * edge.attr))
since we should be sending the source vertex's current PageRank multiplied by
the edge weight, not the vertex's delta.




> Incorrect Dynamic PageRank calculation
> --
>
> Key: SPARK-20839
> URL: https://issues.apache.org/jira/browse/SPARK-20839
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.1.1
>Reporter: BahaaEddin AlAila
>






[jira] [Created] (SPARK-20839) Incorrect Dynamic PageRank calculation

2017-05-22 Thread BahaaEddin AlAila (JIRA)
BahaaEddin AlAila created SPARK-20839:
-

 Summary: Incorrect Dynamic PageRank calculation
 Key: SPARK-20839
 URL: https://issues.apache.org/jira/browse/SPARK-20839
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 2.1.1
Reporter: BahaaEddin AlAila


Correct me if I am wrong, but I think there are two places where the PageRank
calculation is incorrect.

1st) In the vertexProgram function (line 318 of PageRank.scala in Spark 2.1.1):
val newPR = oldPR + (1.0 - resetProb) * msgSum
it should be
val newPR = resetProb + (1.0 - resetProb) * msgSum

2nd) In the message-sending part (line 336 of the same file):
Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
should be
Iterator((edge.dstId, edge.srcAttr._1 * edge.attr))
since we should be sending the source vertex's current PageRank multiplied by
the edge weight, not the vertex's delta.








[jira] [Commented] (SPARK-19646) binaryRecords replicates records in scala API

2017-02-17 Thread BahaaEddin AlAila (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871597#comment-15871597 ]

BahaaEddin AlAila commented on SPARK-19646:
---

What's puzzling, though, is that I looked at pyspark's implementation of
binaryRecords, and it's just calling _jsc.binaryRecords and wrapping the result
in a pyspark RDD. So, if it is indeed calling the Scala implementation,
shouldn't pyspark have the same problem?
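
My guess (an assumption on my part, not verified against the source) is that
the Scala path reuses the record buffer handed out by the input format, while
the Python path copies the bytes when serializing them over to the Python
worker, which would mask the reuse. If that is the cause, cloning each array on
the Scala side would be a workaround, sketched below:

// Hypothetical workaround sketch: clone every record's byte array so that
// buffer reuse in the input format (if that is the cause) cannot alias records.
val copied = sc.binaryRecords("binary_file.bin", 3073).map(_.clone())
copied.take(3).foreach(rec => println(rec.take(8).mkString(",")))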

> binaryRecords replicates records in scala API
> -
>
> Key: SPARK-19646
> URL: https://issues.apache.org/jira/browse/SPARK-19646
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.0
>Reporter: BahaaEddin AlAila
>Assignee: Sean Owen
>
> The Scala sc.binaryRecords replicates one record across the entire set.
> For example, I am trying to load the CIFAR binary data, where in one big
> binary file each 3073-byte record represents a 32x32x3-byte image preceded
> by 1 byte for the label. The file resides on my local filesystem.
> .take(5) returns 5 records, all the same; .collect() returns 10,000 records,
> all the same.
> What is puzzling is that the pyspark one works perfectly even though
> underneath it is calling the Scala implementation.
> I have tested this on 2.1.0 and 2.0.0.






[jira] [Commented] (SPARK-19646) binaryRecords replicates records in scala API

2017-02-17 Thread BahaaEddin AlAila (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871595#comment-15871595 ]

BahaaEddin AlAila commented on SPARK-19646:
---

Thank you very much for the speedy fix!







[jira] [Commented] (SPARK-19646) binaryRecords replicates records in scala API

2017-02-17 Thread BahaaEddin AlAila (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871493#comment-15871493 ]

BahaaEddin AlAila commented on SPARK-19646:
---

Thank you very much for the quick reply.
All I did was the following in spark-shell:
val x = sc.binaryRecords("binary_file.bin", 3073)
val t = x.take(3)
t(0)
t(1)
t(2)
// all return the same array, even though they shouldn't be the same

In pyspark, I do the same:
x = sc.binaryRecords('binary_file.bin', 3073)
t = x.take(3)
t[0]
t[1]
t[2]
# different, legitimate results, verified manually as well
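
To rule out a printing artifact, the records can also be compared byte-for-byte
in spark-shell (a small sketch along the same lines as above):

// Sketch: check whether the taken records are byte-for-byte identical.
// Under the reported bug this prints true twice; with correct reads it should not.
val recs = sc.binaryRecords("binary_file.bin", 3073).take(3)
println(java.util.Arrays.equals(recs(0), recs(1)))
println(java.util.Arrays.equals(recs(1), recs(2)))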









[jira] [Created] (SPARK-19646) binaryRecords replicates records in scala API

2017-02-16 Thread BahaaEddin AlAila (JIRA)
BahaaEddin AlAila created SPARK-19646:
-

 Summary: binaryRecords replicates records in scala API
 Key: SPARK-19646
 URL: https://issues.apache.org/jira/browse/SPARK-19646
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0, 2.0.0
Reporter: BahaaEddin AlAila
Priority: Minor


The Scala sc.binaryRecords replicates one record across the entire set.
For example, I am trying to load the CIFAR binary data, where in one big
binary file each 3073-byte record represents a 32x32x3-byte image preceded by
1 byte for the label. The file resides on my local filesystem.
.take(5) returns 5 records, all the same; .collect() returns 10,000 records,
all the same.
What is puzzling is that the pyspark one works perfectly even though underneath
it is calling the Scala implementation.
I have tested this on 2.1.0 and 2.0.0.
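
For reference, splitting each record according to the layout described above
would look roughly like this (a sketch assuming the standard CIFAR-10 binary
format of 1 label byte followed by 32x32x3 = 3072 pixel bytes):

// Sketch: split each 3073-byte record into its label byte and pixel bytes.
val records = sc.binaryRecords("binary_file.bin", 3073)
val labeled = records.map { rec =>
  val label = rec(0) & 0xFF        // first byte is the class label
  val pixels = rec.slice(1, 3073)  // remaining 3072 bytes are the image
  (label, pixels)
}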


