[jira] [Updated] (SPARK-20839) Incorrect Dynamic PageRank calculation
[ https://issues.apache.org/jira/browse/SPARK-20839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

BahaaEddin AlAila updated SPARK-20839:
--------------------------------------

Description:

Correct me if I am wrong, but I think there are three places where the PageRank calculation is incorrect.

1st) In the vertex program (line 318 of PageRank.scala in Spark 2.1.1):

    val newPR = oldPR + (1.0 - resetProb) * msgSum

should be:

    val newPR = resetProb + (1.0 - resetProb) * msgSum

2nd) In the message-sending part (line 336 of the same file):

    Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))

should be:

    Iterator((edge.dstId, edge.srcAttr._1 * edge.attr))

since we should be sending the edge weight multiplied by the current PageRank of the source vertex, not the vertex's delta.

3rd) The tol check (line 335) should be against the absolute value of the delta:

    if (edge.srcAttr._2 > tol) {

should be:

    if (Math.abs(edge.srcAttr._2) > tol) {

was: the previous description, which listed only the first two of the three issues above.

> Incorrect Dynamic PageRank calculation
> --------------------------------------
>                  Key: SPARK-20839
>                  URL: https://issues.apache.org/jira/browse/SPARK-20839
>              Project: Spark
>           Issue Type: Bug
>           Components: GraphX
>     Affects Versions: 2.1.1
>             Reporter: BahaaEddin AlAila

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
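For concreteness, the two vertex-update rules under discussion can be put side by side in a standalone Scala sketch. This is an illustration only, not the actual GraphX source; the function names are invented:

```scala
// Standalone sketch of the two vertex-update rules discussed above.
// Function names are hypothetical; they are not GraphX APIs.

// Rule currently at PageRank.scala line 318. In a delta formulation, the
// incoming messages carry rank *changes*, which are damped and added to the
// running rank:
def updateCurrent(oldPR: Double, msgSum: Double, resetProb: Double): Double =
  oldPR + (1.0 - resetProb) * msgSum

// Rule proposed in the report: the classic absolute PageRank update, where
// the incoming messages carry full ranks:
def updateProposed(msgSum: Double, resetProb: Double): Double =
  resetProb + (1.0 - resetProb) * msgSum

// With resetProb = 0.15 and msgSum = 1.0, the proposed rule gives the
// familiar fixed point: 0.15 + 0.85 * 1.0 == 1.0
println(updateProposed(1.0, 0.15))
println(updateCurrent(1.0, 0.5, 0.15))
```

Which rule is right depends on whether the messages carry full ranks (srcAttr._1) or deltas (srcAttr._2), so the first two reported issues are two sides of the same formulation question.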
[jira] [Created] (SPARK-20839) Incorrect Dynamic PageRank calculation
BahaaEddin AlAila created SPARK-20839:
--------------------------------------

             Summary: Incorrect Dynamic PageRank calculation
                 Key: SPARK-20839
                 URL: https://issues.apache.org/jira/browse/SPARK-20839
             Project: Spark
          Issue Type: Bug
          Components: GraphX
    Affects Versions: 2.1.1
            Reporter: BahaaEddin AlAila

Correct me if I am wrong, but I think there are two places where the PageRank calculation is incorrect.

1st) In the vertex program (line 318 of PageRank.scala in Spark 2.1.1):

    val newPR = oldPR + (1.0 - resetProb) * msgSum

should be:

    val newPR = resetProb + (1.0 - resetProb) * msgSum

2nd) In the message-sending part (line 336 of the same file):

    Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))

should be:

    Iterator((edge.dstId, edge.srcAttr._1 * edge.attr))

since we should be sending the edge weight multiplied by the current PageRank of the source vertex, not the vertex's delta.
[jira] [Commented] (SPARK-19646) binaryRecords replicates records in scala API
[ https://issues.apache.org/jira/browse/SPARK-19646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871597#comment-15871597 ]

BahaaEddin AlAila commented on SPARK-19646:
-------------------------------------------

What's puzzling, though, is that I looked at pyspark's implementation of binaryRecords, and it just calls _jsc.binaryRecords and wraps the result in a pyspark RDD. So if it is indeed calling the scala implementation, shouldn't pyspark have the same problem?

> binaryRecords replicates records in scala API
> ---------------------------------------------
>                  Key: SPARK-19646
>                  URL: https://issues.apache.org/jira/browse/SPARK-19646
>              Project: Spark
>           Issue Type: Bug
>           Components: Spark Core
>     Affects Versions: 2.0.0, 2.1.0
>             Reporter: BahaaEddin AlAila
>             Assignee: Sean Owen
>
> The scala sc.binaryRecords replicates one record for the entire set.
> For example, I am trying to load the CIFAR binary data, where in one big
> binary file each 3073-byte record represents a 32x32x3-byte image plus
> 1 byte for the label. The file resides on my local filesystem.
> .take(5) returns 5 records, all the same; .collect() returns 10,000
> records, all the same.
> What is puzzling is that the pyspark one works perfectly, even though
> underneath it is calling the scala implementation.
> I have tested this on 2.1.0 and 2.0.0
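One plausible explanation (an assumption, not confirmed from the Spark sources here) is record-object reuse: the underlying Hadoop input format may refill a single byte buffer per record, so the scala API hands back aliases to one buffer, while shipping records over to Python forces a byte-for-byte copy of each one. A minimal Scala sketch of that aliasing-vs-copying difference:

```scala
// Sketch of record-object reuse: a reader that refills one shared buffer
// per record, as Hadoop-style input formats may do. Illustrative only.
val buf = new Array[Byte](4)
def readRecord(i: Int): Array[Byte] = { java.util.Arrays.fill(buf, i.toByte); buf }

// Collecting raw references (analogous to the scala API's behavior here):
// every element aliases the same buffer, so after reading, all "records"
// show only the last one read.
val aliased = (1 to 3).map(readRecord)

// Copying at a boundary (as serializing each record to Python would):
// each record survives as a distinct array.
val copied = (1 to 3).map(i => readRecord(i).clone())

println(aliased.map(_(0)))  // every element reflects the final fill
println(copied.map(_(0)))   // distinct records preserved
```

Under this assumption pyspark looks correct not because it avoids the scala code, but because the Py4J/serialization boundary happens to copy each record before the buffer is reused.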
[jira] [Commented] (SPARK-19646) binaryRecords replicates records in scala API
[ https://issues.apache.org/jira/browse/SPARK-19646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871595#comment-15871595 ]

BahaaEddin AlAila commented on SPARK-19646:
-------------------------------------------

Thank you very much for the speedy fix!

> binaryRecords replicates records in scala API
> ---------------------------------------------
>                  Key: SPARK-19646
>                  URL: https://issues.apache.org/jira/browse/SPARK-19646
>              Project: Spark
>           Issue Type: Bug
>           Components: Spark Core
>     Affects Versions: 2.0.0, 2.1.0
>             Reporter: BahaaEddin AlAila
>             Assignee: Sean Owen
[jira] [Commented] (SPARK-19646) binaryRecords replicates records in scala API
[ https://issues.apache.org/jira/browse/SPARK-19646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871493#comment-15871493 ]

BahaaEddin AlAila commented on SPARK-19646:
-------------------------------------------

Thank you very much for the quick reply. All I did was the following in spark-shell:

    val x = sc.binaryRecords("binary_file.bin", 3073)
    val t = x.take(3)
    t(0)
    t(1)
    t(2)
    // all return the same array, even though the records should differ

In pyspark, I do the same:

    x = sc.binaryRecords('binary_file.bin', 3073)
    t = x.take(3)
    t[0]
    t[1]
    t[2]
    # three different, legitimate results, verified manually as well

> binaryRecords replicates records in scala API
> ---------------------------------------------
>                  Key: SPARK-19646
>                  URL: https://issues.apache.org/jira/browse/SPARK-19646
>              Project: Spark
>           Issue Type: Bug
>           Components: Spark Core
>     Affects Versions: 2.0.0, 2.1.0
>             Reporter: BahaaEddin AlAila
>             Priority: Minor
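If the duplication does come from a reused record buffer (an assumption based on the discussion above), cloning each record before collecting should sidestep it. A sketch of the same spark-shell session with that defensive copy (requires a live SparkContext `sc`; the file name is illustrative, so this fragment is not runnable standalone):

    // Same session as above, with a defensive copy per record so no two
    // collected elements can alias the same underlying buffer.
    val x = sc.binaryRecords("binary_file.bin", 3073)
    val t = x.map(_.clone()).take(3)

This is a workaround sketch, not the upstream fix; the proper fix belongs inside binaryRecords itself.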
[jira] [Created] (SPARK-19646) binaryRecords replicates records in scala API
BahaaEddin AlAila created SPARK-19646:
--------------------------------------

             Summary: binaryRecords replicates records in scala API
                 Key: SPARK-19646
                 URL: https://issues.apache.org/jira/browse/SPARK-19646
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.1.0, 2.0.0
            Reporter: BahaaEddin AlAila
            Priority: Minor

The scala sc.binaryRecords replicates one record for the entire set.

For example, I am trying to load the CIFAR binary data, where in one big binary file each 3073-byte record represents a 32x32x3-byte image plus 1 byte for the label. The file resides on my local filesystem.

.take(5) returns 5 records, all the same; .collect() returns 10,000 records, all the same.

What is puzzling is that the pyspark one works perfectly, even though underneath it is calling the scala implementation.

I have tested this on 2.1.0 and 2.0.0.