[jira] [Updated] (SPARK-12030) Incorrect results when aggregate joined data
[ https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-12030:
---
    Labels: correctness  (was: )

> Incorrect results when aggregate joined data
>
>                 Key: SPARK-12030
>                 URL: https://issues.apache.org/jira/browse/SPARK-12030
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Maciej Bryński
>            Assignee: Nong Li
>            Priority: Blocker
>              Labels: correctness
>             Fix For: 1.5.3, 1.6.0
>
> I have the following issue.
> I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2)
> {code}
> t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache()
> t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache()
> joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer")
> {code}
> Important: both tables are cached, so results should be the same on every query.
> Then I did some counts:
> {code}
> t1.count() -> 5900729
> t1.registerTempTable("t1")
> sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729
> t2.count() -> 54298
> joined.count() -> 5900729
> {code}
> And here the magic begins - I counted distinct id1 from the joined table
> {code}
> joined.registerTempTable("joined")
> sqlCtx.sql("select distinct(id1) from joined").count()
> {code}
> Results vary *(are different on every run)* between 5899000 and 590 but are never equal to 5900729.
> In addition, I ran more queries:
> {code}
> sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > 1").collect()
> {code}
> This gives some results, but this query returns *1*
> {code}
> len(sqlCtx.sql("select * from joined where id1 = result").collect())
> {code}
> What's wrong?

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
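The counts in the description are suspicious because of an invariant that a left outer join should preserve: if id1 is unique in t1 and id2 is unique in t2, every t1 row appears exactly once in the result, so the row count, the distinct id1 count, and t1.count() must all agree, and no id1 can show up in a "having count(*) > 1" query. The sketch below illustrates that invariant in plain Python rather than Spark. The helper `left_outer_join` and the toy rows are hypothetical and only stand in for the reporter's tables; uniqueness of id2 is an assumption suggested by the matching t1.count() and joined.count() values.

```python
from collections import Counter


def left_outer_join(left, right, left_key, right_key):
    """Plain-Python left outer join over lists of dicts (illustration only, not Spark)."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)

    joined = []
    for row in left:
        matches = index.get(row[left_key])
        if matches:
            for match in matches:
                joined.append({**row, **match})
        else:
            # No match on the right side: the left row is still kept once.
            joined.append(dict(row))
    return joined


# Hypothetical toy data: id1 is unique in t1, id2 is unique in t2.
t1 = [{"id1": i, "fk1": i % 3} for i in range(10)]
t2 = [{"id2": 0, "name": "a"}, {"id2": 1, "name": "b"}]  # fk1 == 2 has no match

joined = left_outer_join(t1, t2, "fk1", "id2")

row_count = len(joined)
distinct_id1 = len({row["id1"] for row in joined})
# Equivalent of: select id1, count(*) ... group by id1 having count(*) > 1
duplicated_id1 = [key for key, n in Counter(row["id1"] for row in joined).items() if n > 1]

# With unique keys on both sides, all three checks hold on every run.
assert row_count == len(t1)
assert distinct_id1 == len(t1)
assert duplicated_id1 == []
```

In the report, the first check holds (joined.count() equals t1.count()) while the other two fail and change between runs, which is why the issue carries the correctness label.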
[jira] [Updated] (SPARK-12030) Incorrect results when aggregate joined data
[ https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maciej Bryński updated SPARK-12030:
---
    Attachment:     (was: spark.jpg)

> Incorrect results when aggregate joined data
>
>                 Key: SPARK-12030
>                 URL: https://issues.apache.org/jira/browse/SPARK-12030
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Maciej Bryński
>            Assignee: Nong Li
>            Priority: Blocker
>             Fix For: 1.5.3, 1.6.0
[jira] [Updated] (SPARK-12030) Incorrect results when aggregate joined data
[ https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maciej Bryński updated SPARK-12030:
---
    Attachment:     (was: t2.tar.gz)

> Incorrect results when aggregate joined data
>
>                 Key: SPARK-12030
>                 URL: https://issues.apache.org/jira/browse/SPARK-12030
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Maciej Bryński
>            Assignee: Nong Li
>            Priority: Blocker
>             Fix For: 1.5.3, 1.6.0
[jira] [Updated] (SPARK-12030) Incorrect results when aggregate joined data
[ https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maciej Bryński updated SPARK-12030:
---
    Attachment:     (was: t1.tar.gz)

> Incorrect results when aggregate joined data
>
>                 Key: SPARK-12030
>                 URL: https://issues.apache.org/jira/browse/SPARK-12030
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Maciej Bryński
>            Assignee: Nong Li
>            Priority: Blocker
>             Fix For: 1.5.3, 1.6.0
>         Attachments: spark.jpg, t2.tar.gz
[jira] [Updated] (SPARK-12030) Incorrect results when aggregate joined data
[ https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-12030:
---
    Fix Version/s: 1.5.3

> Incorrect results when aggregate joined data
>
>                 Key: SPARK-12030
>                 URL: https://issues.apache.org/jira/browse/SPARK-12030
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Maciej Bryński
>            Assignee: Nong Li
>            Priority: Blocker
>             Fix For: 1.5.3, 1.6.0
>         Attachments: spark.jpg, t1.tar.gz, t2.tar.gz
[jira] [Updated] (SPARK-12030) Incorrect results when aggregate joined data
[ https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-12030:
---
    Assignee: Nong Li

> Incorrect results when aggregate joined data
>
>                 Key: SPARK-12030
>                 URL: https://issues.apache.org/jira/browse/SPARK-12030
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Maciej Bryński
>            Assignee: Nong Li
>            Priority: Blocker
>             Fix For: 1.6.0
>         Attachments: spark.jpg, t1.tar.gz, t2.tar.gz
[jira] [Updated] (SPARK-12030) Incorrect results when aggregate joined data
[ https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-12030:
---
    Target Version/s: 1.6.0
            Priority: Blocker  (was: Critical)

> Incorrect results when aggregate joined data
>
>                 Key: SPARK-12030
>                 URL: https://issues.apache.org/jira/browse/SPARK-12030
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Maciej Bryński
>            Priority: Blocker
>         Attachments: spark.jpg, t1.tar.gz, t2.tar.gz
[jira] [Updated] (SPARK-12030) Incorrect results when aggregate joined data
[ https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maciej Bryński updated SPARK-12030:
---
    Description:
I have the following issue.
I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2)
{code}
t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache()
t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache()
joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer")
{code}
Important: both tables are cached, so results should be the same on every query.
Then I did some counts:
{code}
t1.count() -> 5900729
t1.registerTempTable("t1")
sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729
t2.count() -> 54298
joined.count() -> 5900729
{code}
And here the magic begins - I counted distinct id1 from the joined table
{code}
joined.registerTempTable("joined")
sqlCtx.sql("select distinct(id1) from joined").count()
{code}
Results vary *(are different on every run)* between 5899000 and 590 but are never equal to 5900729.
In addition, I ran more queries:
{code}
sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > 1").collect()
{code}
This gives some results, but this query returns *1*
{code}
len(sqlCtx.sql("select * from joined where id1 = result").collect())
{code}
What's wrong?

  was:
I have the following issue.
I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2)
{code}
t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache()
t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2, id2, 0, size1, 200).cache()
joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer")
{code}
Important: both tables are cached, so results should be the same on every query.
Then I did some counts:
{code}
t1.count() -> 5900729
t1.registerTempTable("t1")
sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729
t2.count() -> 54298
joined.count() -> 5900729
{code}
And here the magic begins - I counted distinct id1 from the joined table
{code}
joined.registerTempTable("joined")
sqlCtx.sql("select distinct(id1) from joined").count()
{code}
Results vary *(are different on every run)* between 5899000 and 590 but are never equal to 5900729.
In addition, I ran more queries:
{code}
sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > 1").collect()
{code}
This gives some results, but this query returns *1*
{code}
len(sqlCtx.sql("select * from joined where id1 = result").collect())
{code}
What's wrong?

> Incorrect results when aggregate joined data
>
>                 Key: SPARK-12030
>                 URL: https://issues.apache.org/jira/browse/SPARK-12030
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Maciej Bryński
>            Priority: Critical
>         Attachments: spark.jpg, t1.tar.gz, t2.tar.gz
[jira] [Updated] (SPARK-12030) Incorrect results when aggregate joined data
[ https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maciej Bryński updated SPARK-12030:
---
    Attachment: spark.jpg

> Incorrect results when aggregate joined data
>
>                 Key: SPARK-12030
>                 URL: https://issues.apache.org/jira/browse/SPARK-12030
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Maciej Bryński
>            Priority: Critical
>         Attachments: spark.jpg, t1.tar.gz, t2.tar.gz
[jira] [Updated] (SPARK-12030) Incorrect results when aggregate joined data
[ https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maciej Bryński updated SPARK-12030:
---
    Attachment: t2.tar.gz
                t1.tar.gz

> Incorrect results when aggregate joined data
>
>                 Key: SPARK-12030
>                 URL: https://issues.apache.org/jira/browse/SPARK-12030
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Maciej Bryński
>            Priority: Critical
>         Attachments: t1.tar.gz, t2.tar.gz
[jira] [Updated] (SPARK-12030) Incorrect results when aggregate joined data
[ https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maciej Bryński updated SPARK-12030:
---
    Description:
I have the following issue.
I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2)
{code}
t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache()
t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2, id2, 0, size1, 200).cache()
joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer")
{code}
Important: both tables are cached, so results should be the same on every query.
Then I did some counts:
{code}
t1.count() -> 5900729
t1.registerTempTable("t1")
sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729
t2.count() -> 54298
joined.count() -> 5900729
{code}
And here the magic begins - I counted distinct id1 from the joined table
{code}
joined.registerTempTable("joined")
sqlCtx.sql("select distinct(id1) from joined").count()
{code}
Results vary *(are different on every run)* between 5899000 and 590 but are never equal to 5900729.
In addition, I ran more queries:
{code}
sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > 1").collect()
{code}
This gives some results, but this query returns *1*
{code}
len(sqlCtx.sql("select * from joined where id1 = result").collect())
{code}
What's wrong?

  was:
I have the following issue.
I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2)
{code}
t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache()
t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2, id2, 0, size1, 200).cache()
joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer")
{code}
Important: both tables are cached
Then I did some counts:
{code}
t1.count() -> 5900729
t1.registerTempTable("t1")
sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729
t2.count() -> 54298
joined.count() -> 5900729
{code}
And here the magic begins - I counted distinct id1 from the joined table
{code}
joined.registerTempTable("joined")
sqlCtx.sql("select distinct(id1) from joined").count()
{code}
Results vary *(are different on every run)* between 5899000 and 590 but are never equal to 5900729.
In addition, I ran more queries:
{code}
sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > 1").collect()
{code}
This gives some results, but this query returns *1*
{code}
len(sqlCtx.sql("select * from joined where id1 = result").collect())
{code}
What's wrong?

> Incorrect results when aggregate joined data
>
>                 Key: SPARK-12030
>                 URL: https://issues.apache.org/jira/browse/SPARK-12030
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Maciej Bryński
>            Priority: Critical
[jira] [Updated] (SPARK-12030) Incorrect results when aggregate joined data
[ https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maciej Bryński updated SPARK-12030:
---
    Summary: Incorrect results when aggregate joined data  (was: Incorrect results when aggregate cached data from JDBC)

> Incorrect results when aggregate joined data
>
>                 Key: SPARK-12030
>                 URL: https://issues.apache.org/jira/browse/SPARK-12030
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Maciej Bryński
>            Priority: Critical