[ https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495928#comment-14495928 ]
Liang-Chi Hsieh edited comment on SPARK-6800 at 4/15/15 9:42 AM:
-----------------------------------------------------------------
About the upper and lower bounds issue, please refer to the PR page, where Michael Armbrust explains why it is not a bug. Only the documentation needs to be updated for this issue. Thanks.

was (Author: viirya):
About the upper and lower bounds issue, please refer to the PR page, where Michael Armbrust explains why it is not a bug. Thanks.

> Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6800
>                 URL: https://issues.apache.org/jira/browse/SPARK-6800
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>        Environment: Windows 8.1, Apache Derby DB, Spark 1.3.0 CDH5.4.0, Scala 2.10
>            Reporter: Micael Capitão
>
> Having a Derby table with people info (id, name, age) defined like this:
> {code}
> val jdbcUrl = "jdbc:derby:memory:PeopleDB;create=true"
> val conn = DriverManager.getConnection(jdbcUrl)
> val stmt = conn.createStatement()
> stmt.execute("CREATE TABLE Person (person_id INT NOT NULL GENERATED ALWAYS AS IDENTITY CONSTRAINT person_pk PRIMARY KEY, name VARCHAR(50), age INT)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Carvalho', 50)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Lurdes Pereira', 23)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Ana Rita Costa', 12)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Pereira', 32)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Miguel Costa', 15)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Anabela Sintra', 13)")
> {code}
> If I try to read that table from Spark SQL with lower/upper bounds, like this:
> {code}
> val people = sqlContext.jdbc(url = jdbcUrl, table = "Person",
>   columnName = "age", lowerBound =
>   0, upperBound = 40, numPartitions = 10)
> people.show()
> {code}
> I get this result:
> {noformat}
> PERSON_ID  NAME              AGE
> 3          Ana Rita Costa    12
> 5          Miguel Costa      15
> 6          Anabela Sintra    13
> 2          Lurdes Pereira    23
> 4          Armando Pereira   32
> 1          Armando Carvalho  50
> {noformat}
> Which is wrong, considering that the defined upper bound has been ignored (I get a person with age 50!).
> Digging into the code, I've found that the WHERE clauses {{JDBCRelation.columnPartition}} generates are the following:
> {code}
> (0) age < 4,0
> (1) age >= 4 AND age < 8,1
> (2) age >= 8 AND age < 12,2
> (3) age >= 12 AND age < 16,3
> (4) age >= 16 AND age < 20,4
> (5) age >= 20 AND age < 24,5
> (6) age >= 24 AND age < 28,6
> (7) age >= 28 AND age < 32,7
> (8) age >= 32 AND age < 36,8
> (9) age >= 36,9
> {code}
> The last condition ignores the upper bound, and the other ones may result in repeated rows being read.
> Using the JdbcRDD (and converting it to a DataFrame) I would have something like this:
> {code}
> val jdbcRdd = new JdbcRDD(sc, () => DriverManager.getConnection(jdbcUrl),
>   "SELECT * FROM Person WHERE age >= ? and age <= ?", 0, 40, 10,
>   rs => (rs.getInt(1), rs.getString(2), rs.getInt(3)))
> val people = jdbcRdd.toDF("PERSON_ID", "NAME", "AGE")
> people.show()
> {code}
> Resulting in:
> {noformat}
> PERSON_ID  NAME             AGE
> 3          Ana Rita Costa   12
> 5          Miguel Costa     15
> 6          Anabela Sintra   13
> 2          Lurdes Pereira   23
> 4          Armando Pereira  32
> {noformat}
> Which is correct!
> Confirming the WHERE clauses generated by the JdbcRDD in {{getPartitions}}, I've found it generates the following:
> {code}
> (0) age >= 0 AND age <= 3
> (1) age >= 4 AND age <= 7
> (2) age >= 8 AND age <= 11
> (3) age >= 12 AND age <= 15
> (4) age >= 16 AND age <= 19
> (5) age >= 20 AND age <= 23
> (6) age >= 24 AND age <= 27
> (7) age >= 28 AND age <= 31
> (8) age >= 32 AND age <= 35
> (9) age >= 36 AND age <= 40
> {code}
> This is the behaviour I was expecting from the Spark SQL version.
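For illustration, the two predicate-generation strategies compared above can be reproduced in a standalone sketch. This is not Spark's actual source; the object and method names are hypothetical, and only the arithmetic mirrors the clauses shown in the quoted output: the Spark SQL style treats the bounds purely as a stride hint and leaves the first and last partitions open-ended (so every row is read, including age 50), while the JdbcRDD style produces inclusive bounds on every partition (so rows outside [0, 40] are filtered).

```scala
// Hypothetical sketch of the two partitioning strategies, for
// lowerBound = 0, upperBound = 40, numPartitions = 10.
object PartitionClauses {
  // Spark SQL JDBCRelation.columnPartition style: bounds only determine
  // the stride; first/last partitions are unbounded on one side so that
  // no row of the table is ever dropped.
  def sqlStyle(column: String, lower: Long, upper: Long, n: Int): Seq[String] = {
    val stride = (upper - lower) / n  // 40 / 10 = 4
    (0 until n).map { i =>
      val lo = if (i == 0) None else Some(s"$column >= ${lower + i * stride}")
      val hi = if (i == n - 1) None else Some(s"$column < ${lower + (i + 1) * stride}")
      (lo ++ hi).mkString(" AND ")  // e.g. "age < 4", ..., "age >= 36"
    }
  }

  // JdbcRDD.getPartitions style: every partition gets inclusive bounds,
  // so rows outside [lower, upper] are excluded by the WHERE clause.
  def rddStyle(column: String, lower: Long, upper: Long, n: Int): Seq[String] = {
    val length = upper - lower + 1  // 41 values in the inclusive range
    (0 until n).map { i =>
      val start = lower + (i * length) / n
      val end = lower + ((i + 1) * length) / n - 1
      s"$column >= $start AND $column <= $end"  // e.g. "age >= 0 AND age <= 3"
    }
  }

  def main(args: Array[String]): Unit = {
    PartitionClauses.sqlStyle("age", 0, 40, 10).foreach(println)
    PartitionClauses.rddStyle("age", 0, 40, 10).foreach(println)
  }
}
```

Running this reproduces both clause lists from the report, which makes the difference concrete: the Spark SQL bounds are a partitioning hint, not a filter, whereas the JdbcRDD bounds are baked into every partition's query.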
> Is the Spark SQL version buggy or is this some weird expected behaviour?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org