[ https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495928#comment-14495928 ]
Liang-Chi Hsieh edited comment on SPARK-6800 at 4/15/15 9:42 AM:
-----------------------------------------------------------------
About the upper and lower bounds issue, please refer to the PR page, where Michael Armbrust explains why it is not a bug. Only the documentation needs to be updated for this issue. Thanks.

was (Author: viirya):
About the upper and lower bounds issue, please refer to the PR page, where Michael Armbrust explains why it is not a bug. Thanks.

> Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6800
>                 URL: https://issues.apache.org/jira/browse/SPARK-6800
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>        Environment: Windows 8.1, Apache Derby DB, Spark 1.3.0 CDH5.4.0, Scala 2.10
>            Reporter: Micael Capitão
>
> Having a Derby table with people info (id, name, age) defined like this:
> {code}
> val jdbcUrl = "jdbc:derby:memory:PeopleDB;create=true"
> val conn = DriverManager.getConnection(jdbcUrl)
> val stmt = conn.createStatement()
> stmt.execute("CREATE TABLE Person (person_id INT NOT NULL GENERATED ALWAYS AS IDENTITY CONSTRAINT person_pk PRIMARY KEY, name VARCHAR(50), age INT)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Carvalho', 50)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Lurdes Pereira', 23)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Ana Rita Costa', 12)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Pereira', 32)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Miguel Costa', 15)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Anabela Sintra', 13)")
> {code}
> If I try to read that table from Spark SQL with lower/upper bounds, like this:
> {code}
> val people = sqlContext.jdbc(url = jdbcUrl, table = "Person",
>   columnName = "age", lowerBound =
>   0, upperBound = 40, numPartitions = 10)
> people.show()
> {code}
> I get this result:
> {noformat}
> PERSON_ID  NAME              AGE
> 3          Ana Rita Costa    12
> 5          Miguel Costa      15
> 6          Anabela Sintra    13
> 2          Lurdes Pereira    23
> 4          Armando Pereira   32
> 1          Armando Carvalho  50
> {noformat}
> Which is wrong, considering that the defined upper bound has been ignored (I get a person with age 50!).
> Digging into the code, I've found that the WHERE clauses {{JDBCRelation.columnPartition}} generates are the following:
> {code}
> (0) age < 4,0
> (1) age >= 4 AND age < 8,1
> (2) age >= 8 AND age < 12,2
> (3) age >= 12 AND age < 16,3
> (4) age >= 16 AND age < 20,4
> (5) age >= 20 AND age < 24,5
> (6) age >= 24 AND age < 28,6
> (7) age >= 28 AND age < 32,7
> (8) age >= 32 AND age < 36,8
> (9) age >= 36,9
> {code}
> The last condition ignores the upper bound, and the other ones may result in repeated rows being read.
> Using the JdbcRDD (and converting it to a DataFrame) I would have something like this:
> {code}
> val jdbcRdd = new JdbcRDD(sc, () => DriverManager.getConnection(jdbcUrl),
>   "SELECT * FROM Person WHERE age >= ? and age <= ?", 0, 40, 10,
>   rs => (rs.getInt(1), rs.getString(2), rs.getInt(3)))
> val people = jdbcRdd.toDF("PERSON_ID", "NAME", "AGE")
> people.show()
> {code}
> Resulting in:
> {noformat}
> PERSON_ID  NAME             AGE
> 3          Ana Rita Costa   12
> 5          Miguel Costa     15
> 6          Anabela Sintra   13
> 2          Lurdes Pereira   23
> 4          Armando Pereira  32
> {noformat}
> Which is correct!
> Confirming the WHERE clauses generated by the JdbcRDD in {{getPartitions}}, I've found it generates the following:
> {code}
> (0) age >= 0 AND age <= 3
> (1) age >= 4 AND age <= 7
> (2) age >= 8 AND age <= 11
> (3) age >= 12 AND age <= 15
> (4) age >= 16 AND age <= 19
> (5) age >= 20 AND age <= 23
> (6) age >= 24 AND age <= 27
> (7) age >= 28 AND age <= 31
> (8) age >= 32 AND age <= 35
> (9) age >= 36 AND age <= 40
> {code}
> This is the behaviour I was expecting from the Spark SQL version.
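For illustration, the two predicate-generation strategies compared above can be reproduced in a standalone sketch. This is not Spark's actual source; the object and method names are hypothetical, and only the arithmetic mirrors the clauses shown in the quoted output: the Spark SQL style treats the bounds purely as a stride hint and leaves the first and last partitions open-ended (so every row is read, including age 50), while the JdbcRDD style produces inclusive bounds on every partition (so rows outside [0, 40] are filtered).

```scala
// Hypothetical sketch of the two partitioning strategies, for
// lowerBound = 0, upperBound = 40, numPartitions = 10.
object PartitionClauses {
  // Spark SQL JDBCRelation.columnPartition style: bounds only determine
  // the stride; first/last partitions are unbounded on one side so that
  // no row of the table is ever dropped.
  def sqlStyle(column: String, lower: Long, upper: Long, n: Int): Seq[String] = {
    val stride = (upper - lower) / n  // 40 / 10 = 4
    (0 until n).map { i =>
      val lo = if (i == 0) None else Some(s"$column >= ${lower + i * stride}")
      val hi = if (i == n - 1) None else Some(s"$column < ${lower + (i + 1) * stride}")
      (lo ++ hi).mkString(" AND ")  // e.g. "age < 4", ..., "age >= 36"
    }
  }

  // JdbcRDD.getPartitions style: every partition gets inclusive bounds,
  // so rows outside [lower, upper] are excluded by the WHERE clause.
  def rddStyle(column: String, lower: Long, upper: Long, n: Int): Seq[String] = {
    val length = upper - lower + 1  // 41 values in the inclusive range
    (0 until n).map { i =>
      val start = lower + (i * length) / n
      val end = lower + ((i + 1) * length) / n - 1
      s"$column >= $start AND $column <= $end"  // e.g. "age >= 0 AND age <= 3"
    }
  }

  def main(args: Array[String]): Unit = {
    PartitionClauses.sqlStyle("age", 0, 40, 10).foreach(println)
    PartitionClauses.rddStyle("age", 0, 40, 10).foreach(println)
  }
}
```

Running this reproduces both clause lists from the report, which makes the difference concrete: the Spark SQL bounds are a partitioning hint, not a filter, whereas the JdbcRDD bounds are baked into every partition's query.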
> Is the Spark SQL version buggy or is this some weird expected behaviour?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org