[jira] [Commented] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.

2015-04-15 Thread Liang-Chi Hsieh (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495945#comment-14495945 ]

Liang-Chi Hsieh commented on SPARK-6800:


PR: https://github.com/apache/spark/pull/5488

> Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions 
> gives incorrect results.
> --
>
> Key: SPARK-6800
> URL: https://issues.apache.org/jira/browse/SPARK-6800
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Windows 8.1, Apache Derby DB, Spark 1.3.0 CDH5.4.0, 
> Scala 2.10
>Reporter: Micael Capitão
>
> Having a Derby table with people info (id, name, age) defined like this:
> {code}
> val jdbcUrl = "jdbc:derby:memory:PeopleDB;create=true"
> val conn = DriverManager.getConnection(jdbcUrl)
> val stmt = conn.createStatement()
> stmt.execute("CREATE TABLE Person (person_id INT NOT NULL GENERATED ALWAYS AS 
> IDENTITY CONSTRAINT person_pk PRIMARY KEY, name VARCHAR(50), age INT)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Carvalho', 50)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Lurdes Pereira', 23)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Ana Rita Costa', 12)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Pereira', 32)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Miguel Costa', 15)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Anabela Sintra', 13)")
> {code}
> If I try to read that table from Spark SQL with lower/upper bounds, like this:
> {code}
> val people = sqlContext.jdbc(url = jdbcUrl, table = "Person",
>   columnName = "age", lowerBound = 0, upperBound = 40, numPartitions = 10)
> people.show()
> {code}
> I get this result:
> {noformat}
> PERSON_ID NAME AGE
> 3 Ana Rita Costa   12 
> 5 Miguel Costa 15 
> 6 Anabela Sintra   13 
> 2 Lurdes Pereira   23 
> 4 Armando Pereira  32 
> 1 Armando Carvalho 50 
> {noformat}
> Which is wrong, considering the defined upper bound has been ignored (I get a 
> person with age 50!).
> Digging into the code, I've found that the WHERE clauses 
> {{JDBCRelation.columnPartition}} generates are the following:
> {code}
> (0) age < 4,0
> (1) age >= 4  AND age < 8,1
> (2) age >= 8  AND age < 12,2
> (3) age >= 12 AND age < 16,3
> (4) age >= 16 AND age < 20,4
> (5) age >= 20 AND age < 24,5
> (6) age >= 24 AND age < 28,6
> (7) age >= 28 AND age < 32,7
> (8) age >= 32 AND age < 36,8
> (9) age >= 36,9
> {code}
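> For reference, here is a minimal sketch of how these clauses appear to be 
> generated (my reconstruction of the logic, not the exact Spark source): a 
> stride is derived from the bounds and partition count, the first partition 
> gets no lower bound, and the last gets no upper bound:
> {code}
> // Hedged sketch of the clause generation in JDBCRelation.columnPartition.
> val lowerBound = 0L; val upperBound = 40L; val numPartitions = 10
> val stride = upperBound / numPartitions - lowerBound / numPartitions  // = 4
> var current = lowerBound
> val clauses = (0 until numPartitions).map { i =>
>   val lo = if (i != 0) Some(s"age >= $current") else None  // first: unbounded below
>   current += stride
>   val hi = if (i != numPartitions - 1) Some(s"age < $current") else None  // last: unbounded above
>   (lo ++ hi).mkString(" AND ")
> }
> clauses.zipWithIndex.foreach { case (c, i) => println(s"($i) $c,$i") }
> {code}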
> The last condition ignores the upper bound and the other ones may result in 
> repeated rows being read.
> Using the JdbcRDD (and converting it to a DataFrame) I would have something 
> like this:
> {code}
> val jdbcRdd = new JdbcRDD(sc, () => DriverManager.getConnection(jdbcUrl),
>   "SELECT * FROM Person WHERE age >= ? and age <= ?", 0, 40, 10,
>   rs => (rs.getInt(1), rs.getString(2), rs.getInt(3)))
> val people = jdbcRdd.toDF("PERSON_ID", "NAME", "AGE")
> people.show()
> {code}
> Resulting in:
> {noformat}
> PERSON_ID NAMEAGE
> 3 Ana Rita Costa  12 
> 5 Miguel Costa15 
> 6 Anabela Sintra  13 
> 2 Lurdes Pereira  23 
> 4 Armando Pereira 32 
> {noformat}
> Which is correct!
> Checking the WHERE clauses the JdbcRDD generates in {{getPartitions}}, I've 
> found the following:
> {code}
> (0) age >= 0  AND age <= 3
> (1) age >= 4  AND age <= 7
> (2) age >= 8  AND age <= 11
> (3) age >= 12 AND age <= 15
> (4) age >= 16 AND age <= 19
> (5) age >= 20 AND age <= 23
> (6) age >= 24 AND age <= 27
> (7) age >= 28 AND age <= 31
> (8) age >= 32 AND age <= 35
> (9) age >= 36 AND age <= 40
> {code}
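> These inclusive per-partition bounds match JdbcRDD's partition math, which (a 
> sketch assuming the 1.3-era logic) splits the inclusive value range evenly and 
> substitutes the endpoints for the two {{?}} placeholders:
> {code}
> // Hedged sketch of the bound computation in JdbcRDD.getPartitions.
> val lowerBound = 0L; val upperBound = 40L; val numPartitions = 10
> val length = BigInt(1) + upperBound - lowerBound  // 41 values in [0, 40], bounds inclusive
> val bounds = (0 until numPartitions).map { i =>
>   val start = lowerBound + (i * length / numPartitions).toLong
>   val end = lowerBound + ((i + 1) * length / numPartitions).toLong - 1
>   (start, end)
> }
> bounds.zipWithIndex.foreach { case ((s, e), i) => println(s"($i) age >= $s AND age <= $e") }
> {code}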
> This is the behaviour I was expecting from the Spark SQL version. Is the 
> Spark SQL version buggy or is this some weird expected behaviour?






[jira] [Commented] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.

2015-04-15 Thread Micael Capitão (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495940#comment-14495940 ]

Micael Capitão commented on SPARK-6800:
---

Could you please point me to that page?
I'm a bit confused, because not honoring the upper/lower bounds would cause the 
whole table to be fetched, leading to the issue I've pointed out here, in which 
a person with an out-of-bounds age is retrieved.
Thanks.




[jira] [Commented] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.

2015-04-15 Thread Liang-Chi Hsieh (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495928#comment-14495928 ]

Liang-Chi Hsieh commented on SPARK-6800:


About the upper and lower bounds issue, please refer to the PR page, where 
Michael Armbrust explains why it is not a bug. Thanks.




[jira] [Commented] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.

2015-04-15 Thread Micael Capitão (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495926#comment-14495926 ]

Micael Capitão commented on SPARK-6800:
---

My fault, I'm sorry. I looked in the wrong place when retrieving the WHERE 
clauses.
"age >= 8  AND age < 12,2" means where="age >= 8  AND age < 12" and partition=2.

Fixing the lower/upper bounds issue is all that is needed.
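
For context, the trailing number in each pair appears to be Spark's partition index; roughly (a hedged sketch of the shape, not the exact class from the Spark source):
{code}
// Hedged sketch: each printed pair is (whereClause, partition index).
case class JDBCPartition(whereClause: String, idx: Int)
val p = JDBCPartition("age >= 8  AND age < 12", 2)
println(s"${p.whereClause},${p.idx}")  // prints: age >= 8  AND age < 12,2
{code}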




[jira] [Commented] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.

2015-04-15 Thread Liang-Chi Hsieh (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495921#comment-14495921 ]

Liang-Chi Hsieh commented on SPARK-6800:


The ranges for partitions 1 to 8 do not overlap. They are:
(1) 4 <= age < 8
(2) 8 <= age < 12
(3) 12 <= age < 16
(4) 16 <= age < 20
(5) 20 <= age < 24
(6) 24 <= age < 28
(7) 28 <= age < 32
(8) 32 <= age < 36
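
A quick illustrative check (my own snippet, not taken from the Spark code) that these half-open ranges are pairwise disjoint:
{code}
// The intervals [4k, 4(k+1)) for k = 1..8 tile [4, 36) with no overlap.
val ranges = (1 to 8).map(k => (4 * k, 4 * (k + 1)))
val covered = ranges.flatMap { case (lo, hi) => lo until hi }
assert(covered.size == covered.distinct.size)  // no age falls into two ranges
{code}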




[jira] [Commented] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.

2015-04-15 Thread Micael Capitão (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495908#comment-14495908 ]

Micael Capitão commented on SPARK-6800:
---

In the example I've provided, no rows are read twice, but only because of my 
particular dataset.
But if the queries generated are like this:
(0) age < 4,0
(1) age >= 4  AND age < 8,1
(2) age >= 8  AND age < 12,2
(3) age >= 12 AND age < 16,3
(4) age >= 16 AND age < 20,4
(5) age >= 20 AND age < 24,5
(6) age >= 24 AND age < 28,6
(7) age >= 28 AND age < 32,7
(8) age >= 32 AND age < 36,8
(9) age >= 36,9

... the ranges for partitions 1 to 8 overlap. In a real, more complex scenario, 
the same record could end up in multiple partitions.
The title of the issue does not mention that problem, but the description 
states it...




[jira] [Commented] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.

2015-04-14 Thread Liang-Chi Hsieh (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495570#comment-14495570 ]

Liang-Chi Hsieh commented on SPARK-6800:


According to the explanation from Michael Armbrust, {{lowerBound}} and 
{{upperBound}} are used only to decide the partition stride, not to filter 
rows, so every table row ends up in some partition. This is therefore not a 
bug, though of course the documentation needs to be updated to state this 
clearly.
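
If filtering is actually wanted, one possible workaround (my suggestion here, not something prescribed elsewhere in this thread) is to apply the bounds explicitly as a predicate on top of the partitioned read:
{code}
// Hypothetical workaround: the bounds only set the stride, so filter explicitly.
val people = sqlContext.jdbc(url = jdbcUrl, table = "Person",
  columnName = "age", lowerBound = 0, upperBound = 40, numPartitions = 10)
val bounded = people.filter("age >= 0 AND age <= 40")  // enforce the bounds as a row filter
bounded.show()
{code}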





[jira] [Commented] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.

2015-04-14 Thread Liang-Chi Hsieh (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495564#comment-14495564 ]

Liang-Chi Hsieh commented on SPARK-6800:


Why would the other ones result in repeated rows being read?




[jira] [Commented] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.

2015-04-13 Thread Micael Capitão (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492364#comment-14492364 ]

Micael Capitão commented on SPARK-6800:
---

The above pull request seems to fix only the upper and lower bounds issue. The 
intermediate-queries issue, which may result in repeated rows being fetched 
from the DB, remains.




[jira] [Commented] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.

2015-04-13 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492244#comment-14492244 ]

Apache Spark commented on SPARK-6800:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/5488
