[jira] [Commented] (SPARK-19692) Comparison on BinaryType has incorrect results
    [ https://issues.apache.org/jira/browse/SPARK-19692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878443#comment-15878443 ]

Sean Owen commented on SPARK-19692:
-----------------------------------

Bytes are signed in the JVM, and thus in Scala and Java. It's always been this way everywhere and isn't specific to Spark. 0x8C, stored in a byte, is a way of writing -116, not a positive value.

> Comparison on BinaryType has incorrect results
> ----------------------------------------------
>
>                 Key: SPARK-19692
>                 URL: https://issues.apache.org/jira/browse/SPARK-19692
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Don Smith
>
> I believe there is an issue with comparisons on binary fields:
> {code}
> val sc = SparkSession.builder.appName("test").getOrCreate()
> val schema = StructType(Seq(StructField("ip", BinaryType)))
> val ips = Seq("1.1.1.1", "2.2.2.2", "200.10.6.7").map(s =>
>   InetAddress.getByName(s).getAddress)
> val df = sc.createDataFrame(
>   sc.sparkContext.parallelize(ips, 1).map { ip =>
>     Row(ip)
>   }, schema
> )
> val query = df
>   .where(df("ip") >= InetAddress.getByName("200.10.0.0").getAddress)
>   .where(df("ip") <= InetAddress.getByName("200.10.255.255").getAddress)
> logger.info(query.explain(true))
> val results = query.collect()
> results.length mustEqual 1
> {code}
> returns no results.
> i believe the problem is that the comparison is coercing the bytes to signed
> integers in the call to compareTo here in TypeUtils:
> {code}
> def compareBinary(x: Array[Byte], y: Array[Byte]): Int = {
>   for (i <- 0 until x.length; if i < y.length) {
>     val res = x(i).compareTo(y(i))
>     if (res != 0) return res
>   }
>   x.length - y.length
> }
> {code}
> with some hacky testing i was able to get the desired results with:
> {code}
> val res = (x(i).toByte & 0xff) - (y(i).toByte & 0xff)
> {code}
> thanks!

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
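The point about signed bytes can be checked outside Spark. The following is a minimal sketch in plain Scala; the object name `SignedByteDemo` and its fields are hypothetical and exist only for illustration. It shows the same bit pattern read as a signed byte and as an unsigned value.

```scala
object SignedByteDemo {
  // 0x8C is the Int literal 140. Narrowing it to a Byte keeps the low 8 bits,
  // which the JVM reads as the two's-complement value -116.
  val b: Byte = 0x8C.toByte

  // Masking with 0xff widens to Int and recovers the unsigned value (0 to 255).
  val masked: Int = b & 0xff

  // java.lang.Byte.toUnsignedInt (Java 8 and later) performs the same conversion.
  val viaLibrary: Int = java.lang.Byte.toUnsignedInt(b)

  def main(args: Array[String]): Unit =
    println(s"signed=$b masked=$masked viaLibrary=$viaLibrary")
}
```

Any comparison that uses the signed value directly will order 0x8C below 0x00, which is the behavior discussed in this thread.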
[jira] [Commented] (SPARK-19692) Comparison on BinaryType has incorrect results
    [ https://issues.apache.org/jira/browse/SPARK-19692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878431#comment-15878431 ]

Don Smith commented on SPARK-19692:
-----------------------------------

An even more trivial example:
{code}
val sc = SparkSession.builder.appName("test").getOrCreate()
val schema = StructType(Seq(StructField("byte", BinaryType)))
val byte = Seq(Array(0x8C.toByte))
val df = sc.createDataFrame(
  sc.sparkContext.parallelize(byte, 1).map { ip =>
    SQLRow(ip)
  }, schema
)
logger.info(df.show)
val query = df
  .where(df("byte") >= Array(0x00.toByte))
  .where(df("byte") <= Array(0xFF.toByte))
logger.info(query.explain(true))
val results = query.collect()
results.length mustEqual 1
{code}
I'm having trouble believing this is the expected behavior, and if it is, is it defined somewhere?
[jira] [Commented] (SPARK-19692) Comparison on BinaryType has incorrect results
    [ https://issues.apache.org/jira/browse/SPARK-19692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878202#comment-15878202 ]

Sean Owen commented on SPARK-19692:
-----------------------------------

That doesn't sound like a bug. Bytes are signed in Java. If you want to interpret them otherwise you'd need to convert them or provide a different comparison.
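As an illustration of what "a different comparison" could look like, here is a minimal sketch of an unsigned lexicographic comparison in plain Scala. The object `UnsignedBytes` and its members are hypothetical names, not part of Spark, and this is not presented as how Spark does or should implement BinaryType ordering.

```scala
object UnsignedBytes {
  // Lexicographic comparison that treats each byte as a value in 0 to 255.
  // Returns a negative number, zero, or a positive number, like compareTo.
  def compare(x: Array[Byte], y: Array[Byte]): Int = {
    val n = math.min(x.length, y.length)
    var i = 0
    while (i < n) {
      // Masking with 0xff converts the signed byte to its unsigned Int value.
      val res = (x(i) & 0xff) - (y(i) & 0xff)
      if (res != 0) return res
      i += 1
    }
    // All shared positions are equal, so the shorter array sorts first.
    x.length - y.length
  }

  // Hypothetical Ordering wrapper so the comparison works with sorted, max, etc.
  val ordering: Ordering[Array[Byte]] =
    Ordering.fromLessThan[Array[Byte]]((a, b) => compare(a, b) < 0)
}
```

With this comparison, `UnsignedBytes.compare(Array(0x8C.toByte), Array(0x00.toByte))` is positive and the comparison against `Array(0xFF.toByte)` is negative, so 0x8C falls between 0x00 and 0xFF.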
[jira] [Commented] (SPARK-19692) Comparison on BinaryType has incorrect results
    [ https://issues.apache.org/jira/browse/SPARK-19692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877869#comment-15877869 ]

Takeshi Yamamuro commented on SPARK-19692:
------------------------------------------

ISTM the query is correct; am I missing something?
{code}
scala> import java.net.InetAddress
scala> import org.apache.spark.sql.types._
scala> val df = Seq("1.1.1.1", "2.2.2.2", "200.10.6.7").map(d => Tuple1(InetAddress.getByName(d).getAddress)).toDF("ip")
df: org.apache.spark.sql.DataFrame = [ip: binary]

scala> df.where($"ip" >= InetAddress.getByName("200.10.0.0").getAddress).show
+-------------+
|           ip|
+-------------+
|[01 01 01 01]|
|[02 02 02 02]|
|[C8 0A 06 07]|
+-------------+

scala> df.where($"ip" <= InetAddress.getByName("200.10.255.255").getAddress).show
+---+
| ip|
+---+
+---+
{code}
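The two result sets in the comment above are what a signed, byte-by-byte comparison produces. The sketch below reproduces that outcome in plain Scala without Spark. `SignedCompareTrace`, `compareSigned`, and `addr` are hypothetical names, and `compareSigned` is only assumed to mirror the `compareBinary` logic quoted in the issue description.

```scala
import java.net.InetAddress

object SignedCompareTrace {
  // Lexicographic comparison using the signed value of each byte.
  def compareSigned(x: Array[Byte], y: Array[Byte]): Int = {
    val n = math.min(x.length, y.length)
    var i = 0
    while (i < n) {
      val res = java.lang.Byte.compare(x(i), y(i))
      if (res != 0) return res
      i += 1
    }
    x.length - y.length
  }

  // Literal IPv4 strings are parsed without a DNS lookup.
  def addr(s: String): Array[Byte] = InetAddress.getByName(s).getAddress

  val ips: Seq[String] = Seq("1.1.1.1", "2.2.2.2", "200.10.6.7")
  val lower: Array[Byte] = addr("200.10.0.0")
  val upper: Array[Byte] = addr("200.10.255.255")

  // Addresses that pass each filter on its own under signed comparison.
  val passLower: Seq[String] = ips.filter(ip => compareSigned(addr(ip), lower) >= 0)
  val passUpper: Seq[String] = ips.filter(ip => compareSigned(addr(ip), upper) <= 0)
}
```

Here the first octet 200 is stored as -56 and 255 as -1. Every address passes the lower-bound filter, since 1 and 2 are greater than -56 and 6 is greater than 0 in the third octet. None passes the upper-bound filter, since 6 is greater than -1. The combined query therefore returns no rows.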