[jira] [Created] (SPARK-38009) In start-thriftserver.sh arguments, "--hiveconf xxx" should have higher precedence over "--conf spark.hadoop.xxx", or any other hadoop configurations
Peng Cheng created SPARK-38009: -- Summary: In start-thriftserver.sh arguments, "--hiveconf xxx" should have higher precedence over "--conf spark.hadoop.xxx", or any other hadoop configurations Key: SPARK-38009 URL: https://issues.apache.org/jira/browse/SPARK-38009 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0, 2.4.8 Environment: The above experiment is conducted on Apache Spark 2.4.7 & 3.2.0 respectively. OS: Ubuntu 20.04 Java: OpenJDK 1.8.0 Reporter: Peng Cheng

By convention, an Apache Hive server reads configuration options from different sources with different precedence, and the precedence of "--hiveconf" command-line options should be lower only than that of options set with the *SET* command (see [https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration] for details). It should be higher than the Hadoop configuration, or any of the configuration files on the server (including, but not limited to, hive-site.xml and core-site.xml). This convention is clearly not maintained very well by the Apache Spark thrift server, as demonstrated in the following example. If I start the server with diverging values for "hive.server2.thrift.port":

```
./sbin/start-thriftserver.sh \
  --conf spark.hadoop.hive.server2.thrift.port=10001 \
  --hiveconf hive.server2.thrift.port=10002
```

"--conf"/port 10001 will be preferred over "--hiveconf"/port 10002:

```
Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/bin/java -cp /home/xxx/spark-2.4.7-bin-hadoop2.7-scala2.12/conf/:/home/xxx/spark-2.4.7-bin-hadoop2.7-scala2.12/jars/* -Xmx1g org.apache.spark.deploy.SparkSubmit --conf spark.hadoop.hive.server2.thrift.port=10001 --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --name Thrift JDBC/ODBC Server spark-internal --hiveconf hive.server2.thrift.port=10002
...
22/01/24 17:32:18 INFO ThriftCLIService: Starting ThriftBinaryCLIService on port 10001 with 5...500 worker threads
```

Replacing the "--conf" line with an equivalent entry in core-site.xml makes no difference. I doubt that this divergence from conventional Hive server behaviour is deliberate, so I'm asking for the precedence of Hive configuration options to be made equal, or maximally similar, to that of an Apache Hive server of the same version. To my knowledge, it should be: SET command > --hiveconf > hive-site.xml > hive-default.xml > --conf > core-site.xml > core-default.xml -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
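The requested resolution order amounts to a fold over configuration sources from lowest to highest priority, where later sources override earlier ones. The sketch below only illustrates that proposed precedence chain; none of these names exist in Spark's actual configuration code.

```scala
// Hypothetical sketch of the requested precedence chain. Sources are ordered
// from lowest to highest priority; merging left-to-right lets higher-priority
// sources override lower-priority ones. All names here are illustrative.
object ConfPrecedence {
  def resolve(sourcesLowToHigh: Seq[Map[String, String]]): Map[String, String] =
    sourcesLowToHigh.foldLeft(Map.empty[String, String])(_ ++ _)
}

object ConfPrecedenceDemo extends App {
  val coreSite  = Map("hive.server2.thrift.port" -> "10000") // core-site.xml
  val sparkConf = Map("hive.server2.thrift.port" -> "10001") // --conf spark.hadoop.*
  val hiveConf  = Map("hive.server2.thrift.port" -> "10002") // --hiveconf
  // Under the requested ordering, --hiveconf wins:
  val resolved = ConfPrecedence.resolve(Seq(coreSite, sparkConf, hiveConf))
  println(resolved("hive.server2.thrift.port")) // "10002"
}
```

Under this ordering the thrift server in the example above would bind to port 10002, matching conventional Hive server behaviour.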
[jira] [Updated] (SPARK-32992) In OracleDialect, "RowID" SQL type should be converted into "String" Catalyst type
[ https://issues.apache.org/jira/browse/SPARK-32992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Cheng updated SPARK-32992: --- Description: Most JDBC drivers use the long SQL type for dataset row IDs (in org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils): {code:java} private def getCatalystType( sqlType: Int, precision: Int, scale: Int, signed: Boolean): DataType = { val answer = sqlType match { // scalastyle:off ... case java.sql.Types.ROWID => LongType ... case _ => throw new SQLException("Unrecognized SQL type " + sqlType) // scalastyle:on } if (answer == null) { throw new SQLException("Unsupported type " + JDBCType.valueOf(sqlType).getName) } answer {code} Oracle JDBC drivers (of all versions) are a rare exception: only a String value can be extracted (in oracle.jdbc.driver.RowidAccessor, decompiled bytecode): {code:java} ... String getString(int var1) throws SQLException { return this.isNull(var1) ? null : this.rowData.getString(this.getOffset(var1), this.getLength(var1), this.statement.connection.conversion.getCharacterSet((short)1)); } Object getObject(int var1) throws SQLException { return this.getROWID(var1); } ...
{code} This caused an exception to be thrown when importing datasets from an Oracle DB, as reported in [https://stackoverflow.com/questions/52244492/spark-jdbc-dataframereader-fails-to-read-oracle-table-with-datatype-as-rowid:] {code:java} 18/09/08 11:38:17 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 5.0 (TID 23, gbrdsr02985.intranet.barcapint.com, executor 21): java.sql.SQLException: Invalid column type: getLong not implemented for class oracle.jdbc.driver.T4CRowidAccessor at oracle.jdbc.driver.GeneratedAccessor.getLong(GeneratedAccessor.java:440) at oracle.jdbc.driver.GeneratedStatement.getLong(GeneratedStatement.java:228) at oracle.jdbc.driver.GeneratedScrollableResultSet.getLong(GeneratedScrollableResultSet.java:620) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:365) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:364) {code} Therefore, the default SQL type => Catalyst type conversion rule should be overridden in OracleDialect. Specifically, the following rule should be added: {code:java} case Types.ROWID => Some(StringType) {code} was: Most JDBC drivers use the long SQL type for dataset row IDs (in org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils): {code:java} private def getCatalystType( sqlType: Int, precision: Int, scale: Int, signed: Boolean): DataType = { val answer = sqlType match { // scalastyle:off ... case java.sql.Types.ROWID => LongType ...
case _ => throw new SQLException("Unrecognized SQL type " + sqlType) // scalastyle:on } if (answer == null) { throw new SQLException("Unsupported type " + JDBCType.valueOf(sqlType).getName) } answer {code} Oracle JDBC drivers (of all versions) are a rare exception: only a String value can be extracted (in oracle.jdbc.driver.RowidAccessor, decompiled bytecode): {code:java} ... String getString(int var1) throws SQLException { return this.isNull(var1) ? null : this.rowData.getString(this.getOffset(var1), this.getLength(var1), this.statement.connection.conversion.getCharacterSet((short)1)); } Object getObject(int var1) throws SQLException { return this.getROWID(var1); } ... {code} This caused an exception to be thrown when importing datasets from an Oracle DB, as reported in [https://stackoverflow.com/questions/52244492/spark-jdbc-dataframereader-fails-to-read-oracle-table-with-datatype-as-rowid:] {code:java} 18/09/08 11:38:17 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 5.0 (TID 23, gbrdsr02985.intranet.barcapint.com, executor 21): java.sql.SQLException: Invalid column type: getLong not implemented for class oracle.jdbc.driver.T4CRowidAccessor at oracle.jdbc.driver.GeneratedAccessor.getLong(GeneratedAccessor.java:440) at oracle.jdbc.driver.GeneratedStatement.getLong(GeneratedStatement.java:228) at oracle.jdbc.driver.GeneratedScrollableResultSet.getLong(GeneratedScrollableResultSet.java:620) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:365) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:364) {code} Therefore, the default SQL type => Catalyst type conversion rule should be overridden in OracleDialect.
Specifically, the following rule should be added: {code:java} case Types.ROWID => Some(StringType) {code} > In OracleDialect, "RowID" SQL type should be converted into "String" Catalyst type
[jira] [Updated] (SPARK-32992) In OracleDialect, "RowID" SQL type should be converted into "String" Catalyst type
[ https://issues.apache.org/jira/browse/SPARK-32992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Cheng updated SPARK-32992: --- Description: Most JDBC drivers use the long SQL type for dataset row IDs (in org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils): {code:java} private def getCatalystType( sqlType: Int, precision: Int, scale: Int, signed: Boolean): DataType = { val answer = sqlType match { // scalastyle:off ... case java.sql.Types.ROWID => LongType ... case _ => throw new SQLException("Unrecognized SQL type " + sqlType) // scalastyle:on } if (answer == null) { throw new SQLException("Unsupported type " + JDBCType.valueOf(sqlType).getName) } answer {code} Oracle JDBC drivers (of all versions) are a rare exception: only a String value can be extracted (in oracle.jdbc.driver.RowidAccessor, decompiled bytecode): {code:java} ... String getString(int var1) throws SQLException { return this.isNull(var1) ? null : this.rowData.getString(this.getOffset(var1), this.getLength(var1), this.statement.connection.conversion.getCharacterSet((short)1)); } Object getObject(int var1) throws SQLException { return this.getROWID(var1); } ...
{code} This caused an exception to be thrown when importing datasets from an Oracle DB, as reported in [https://stackoverflow.com/questions/52244492/spark-jdbc-dataframereader-fails-to-read-oracle-table-with-datatype-as-rowid:] {code:java} 18/09/08 11:38:17 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 5.0 (TID 23, gbrdsr02985.intranet.barcapint.com, executor 21): java.sql.SQLException: Invalid column type: getLong not implemented for class oracle.jdbc.driver.T4CRowidAccessor at oracle.jdbc.driver.GeneratedAccessor.getLong(GeneratedAccessor.java:440) at oracle.jdbc.driver.GeneratedStatement.getLong(GeneratedStatement.java:228) at oracle.jdbc.driver.GeneratedScrollableResultSet.getLong(GeneratedScrollableResultSet.java:620) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:365) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:364) {code} Therefore, the default SQL type => Catalyst type conversion rule should be overridden in OracleDialect. Specifically, the following rule should be added: {code:java} case Types.ROWID => Some(StringType) {code} was: Most JDBC drivers use the long SQL type for dataset row IDs (in org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils): ```scala private def getCatalystType( sqlType: Int, precision: Int, scale: Int, signed: Boolean): DataType = { val answer = sqlType match { // scalastyle:off ... case java.sql.Types.ROWID => LongType ...
case _ => throw new SQLException("Unrecognized SQL type " + sqlType) // scalastyle:on } if (answer == null) { throw new SQLException("Unsupported type " + JDBCType.valueOf(sqlType).getName) } answer } ``` Oracle JDBC drivers (of all versions) are a rare exception: only a String value can be extracted (in oracle.jdbc.driver.RowidAccessor, decompiled bytecode): ```java ... String getString(int var1) throws SQLException { return this.isNull(var1) ? null : this.rowData.getString(this.getOffset(var1), this.getLength(var1), this.statement.connection.conversion.getCharacterSet((short)1)); } Object getObject(int var1) throws SQLException { return this.getROWID(var1); } ... ``` This caused the exception reported in [https://stackoverflow.com/questions/52244492/spark-jdbc-dataframereader-fails-to-read-oracle-table-with-datatype-as-rowid:] ``` 18/09/08 11:38:17 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 5.0 (TID 23, gbrdsr02985.intranet.barcapint.com, executor 21): java.sql.SQLException: Invalid column type: getLong not implemented for class oracle.jdbc.driver.T4CRowidAccessor at oracle.jdbc.driver.GeneratedAccessor.getLong(GeneratedAccessor.java:440) at oracle.jdbc.driver.GeneratedStatement.getLong(GeneratedStatement.java:228) at oracle.jdbc.driver.GeneratedScrollableResultSet.getLong(GeneratedScrollableResultSet.java:620) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:365) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:364) ``` Therefore, the default SQL type => Catalyst type conversion rule should be overridden in OracleDialect. Specifically, the following rule should be added: ``` case Types.ROWID => Some(StringType) ``` > In OracleDialect, "RowID" SQL type should be converted into "String" Catalyst type
[jira] [Created] (SPARK-32992) In OracleDialect, "RowID" SQL type should be converted into "String" Catalyst type
Peng Cheng created SPARK-32992: -- Summary: In OracleDialect, "RowID" SQL type should be converted into "String" Catalyst type Key: SPARK-32992 URL: https://issues.apache.org/jira/browse/SPARK-32992 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.7, 3.1.0 Reporter: Peng Cheng Most JDBC drivers use the long SQL type for dataset row IDs (in org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils): ```scala private def getCatalystType( sqlType: Int, precision: Int, scale: Int, signed: Boolean): DataType = { val answer = sqlType match { // scalastyle:off ... case java.sql.Types.ROWID => LongType ... case _ => throw new SQLException("Unrecognized SQL type " + sqlType) // scalastyle:on } if (answer == null) { throw new SQLException("Unsupported type " + JDBCType.valueOf(sqlType).getName) } answer } ``` Oracle JDBC drivers (of all versions) are a rare exception: only a String value can be extracted (in oracle.jdbc.driver.RowidAccessor, decompiled bytecode): ```java ... String getString(int var1) throws SQLException { return this.isNull(var1) ? null : this.rowData.getString(this.getOffset(var1), this.getLength(var1), this.statement.connection.conversion.getCharacterSet((short)1)); } Object getObject(int var1) throws SQLException { return this.getROWID(var1); } ...
``` This caused the exception reported in [https://stackoverflow.com/questions/52244492/spark-jdbc-dataframereader-fails-to-read-oracle-table-with-datatype-as-rowid:] ``` 18/09/08 11:38:17 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 5.0 (TID 23, gbrdsr02985.intranet.barcapint.com, executor 21): java.sql.SQLException: Invalid column type: getLong not implemented for class oracle.jdbc.driver.T4CRowidAccessor at oracle.jdbc.driver.GeneratedAccessor.getLong(GeneratedAccessor.java:440) at oracle.jdbc.driver.GeneratedStatement.getLong(GeneratedStatement.java:228) at oracle.jdbc.driver.GeneratedScrollableResultSet.getLong(GeneratedScrollableResultSet.java:620) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:365) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:364) ``` Therefore, the default SQL type => Catalyst type conversion rule should be overridden in OracleDialect. Specifically, the following rule should be added: ``` case Types.ROWID => Some(StringType) ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
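The proposed one-line rule can be exercised outside Spark with a plain stand-in for the dialect's type mapping. The function below is illustrative only: it mirrors the shape of OracleDialect.getCatalystType, but returns plain string labels instead of Catalyst DataType objects so that it runs with just the JDK.

```scala
import java.sql.Types

object OracleRowIdMapping {
  // Stand-in for the proposed OracleDialect rule: ROWID maps to a string
  // type, and every other type code defers (None) to JdbcUtils' default
  // mapping. "StringType" here is a plain label, not Spark's Catalyst class.
  def getCatalystType(sqlType: Int): Option[String] = sqlType match {
    case Types.ROWID => Some("StringType")
    case _           => None
  }
}

object OracleRowIdMappingDemo extends App {
  println(OracleRowIdMapping.getCatalystType(Types.ROWID))   // Some(StringType)
  println(OracleRowIdMapping.getCatalystType(Types.INTEGER)) // None
}
```

With this override in place, the Oracle driver's getString accessor is used for ROWID columns, avoiding the getLong failure in the stack trace above.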
[jira] [Commented] (SPARK-27025) Speed up toLocalIterator
[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983635#comment-16983635 ] Peng Cheng commented on SPARK-27025: Am I too late for this issue? I submitted SPARK-29852. Do you think it is a viable solution, [~srowen]? > Speed up toLocalIterator > > > Key: SPARK-27025 > URL: https://issues.apache.org/jira/browse/SPARK-27025 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.3.3 >Reporter: Erik van Oosten >Priority: Major > > Method {{toLocalIterator}} fetches the partitions to the driver one by one. > However, as far as I can see, any required computation for the > yet-to-be-fetched-partitions is not kicked off until it is fetched. > Effectively only one partition is being computed at the same time. > Desired behavior: immediately start calculation of all partitions while > retaining the download-a-partition at a time behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29852) Implement parallel preemptive RDD.toLocalIterator and Dataset.toLocalIterator
[ https://issues.apache.org/jira/browse/SPARK-29852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Cheng updated SPARK-29852: --- Issue Type: Improvement (was: New Feature) > Implement parallel preemptive RDD.toLocalIterator and Dataset.toLocalIterator > - > > Key: SPARK-29852 > URL: https://issues.apache.org/jira/browse/SPARK-29852 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Peng Cheng >Priority: Major > Original Estimate: 0h > Remaining Estimate: 0h > > Both RDD and Dataset APIs have 2 methods of collecting data from executors to > the driver: > > # .collect() sets up multiple threads in a job and dumps all data from the executors > into the driver's memory. This is great if data on the driver needs to be accessible > ASAP, but not as efficient if access to partitions can only happen > sequentially, and outright risky if the driver doesn't have enough memory to hold > all the data. > - the solution for issue SPARK-25224 partially alleviates this by delaying > deserialisation of data in InternalRow format, such that only the much > smaller serialised data needs to be held entirely in driver memory. This > solution does not abide by O(1) memory consumption, and thus does not scale to > arbitrarily large datasets. > # .toLocalIterator() fetches one partition in one job at a time, and fetching of > the next partition does not start until sequential access to the previous > partition has concluded. This action abides by O(1) memory consumption and is > great if access to data is sequential and significantly slower than the speed > at which partitions can be shipped from a single executor with one thread. It > becomes inefficient when the sequential access to data has to wait for a > relatively long time for the shipping of the next partition. > The proposed solution is a crossover between the two existing implementations: a > concurrent subroutine that is both CPU and memory bounded.
The solution > allocates a fixed-size resource pool (by default = the number of available CPU > cores) that serves the shipping of partitions concurrently, and blocks > sequential access to partitions' data until shipping is finished (which > usually happens without blocking for partitionID >= 2, due to the fact that > shipping starts much earlier and preemptively). Tenants of the resource pool > can be GC'ed and evicted once sequential access to their data has finished, > which allows more partitions to be fetched much earlier than they are > accessed. The maximum memory consumption is O(m * n), where m is the > predefined concurrency and n is the size of the largest partition. > The following Scala code snippet demonstrates a simple implementation: > > (requires Scala 2.11+ and ScalaTest) > > {code:java} > package org.apache.spark.spike > import java.util.concurrent.ArrayBlockingQueue > import org.apache.spark.rdd.RDD > import org.apache.spark.sql.SparkSession > import org.apache.spark.{FutureAction, SparkContext} > import org.scalatest.FunSpec > import scala.concurrent.Future > import scala.language.implicitConversions > import scala.reflect.ClassTag > import scala.util.{Failure, Success, Try} > class ToLocalIteratorPreemptivelySpike extends FunSpec { > import ToLocalIteratorPreemptivelySpike._ > lazy val sc: SparkContext = > SparkSession.builder().master("local[*]").getOrCreate().sparkContext > it("can be much faster than toLocalIterator") { > val max = 80 > val delay = 100 > val slowRDD = sc.parallelize(1 to max, 8).map { v => > Thread.sleep(delay) > v > } > val (r1, t1) = timed { > slowRDD.toLocalIterator.toList > } > val capacity = 4 > val (r2, t2) = timed { > slowRDD.toLocalIteratorPreemptively(capacity).toList > } > assert(r1 == r2) > println(s"linear: $t1, preemptive: $t2") > assert(t1 > t2 * 2) > assert(t2 > max * delay / capacity) > } > } > object ToLocalIteratorPreemptivelySpike { > case class PartitionExecution[T: ClassTag]( > @transient self: RDD[T], > id:
Int > ) { > def eager: this.type = { > AsArray.future > this > } > case object AsArray { > @transient lazy val future: FutureAction[Array[T]] = { > var result: Array[T] = null > val future = self.context.submitJob[T, Array[T], Array[T]]( > self, > _.toArray, > Seq(id), { (_, data) => > result = data > }, > result > ) > future > } > @transient lazy val now: Array[T] = future.get() > } > } > implicit class RDDFunctions[T: ClassTag](self: RDD[T]) { > import scala.concurrent.ExecutionContext.Implicits.global > def _toLocalIteratorPreemptively(cap
[jira] [Created] (SPARK-29852) Implement parallel preemptive RDD.toLocalIterator and Dataset.toLocalIterator
Peng Cheng created SPARK-29852: -- Summary: Implement parallel preemptive RDD.toLocalIterator and Dataset.toLocalIterator Key: SPARK-29852 URL: https://issues.apache.org/jira/browse/SPARK-29852 Project: Spark Issue Type: New Feature Components: Spark Core, SQL Affects Versions: 2.4.4, 3.0.0 Reporter: Peng Cheng Both RDD and Dataset APIs have 2 methods of collecting data from executors to the driver: # .collect() sets up multiple threads in a job and dumps all data from the executors into the driver's memory. This is great if data on the driver needs to be accessible ASAP, but not as efficient if access to partitions can only happen sequentially, and outright risky if the driver doesn't have enough memory to hold all the data. - the solution for issue SPARK-25224 partially alleviates this by delaying deserialisation of data in InternalRow format, such that only the much smaller serialised data needs to be held entirely in driver memory. This solution does not abide by O(1) memory consumption, and thus does not scale to arbitrarily large datasets. # .toLocalIterator() fetches one partition in one job at a time, and fetching of the next partition does not start until sequential access to the previous partition has concluded. This action abides by O(1) memory consumption and is great if access to data is sequential and significantly slower than the speed at which partitions can be shipped from a single executor with one thread. It becomes inefficient when the sequential access to data has to wait for a relatively long time for the shipping of the next partition. The proposed solution is a crossover between the two existing implementations: a concurrent subroutine that is both CPU and memory bounded.
The solution allocates a fixed-size resource pool (by default = the number of available CPU cores) that serves the shipping of partitions concurrently, and blocks sequential access to partitions' data until shipping is finished (which usually happens without blocking for partitionID >= 2, due to the fact that shipping starts much earlier and preemptively). Tenants of the resource pool can be GC'ed and evicted once sequential access to their data has finished, which allows more partitions to be fetched much earlier than they are accessed. The maximum memory consumption is O(m * n), where m is the predefined concurrency and n is the size of the largest partition. The following Scala code snippet demonstrates a simple implementation: (requires Scala 2.11+ and ScalaTest) {code:java} package org.apache.spark.spike import java.util.concurrent.ArrayBlockingQueue import org.apache.spark.rdd.RDD import org.apache.spark.sql.SparkSession import org.apache.spark.{FutureAction, SparkContext} import org.scalatest.FunSpec import scala.concurrent.Future import scala.language.implicitConversions import scala.reflect.ClassTag import scala.util.{Failure, Success, Try} class ToLocalIteratorPreemptivelySpike extends FunSpec { import ToLocalIteratorPreemptivelySpike._ lazy val sc: SparkContext = SparkSession.builder().master("local[*]").getOrCreate().sparkContext it("can be much faster than toLocalIterator") { val max = 80 val delay = 100 val slowRDD = sc.parallelize(1 to max, 8).map { v => Thread.sleep(delay) v } val (r1, t1) = timed { slowRDD.toLocalIterator.toList } val capacity = 4 val (r2, t2) = timed { slowRDD.toLocalIteratorPreemptively(capacity).toList } assert(r1 == r2) println(s"linear: $t1, preemptive: $t2") assert(t1 > t2 * 2) assert(t2 > max * delay / capacity) } } object ToLocalIteratorPreemptivelySpike { case class PartitionExecution[T: ClassTag]( @transient self: RDD[T], id: Int ) { def eager: this.type = { AsArray.future this } case object AsArray { @transient lazy val future:
FutureAction[Array[T]] = { var result: Array[T] = null val future = self.context.submitJob[T, Array[T], Array[T]]( self, _.toArray, Seq(id), { (_, data) => result = data }, result ) future } @transient lazy val now: Array[T] = future.get() } } implicit class RDDFunctions[T: ClassTag](self: RDD[T]) { import scala.concurrent.ExecutionContext.Implicits.global def _toLocalIteratorPreemptively(capacity: Int): Iterator[Array[T]] = { val executions = self.partitions.indices.map { ii => PartitionExecution(self, ii) } val buffer = new ArrayBlockingQueue[Try[PartitionExecution[T]]](capacity) Future { executions.foreach { exe => buffer.put(Success(exe)) // may be blocking due to capacity exe.eager // non-blocking } }.onFailure { case e: Throwable => buffer.put(Failure(e)) } self
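The snippet above (truncated here) builds on a bounded queue of in-flight partition fetches. The core idea can be shown without Spark at all: a fixed thread pool computes "partitions" ahead of the consumer, while a bounded queue caps how many results may be outstanding, giving the O(m * n) memory bound described above. All names below are illustrative stand-ins, not the proposed Spark API.

```scala
import java.util.concurrent.{ArrayBlockingQueue, Callable, Executors, Future => JFuture}

object BoundedPrefetch {
  // Computes each "partition" (a thunk) on a pool of `capacity` workers, but
  // keeps only about `capacity` results in flight or buffered ahead of the
  // consumer: the feeder thread blocks on the bounded queue before submitting
  // further work.
  def prefetchIterator[T](parts: Seq[() => T], capacity: Int): Iterator[T] = {
    val pool     = Executors.newFixedThreadPool(capacity)
    val inFlight = new ArrayBlockingQueue[JFuture[T]](capacity)
    val feeder = new Thread(new Runnable {
      def run(): Unit = {
        parts.foreach { p =>
          val f = pool.submit(new Callable[T] { def call(): T = p() })
          inFlight.put(f) // blocks once `capacity` fetches are outstanding
        }
        pool.shutdown() // already-submitted work is still allowed to finish
      }
    })
    feeder.setDaemon(true)
    feeder.start()
    // Drain in partition order; `get` blocks only until that fetch completes.
    Iterator.fill(parts.size)(inFlight.take().get())
  }
}
```

As in the ticket's benchmark, slow partitions overlap: with capacity 4, eight partitions that each take d milliseconds complete in roughly 2d rather than 8d, while memory stays proportional to the capacity rather than the dataset.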
[jira] [Commented] (SPARK-6473) Launcher lib shouldn't try to figure out Scala version when not in dev mode
[ https://issues.apache.org/jira/browse/SPARK-6473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16279355#comment-16279355 ] Peng Cheng commented on SPARK-6473: --- It looks like this issue reappeared at some point: getScalaVersion() will be called anyway, even when not in dev/test mode: https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java#L196 It was suppressed simply because SPARK_SCALA_VERSION is always set by the shell scripts. Should we add a test to ensure that it never happens? > Launcher lib shouldn't try to figure out Scala version when not in dev mode > --- > > Key: SPARK-6473 > URL: https://issues.apache.org/jira/browse/SPARK-6473 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.4.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 1.4.0 > > > Thanks to [~nravi] for pointing this out. > The launcher library currently always tries to figure out what's the build's > scala version, even when it's not needed. That code is only used when setting > some dev options, and relies on the layout of the build directories, so it > doesn't work with the directory layout created by make-distribution.sh. > Right now this works on Linux because bin/load-spark-env.sh sets the Scala > version explicitly, but it would break the distribution on Windows, for > example. > Fix is pretty straight-forward. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
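The guard the comment asks to test for can be illustrated with a tiny stand-in: resolve the Scala version from SPARK_SCALA_VERSION when it is present, and only probe the build directories in dev mode. This mirrors the intended launcher behaviour in shape only; none of these names are the launcher's actual API.

```scala
object ScalaVersionGuard {
  // Hypothetical resolution order: the env var wins when set; probing the
  // build directory layout (the fragile step that breaks distributions)
  // happens only in dev mode.
  def scalaVersion(env: Map[String, String],
                   devMode: Boolean,
                   probeBuildDirs: () => String): Option[String] =
    env.get("SPARK_SCALA_VERSION") match {
      case some @ Some(_)  => some
      case None if devMode => Some(probeBuildDirs())
      case None            => None // a packaged distribution must not probe
    }
}
```

A regression test along these lines would pass a probe that throws, then assert it is never invoked outside dev mode.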
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174442#comment-15174442 ] Peng Cheng commented on SPARK-7481: --- +1 Me four > Add Hadoop 2.6+ profile to pull in object store FS accessors > > > Key: SPARK-7481 > URL: https://issues.apache.org/jira/browse/SPARK-7481 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.3.1 >Reporter: Steve Loughran > > To keep the s3n classpath right, to add s3a, swift & azure, the dependencies > of spark in a 2.6+ profile need to add the relevant object store packages > (hadoop-aws, hadoop-openstack, hadoop-azure) > this adds more stuff to the client bundle, but will mean a single spark > package can talk to all of the stores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds unserializable objects into connection properties
[ https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073036#comment-15073036 ] Peng Cheng commented on SPARK-10625: I think the pull request has been merged. Can it be closed now? > Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds > unserializable objects into connection properties > -- > > Key: SPARK-10625 > URL: https://issues.apache.org/jira/browse/SPARK-10625 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.0 > Environment: Ubuntu 14.04 >Reporter: Peng Cheng > Labels: jdbc, spark, sparksql > > Some JDBC drivers (e.g. SAP HANA) try to optimize connection pooling by > adding new objects into the connection properties, which are then reused by > Spark and deployed to workers. When some of these new objects are not > serializable, this triggers an org.apache.spark.SparkException: Task not > serializable. The following test code snippet demonstrates this problem by > using a modified H2 driver: > test("INSERT to JDBC Datasource with UnserializableH2Driver") { > object UnserializableH2Driver extends org.h2.Driver { > override def connect(url: String, info: Properties): Connection = { > val result = super.connect(url, info) > info.put("unserializableDriver", this) > result > } > override def getParentLogger: Logger = ???
> } > import scala.collection.JavaConversions._ > val oldDrivers = > DriverManager.getDrivers.filter(_.acceptsURL("jdbc:h2:")).toSeq > oldDrivers.foreach{ > DriverManager.deregisterDriver > } > DriverManager.registerDriver(UnserializableH2Driver) > sql("INSERT INTO TABLE PEOPLE1 SELECT * FROM PEOPLE") > assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).count) > assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", > properties).collect()(0).length) > DriverManager.deregisterDriver(UnserializableH2Driver) > oldDrivers.foreach{ > DriverManager.registerDriver > } > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
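One way to sidestep driver-injected objects like the one in the test above is to drop any connection-property value that fails Java serialization before the Properties object is shipped to executors. This is only a sketch of that idea, not the fix that was actually merged for this ticket.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}
import java.util.Properties

object SerializableProps {
  // A value survives only if Java serialization accepts it.
  private def canSerialize(v: AnyRef): Boolean =
    try { new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(v); true }
    catch { case _: NotSerializableException => false }

  // Copy only the entries whose values can be serialized, so the resulting
  // Properties object can safely be closed over by a Spark task.
  def filter(props: Properties): Properties = {
    val clean = new Properties()
    props.keySet().toArray.foreach { k =>
      val v = props.get(k)
      if (canSerialize(v)) clean.put(k, v)
    }
    clean
  }
}
```

Applied to the H2 example, the injected "unserializableDriver" entry would be stripped while ordinary String properties (URL, user, password) pass through untouched.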
[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds unserializable objects into connection properties
[ https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059390#comment-15059390 ] Peng Cheng commented on SPARK-10625: https://github.com/apache/spark/pull/8785 This one :) The others have already been withdrawn; forgive me for not knowing the rules for pull requests. > Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds > unserializable objects into connection properties > -- > > Key: SPARK-10625 > URL: https://issues.apache.org/jira/browse/SPARK-10625 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.0 > Environment: Ubuntu 14.04 >Reporter: Peng Cheng > Labels: jdbc, spark, sparksql > > Some JDBC drivers (e.g. SAP HANA) try to optimize connection pooling by > adding new objects into the connection properties, which are then reused by > Spark and deployed to workers. When some of these new objects are not > serializable, this triggers an org.apache.spark.SparkException: Task not > serializable. The following test code snippet demonstrates this problem by > using a modified H2 driver: > test("INSERT to JDBC Datasource with UnserializableH2Driver") { > object UnserializableH2Driver extends org.h2.Driver { > override def connect(url: String, info: Properties): Connection = { > val result = super.connect(url, info) > info.put("unserializableDriver", this) > result > } > override def getParentLogger: Logger = ???
> } > import scala.collection.JavaConversions._ > val oldDrivers = > DriverManager.getDrivers.filter(_.acceptsURL("jdbc:h2:")).toSeq > oldDrivers.foreach{ > DriverManager.deregisterDriver > } > DriverManager.registerDriver(UnserializableH2Driver) > sql("INSERT INTO TABLE PEOPLE1 SELECT * FROM PEOPLE") > assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).count) > assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", > properties).collect()(0).length) > DriverManager.deregisterDriver(UnserializableH2Driver) > oldDrivers.foreach{ > DriverManager.registerDriver > } > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that add unserializable objects into connection properties
[ https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030259#comment-15030259 ] Peng Cheng commented on SPARK-10625: Squash-rebased on master.
[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that add unserializable objects into connection properties
[ https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030257#comment-15030257 ] Peng Cheng commented on SPARK-10625: Rebased on the 1.6 branch.
[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that add unserializable objects into connection properties
[ https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030256#comment-15030256 ] Peng Cheng commented on SPARK-10625: Rebased on the 1.5 branch.
[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that add unserializable objects into connection properties
[ https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030087#comment-15030087 ] Peng Cheng commented on SPARK-10625: Yes, it is 2 months behind the master branch. I've just rebased on master and patched all the unit test errors (a dependency used by the new JDBCRDD now validates that all connection properties can be cast to String, so the deep copy is now used almost everywhere to correct some particularly undisciplined JDBC driver implementations). Is it mergeable now? If you like, I can rebase on the 1.6 branch and issue a pull request shortly.
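The "deep copy" mentioned in the comment above can be sketched in plain Java (a hypothetical helper for illustration, not Spark's actual code): copy only the entries whose key and value are both Strings, so that anything a driver has injected into the Properties is dropped before the object is shipped anywhere.

```java
import java.util.Properties;

public class PropsCopy {
    // Hypothetical helper: keep only String-to-String entries, dropping
    // any non-String objects a JDBC driver may have injected.
    static Properties stringOnlyCopy(Properties in) {
        Properties out = new Properties();
        for (Object key : in.keySet()) {
            Object value = in.get(key);
            if (key instanceof String && value instanceof String) {
                out.setProperty((String) key, (String) value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        p.setProperty("user", "sa");
        p.put("unserializableDriver", new Object()); // injected by a driver
        Properties copy = stringOnlyCopy(p);
        System.out.println(copy.containsKey("unserializableDriver")); // false
        System.out.println(copy.getProperty("user"));                 // sa
    }
}
```

Since a connection can always be re-established from the URL, username, and password alone, dropping the pooled objects only costs the driver its optimization, not correctness.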
[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that add unserializable objects into connection properties
[ https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029361#comment-15029361 ] Peng Cheng commented on SPARK-10625: Hi committers, after the 1.5.2 release this is still not merged, and this bugfix is now blocking us from committing other features (namely JDBC dialects for several databases). Can you tell me what should be done next to make it into the next release? (Resolve conflicts? Add more tests? It should be easy.) Thanks a lot, pals. --Yours, Peng
[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer
[ https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950901#comment-14950901 ] Peng Cheng commented on SPARK-7708: Also, this problem doesn't affect only the closure serializer; if a class contains a SerializableWritable[org.apache.hadoop.conf.Configuration], it will also throw the same deserialization exception.

> Incorrect task serialization with Kryo closure serializer
>
> Key: SPARK-7708
> URL: https://issues.apache.org/jira/browse/SPARK-7708
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.2.2
> Reporter: Akshat Aranya
>
> I've been investigating the use of Kryo for closure serialization with Spark 1.2, and it seems like I've hit upon a bug:
> When a task is serialized before scheduling, the following log message is generated:
> [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, , PROCESS_LOCAL, 302 bytes)
> This message comes from the TaskSetManager, which serializes the task using the closure serializer. Before the message is sent out, the TaskDescription (which includes the original task as a byte array) is serialized again into a byte array with the closure serializer. I added a log message for this in CoarseGrainedSchedulerBackend, which produces the following output:
> [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132
> The serialized size of the TaskDescription (132 bytes) turns out to be _smaller_ than the serialized task that it contains (302 bytes). This implies that TaskDescription.buffer is not getting serialized correctly.
> On the executor side, deserialization produces a null value for TaskDescription.buffer.
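The size argument in the report above can be checked with a small stand-alone example (plain java.io serialization as an illustrative stand-in for the Kryo closure serializer): a correctly serialized wrapper is always at least as large as the byte buffer it carries, so a 132-byte TaskDescription cannot possibly contain a 302-byte serialized task.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SizeCheck {
    // Stand-in for TaskDescription: an object carrying a serialized-task buffer.
    static class Wrapper implements Serializable {
        final byte[] buffer;
        Wrapper(byte[] b) { this.buffer = b; }
    }

    public static void main(String[] args) throws IOException {
        byte[] task = new byte[302]; // same size as the task in the report
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new Wrapper(task));
        }
        // The outer serialized form must exceed the buffer it contains.
        System.out.println(bos.size() > task.length); // prints: true
    }
}
```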
[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer
[ https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950886#comment-14950886 ] Peng Cheng commented on SPARK-7708: This problem is affecting me, and so far there is no way to tell whether it will appear (sometimes a SerializableWritable can be serialized smoothly). Does anybody know how to bypass it?
[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that add unserializable objects into connection properties
[ https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14941775#comment-14941775 ] Peng Cheng commented on SPARK-10625: The patch can be merged immediately; can someone verify and merge it?
[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that add unserializable objects into connection properties
[ https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900980#comment-14900980 ] Peng Cheng commented on SPARK-10625: Exactly as you described: a driver can always be initialized from the username, password, and URL; the other objects are for resource pooling or for saving object-creation costs. I'm not sure if this applies to all drivers, but so far I haven't seen a counter-example.
[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that add unserializable objects into connection properties
[ https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791252#comment-14791252 ] Peng Cheng commented on SPARK-10625: A pull request has been sent that contains 2 extra unit tests and a simple fix: https://github.com/apache/spark/pull/8785 Can you help me validate it and merge it into 1.5.1?
[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that add unserializable objects into connection properties
[ https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746450#comment-14746450 ] Peng Cheng commented on SPARK-10625: Branch: https://github.com/Schedule1/spark/tree/SPARK-10625 I will submit a pull request after I have fixed all the tests.
[jira] [Created] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that add unserializable objects into connection properties
Peng Cheng created SPARK-10625:
--
Summary: Spark SQL JDBC read/write is unable to handle JDBC Drivers that add unserializable objects into connection properties
Key: SPARK-10625
URL: https://issues.apache.org/jira/browse/SPARK-10625
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.5.0, 1.4.1
Environment: Ubuntu 14.04
Reporter: Peng Cheng

Some JDBC drivers (e.g. SAP HANA) try to optimize connection pooling by adding new objects into the connection properties, which Spark then reuses and ships to workers. When any of these new objects is not serializable, this triggers an org.apache.spark.SparkException: Task not serializable. The following test snippet demonstrates the problem using a modified H2 driver:

```scala
test("INSERT to JDBC Datasource with UnserializableH2Driver") {
  object UnserializableH2Driver extends org.h2.Driver {
    override def connect(url: String, info: Properties): Connection = {
      val result = super.connect(url, info)
      info.put("unserializableDriver", this)
      result
    }

    override def getParentLogger: Logger = ???
  }

  import scala.collection.JavaConversions._

  val oldDrivers =
    DriverManager.getDrivers.filter(_.acceptsURL("jdbc:h2:")).toSeq
  oldDrivers.foreach {
    DriverManager.deregisterDriver
  }
  DriverManager.registerDriver(UnserializableH2Driver)

  sql("INSERT INTO TABLE PEOPLE1 SELECT * FROM PEOPLE")

  assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).count)
  assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).collect()(0).length)

  DriverManager.deregisterDriver(UnserializableH2Driver)
  oldDrivers.foreach {
    DriverManager.registerDriver
  }
}
```
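The failure mode described above can be reproduced without Spark or H2 at all: with plain java.io serialization (a minimal sketch of the same mechanism), a single non-serializable value placed into a java.util.Properties makes the whole object fail to serialize.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.util.Properties;

public class UnserializableProps {
    public static void main(String[] args) throws IOException {
        Properties info = new Properties();
        info.setProperty("user", "sa");
        // Mimic the modified H2 driver: inject a non-serializable value.
        info.put("unserializableDriver", new Object());
        try (ObjectOutputStream oos =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            oos.writeObject(info); // Properties is Serializable, its value is not
            System.out.println("serialized");
        } catch (NotSerializableException e) {
            System.out.println("NotSerializableException"); // this path is taken
        }
    }
}
```

In Spark the same exception surfaces as "org.apache.spark.SparkException: Task not serializable", because the Properties object is captured in the task closure shipped to workers.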
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703585#comment-14703585 ] Peng Cheng commented on SPARK-7481: Thanks a lot! A long run to the end.

> Add Hadoop 2.6+ profile to pull in object store FS accessors
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 1.3.1
> Reporter: Steve Loughran
>
> To keep the s3n classpath right, and to add s3a, swift & azure, the dependencies of Spark in a 2.6+ profile need to add the relevant object store packages (hadoop-aws, hadoop-openstack, hadoop-azure).
> This adds more stuff to the client bundle, but will mean a single Spark package can talk to all of the stores.
[jira] [Commented] (SPARK-7442) Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
[ https://issues.apache.org/jira/browse/SPARK-7442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582965#comment-14582965 ] Peng Cheng commented on SPARK-7442: Still not fixed in 1.4.0 ... reverting to Hadoop 2.4 until this is resolved.

> Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
>
> Key: SPARK-7442
> URL: https://issues.apache.org/jira/browse/SPARK-7442
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 1.3.1
> Environment: OS X
> Reporter: Nicholas Chammas
>
> # Download Spark 1.3.1 pre-built for Hadoop 2.6 from the [Spark downloads page|http://spark.apache.org/downloads.html].
> # Add {{localhost}} to your {{slaves}} file and {{start-all.sh}}
> # Fire up PySpark and try reading from S3 with something like this:
> {code}sc.textFile('s3n://bucket/file_*').count(){code}
> # You will get an error like this:
> {code}py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : java.io.IOException: No FileSystem for scheme: s3n{code}
> {{file:///...}} works. Spark 1.3.1 prebuilt for Hadoop 2.4 works. Spark 1.3.0 works.
> It's just the combination of Spark 1.3.1 prebuilt for Hadoop 2.6 accessing S3 that doesn't work.
[jira] [Comment Edited] (SPARK-7442) Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
[ https://issues.apache.org/jira/browse/SPARK-7442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557820#comment-14557820 ] Peng Cheng edited comment on SPARK-7442 at 5/24/15 6:55 PM: Adding jar won't solve the problem: you need to set the following parameters: --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem But in my 2.6 environment the fs implementation in the added jar is ignored by worker's classloader for unknow reason, see http://stackoverflow.com/questions/30426245/apache-spark-classloader-cannot-find-classdef-in-the-jar was (Author: peng): Adding jar won't solve the problem: you need to set the following parameters: --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem But in my 2.6 environment the added jar is ignored by worker's classloader for unknow reason, see http://stackoverflow.com/questions/30426245/apache-spark-classloader-cannot-find-classdef-in-the-jar > Spark 1.3.1 / Hadoop 2.6 prebuilt pacakge has broken S3 filesystem access > - > > Key: SPARK-7442 > URL: https://issues.apache.org/jira/browse/SPARK-7442 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.3.1 > Environment: OS X >Reporter: Nicholas Chammas > > # Download Spark 1.3.1 pre-built for Hadoop 2.6 from the [Spark downloads > page|http://spark.apache.org/downloads.html]. > # Add {{localhost}} to your {{slaves}} file and {{start-all.sh}} > # Fire up PySpark and try reading from S3 with something like this: > {code}sc.textFile('s3n://bucket/file_*').count(){code} > # You will get an error like this: > {code}py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.collectAndServe. > : java.io.IOException: No FileSystem for scheme: s3n{code} > {{file:///...}} works. 
> Spark 1.3.1 prebuilt for Hadoop 2.4 works. Spark 1.3.0 works.
> It's just the combination of Spark 1.3.1 prebuilt for Hadoop 2.6 accessing S3 that doesn't work.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
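The workaround from the comment above can be rolled into a single launch command. This is a hedged sketch, not a verified fix: the `--jars` paths are illustrative placeholders, and the two `spark.hadoop.fs.*.impl` settings are simply the parameters quoted in the comment.

```shell
# Workaround sketch for "No FileSystem for scheme: s3n" on the
# Spark 1.3.1 / Hadoop 2.6 prebuilt package: name the S3 filesystem
# implementations explicitly via spark.hadoop.* passthrough properties.
# The --jars paths are placeholders; point them at your local copies of
# the hadoop-aws and aws-java-sdk jars.
./bin/spark-shell \
  --jars /path/to/hadoop-aws.jar,/path/to/aws-java-sdk.jar \
  --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem \
  --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem
```

Note that, per the comment, this may still fail if the worker's classloader ignores the filesystem implementation shipped in the added jar.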
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557549#comment-14557549 ] Peng Cheng commented on SPARK-7481:
---
I've tried to do it but I got a lot of headaches, as the AWS toolkit is using an outdated Jackson library. This feature is indeed blocking me from upgrading to Hadoop 2.6, so I guess it's important.
> Add Hadoop 2.6+ profile to pull in object store FS accessors
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 1.3.1
> Reporter: Steve Loughran
>
> To keep the s3n classpath right, and to add s3a, swift & azure, the dependencies of spark in a 2.6+ profile need to add the relevant object store packages (hadoop-aws, hadoop-openstack, hadoop-azure)
> this adds more stuff to the client bundle, but will mean a single spark package can talk to all of the stores.
[jira] [Commented] (SPARK-7043) KryoSerializer cannot be used with REPL to interpret code in which case class definition and its shipping are in the same line
[ https://issues.apache.org/jira/browse/SPARK-7043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507318#comment-14507318 ] Peng Cheng commented on SPARK-7043:
---
Sorry, I didn't read all the issues. Some of them may have the same cause.
> KryoSerializer cannot be used with REPL to interpret code in which case class definition and its shipping are in the same line
> --
>
> Key: SPARK-7043
> URL: https://issues.apache.org/jira/browse/SPARK-7043
> Project: Spark
> Issue Type: Bug
> Components: Spark Shell
> Affects Versions: 1.3.1
> Environment: Ubuntu 14.04, no hadoop
> Reporter: Peng Cheng
> Priority: Minor
> Labels: classloader, kryo
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> When deploying Spark-shell with the "spark.serializer=org.apache.spark.serializer.KryoSerializer" option, Spark-shell cannot execute the following code (in one line):
> case class Foo(i: Int);val ret = sc.parallelize((1 to 100).map(Foo), 10).collect()
> This problem does not exist for either JavaSerializer or code split into two lines. The only possible explanation is that KryoSerializer is using a ClassLoader that is not registered as a subsidiary ClassLoader of the one in the REPL.
> A "dirty" fix would be just breaking input by semicolon, but it's better to fix the ClassLoader to avoid other liabilities.
[jira] [Created] (SPARK-7043) KryoSerializer cannot be used with REPL to interpret code in which case class definition and its shipping are in the same line
Peng Cheng created SPARK-7043:
---
Summary: KryoSerializer cannot be used with REPL to interpret code in which case class definition and its shipping are in the same line
Key: SPARK-7043
URL: https://issues.apache.org/jira/browse/SPARK-7043
Project: Spark
Issue Type: Bug
Components: Spark Shell
Affects Versions: 1.3.1
Environment: Ubuntu 14.04, no hadoop
Reporter: Peng Cheng
When deploying Spark-shell with the "spark.serializer=org.apache.spark.serializer.KryoSerializer" option, Spark-shell cannot execute the following code (in one line):
case class Foo(i: Int);val ret = sc.parallelize((1 to 100).map(Foo), 10).collect()
This problem does not exist for either JavaSerializer or code split into two lines. The only possible explanation is that KryoSerializer is using a ClassLoader that is not registered as a subsidiary ClassLoader of the one in the REPL.
A "dirty" fix would be just breaking input by semicolon, but it's better to fix the ClassLoader to avoid other liabilities.
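For reproduction, the report boils down to the following session. This is a sketch assuming a standard Spark 1.3.x distribution layout; it requires a Spark installation and cannot run standalone.

```shell
# Launch the REPL with Kryo enabled, as in the report.
./bin/spark-shell --conf spark.serializer=org.apache.spark.serializer.KryoSerializer

# Inside the REPL, this single-line input fails under Kryo:
#   case class Foo(i: Int);val ret = sc.parallelize((1 to 100).map(Foo), 10).collect()
#
# ...while splitting it into two separate inputs works:
#   case class Foo(i: Int)
#   val ret = sc.parallelize((1 to 100).map(Foo), 10).collect()
```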
[jira] [Commented] (SPARK-4590) Early investigation of parameter server
[ https://issues.apache.org/jira/browse/SPARK-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385029#comment-14385029 ] Peng Cheng commented on SPARK-4590:
---
Hi Reza, that's great news. Can someone point us to the thread referring to `IndexedRDD`?
> Early investigation of parameter server
> ---
>
> Key: SPARK-4590
> URL: https://issues.apache.org/jira/browse/SPARK-4590
> Project: Spark
> Issue Type: Brainstorming
> Components: ML, MLlib
> Reporter: Xiangrui Meng
> Assignee: Reza Zadeh
>
> In the current implementation of GLM solvers, we save intermediate models on the driver node and update them through broadcast and aggregation. Even with torrent broadcast and tree aggregation added in 1.1, it is hard to go beyond ~10 million features. This JIRA is for investigating the parameter server approach, including algorithm, infrastructure, and dependencies.
[jira] [Commented] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact
[ https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349760#comment-14349760 ] Peng Cheng commented on SPARK-4925:
---
Me too, though the patch is applied in 1.2.1. We just need the last step.
> Publish Spark SQL hive-thriftserver maven artifact
> ---
>
> Key: SPARK-4925
> URL: https://issues.apache.org/jira/browse/SPARK-4925
> Project: Spark
> Issue Type: Improvement
> Components: Build, SQL
> Affects Versions: 1.2.0
> Reporter: Alex Liu
> Fix For: 1.3.0, 1.2.1
>
> The hive-thriftserver maven artifact is needed for integrating Spark SQL with Cassandra.
> Can we publish it to maven?
[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263681#comment-14263681 ] Peng Cheng commented on SPARK-4923:
---
You are right; in fact 'Dev's API' simply means the method is susceptible to changes without deprecation or notice, which the three main markings are least likely to undergo. Could you please edit the patch and add more markings?
> Maven build should keep publishing spark-repl
> -
>
> Key: SPARK-4923
> URL: https://issues.apache.org/jira/browse/SPARK-4923
> Project: Spark
> Issue Type: Bug
> Components: Build, Spark Shell
> Affects Versions: 1.2.0
> Reporter: Peng Cheng
> Priority: Critical
> Labels: shell
> Attachments: SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch
>
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> Spark-repl installation and deployment has been discontinued (see SPARK-3452). But it's in the dependency list of a few projects that extend its initialization process.
> Please remove the 'skip' setting in spark-repl and make it an 'official' API to encourage more platforms to integrate with it.
[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261185#comment-14261185 ] Peng Cheng commented on SPARK-4923:
---
Hey, a guy waiting for it was complaining that the version number of spark-repl_2.10 on the Typesafe repo is actually 1.2.0-SNAPSHOT, not the released 1.2.0. Was this done on purpose?
[jira] [Updated] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Cheng updated SPARK-4923:
---
Target Version/s: 1.3.0, 1.2.1 (was: 1.3.0)
[jira] [Updated] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Cheng updated SPARK-4923:
---
Attachment: SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch
Thank you so much! First patch uploaded.
[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256322#comment-14256322 ] Peng Cheng commented on SPARK-4923:
---
Sorry, my project is https://github.com/tribbloid/ISpark
[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256319#comment-14256319 ] Peng Cheng commented on SPARK-4923:
---
Hey Patrick,
The following APIs have been integrated since 1.0.0; IMHO they are stable enough for daily prototyping. Creating case classes used to be defective but was fixed a long time ago.
SparkILoop.getAddedJars()
$SparkIMain.bind
$SparkIMain.quietBind
$SparkIMain.interpret
end of list :)
At first I assumed that further development on it had been moved to Databricks Cloud. But the JIRA ticket was already there in September, so maybe demand for this API from the community is indeed low enough. However, I would still suggest keeping it, and even promoting it to a Developer's API; this would encourage more projects to integrate in a more flexible way, and save prototyping/QA cost by customizing fixtures of the REPL. People will still move to Databricks Cloud, which has far more features than that.
Many influential projects already depend on the routinely published Scala REPL (e.g. Play Framework); it would be strange for Spark not to do the same. What do you think?
[jira] [Created] (SPARK-4923) Maven build should keep publishing spark-repl
Peng Cheng created SPARK-4923:
---
Summary: Maven build should keep publishing spark-repl
Key: SPARK-4923
URL: https://issues.apache.org/jira/browse/SPARK-4923
Project: Spark
Issue Type: Bug
Components: Spark Shell
Affects Versions: 1.2.0
Reporter: Peng Cheng
Priority: Critical
Spark-repl installation and deployment has been discontinued (see SPARK-3452). But it's in the dependency list of a few projects that extend its initialization process.
Please remove the 'skip' setting in spark-repl and make it an 'official' API to encourage more platforms to integrate with it.
[jira] [Commented] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on
[ https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256228#comment-14256228 ] Peng Cheng commented on SPARK-3452:
---
IMHO the REPL should keep being published; one of my projects extends its API to initialize some third-party components upon launching. This should be made an 'official' API to encourage more platforms to integrate with it.
> Maven build should skip publishing artifacts people shouldn't depend on
> ---
>
> Key: SPARK-3452
> URL: https://issues.apache.org/jira/browse/SPARK-3452
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 1.0.0, 1.1.0
> Reporter: Patrick Wendell
> Assignee: Prashant Sharma
> Priority: Critical
> Fix For: 1.2.0
>
> I think it's easy to do this by just adding a skip configuration somewhere.
> We shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples.
[jira] [Commented] (SPARK-1270) An optimized gradient descent implementation
[ https://issues.apache.org/jira/browse/SPARK-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156964#comment-14156964 ] Peng Cheng commented on SPARK-1270:
---
Yo, any follow-up story on this one? I'm curious to know about the local update part, as DistBelief has non-local model server shards.
> An optimized gradient descent implementation
> ---
>
> Key: SPARK-1270
> URL: https://issues.apache.org/jira/browse/SPARK-1270
> Project: Spark
> Issue Type: Improvement
> Affects Versions: 1.0.0
> Reporter: Xusen Yin
> Labels: GradientDescent, MLLib
> Fix For: 1.0.0
>
> The current implementation of GradientDescent is inefficient in some aspects, especially in a high-latency network. I propose a new implementation of GradientDescent, which follows a parallelism model called GradientDescentWithLocalUpdate, inspired by Jeff Dean's DistBelief and Eric Xing's SSP. With a few modifications of runMiniBatchSGD, GradientDescentWithLocalUpdate can outperform the original sequential version by about 4x without sacrificing accuracy, and can be easily adopted by most classification and regression algorithms in MLlib.
[jira] [Created] (SPARK-2083) Allow local task to retry after failure.
Peng Cheng created SPARK-2083:
---
Summary: Allow local task to retry after failure.
Key: SPARK-2083
URL: https://issues.apache.org/jira/browse/SPARK-2083
Project: Spark
Issue Type: Improvement
Components: Deploy
Affects Versions: 1.0.0
Reporter: Peng Cheng
Priority: Trivial
If a job is submitted to run locally using masterURL = "local[X]", Spark will not retry a failed task regardless of your "spark.task.maxFailures" setting. This design is meant to facilitate debugging and QA of Spark applications where all tasks are expected to succeed and yield a result. Unfortunately, such a setting will prevent a local job from finishing if any of its tasks cannot guarantee a result (e.g. one visiting an external resource/API), and retrying inside the task is less favoured (e.g. the task needs to be executed on a different computer in production).
The user can still set masterURL = "local[X,Y]" to override this (where Y is the local maxFailures), but this is not documented and hard to manage.
A quick fix would be to add a new configuration property "spark.local.maxFailures" with a default value of 1, so users know exactly what to change when reading the documentation.
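The existing override and the proposed property can be sketched as follows. Note that `spark.local.maxFailures` is only this ticket's proposal, not an existing Spark property, and the thread/attempt counts are illustrative.

```shell
# Undocumented override mentioned in the ticket: "local[N,M]" runs N worker
# threads and allows up to M attempts per task before the job fails.
./bin/spark-shell --master "local[4,4]"

# The ticket's proposed alternative (hypothetical property, default 1):
# ./bin/spark-shell --master "local[4]" --conf spark.local.maxFailures=4
```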