[jira] [Created] (SPARK-38009) In start-thriftserver.sh arguments, "--hiveconf xxx" should have higher precedence over "--conf spark.hadoop.xxx", or any other hadoop configurations

2022-01-24 Thread Peng Cheng (Jira)
Peng Cheng created SPARK-38009:
--

 Summary: In start-thriftserver.sh arguments, "--hiveconf xxx" 
should have higher precedence over "--conf spark.hadoop.xxx", or any other 
hadoop configurations
 Key: SPARK-38009
 URL: https://issues.apache.org/jira/browse/SPARK-38009
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0, 2.4.8
 Environment: The experiment described below was conducted on Apache Spark 2.4.7 and 
3.2.0.

 

OS: Ubuntu 20.04

Java: OpenJDK1.8.0

 
Reporter: Peng Cheng


By convention, an Apache Hive server reads configuration options from 
different sources with different precedence, and the precedence of "--hiveconf" 
command-line options should only be lower than options set with the SET command (see 
[https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration] 
for details). It should be higher than the Hadoop configuration, or any of the 
configuration files on the server (including, but not limited to, hive-site.xml 
and core-site.xml).

This convention is clearly not maintained by the Apache Spark Thrift server, 
as demonstrated in the following example. If I start the server with 
diverging values for "hive.server2.thrift.port":

 

```
./sbin/start-thriftserver.sh \
--conf spark.hadoop.hive.server2.thrift.port=10001 \
--hiveconf hive.server2.thrift.port=10002
```

 

"–conf"/port 10001 will be preferred over "–hiveconf"/port 10002:

 

```

Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/bin/java -cp 
/home/xxx/spark-2.4.7-bin-hadoop2.7-scala2.12/conf/:/home/xxx/spark-2.4.7-bin-hadoop2.7-scala2.12/jars/*
 -Xmx1g org.apache.spark.deploy.SparkSubmit --conf 
spark.hadoop.hive.server2.thrift.port=10001 --class 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --name Thrift 
JDBC/ODBC Server spark-internal --hiveconf hive.server2.thrift.port=10002

...
22/01/24 17:32:18 INFO ThriftCLIService: Starting ThriftBinaryCLIService on 
port 10001 with 5...500 worker threads

```

 

replacing "--conf" line with an entry in core-site.xml makes no difference.

I doubt that this divergence from conventional Hive server behaviour is 
deliberate. I'm therefore asking for the precedence of Hive configuration options to 
be made on par with, or maximally similar to, that of an Apache Hive server of the 
same version. To my knowledge, it should be:

 

SET command > --hiveconf > hive-site.xml > hive-default.xml > --conf > 
core-site.xml > core-default.xml
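
The sketch below is not part of the original report; it is a minimal illustration, assuming a plain spark-shell launched on the same installation, of how a "--conf spark.hadoop.xxx" entry surfaces in the Hadoop Configuration that the Thrift server later reads, which is presumably why it currently wins over "--hiveconf":

```scala
import org.apache.spark.sql.SparkSession

// Assumed launch: ./bin/spark-shell --conf spark.hadoop.hive.server2.thrift.port=10001
val spark = SparkSession.builder().getOrCreate()

// "spark.hadoop.*" entries are copied into the Hadoop Configuration with the prefix
// stripped, so this prints 10001 even though no Hadoop config file sets the key.
println(spark.sparkContext.hadoopConfiguration.get("hive.server2.thrift.port"))
```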



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32992) In OracleDialect, "RowID" SQL type should be converted into "String" Catalyst type

2020-09-24 Thread Peng Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated SPARK-32992:
---
Description: 
Most JDBC drivers use a long SQL type for the dataset row ID:

 

(in org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils)
{code:java}
private def getCatalystType(
    sqlType: Int,
    precision: Int,
    scale: Int,
    signed: Boolean): DataType = {
  val answer = sqlType match {
    // scalastyle:off
    ...
    case java.sql.Types.ROWID => LongType
    ...
    case _ =>
      throw new SQLException("Unrecognized SQL type " + sqlType)
    // scalastyle:on
  }
  if (answer == null) {
    throw new SQLException("Unsupported type " + JDBCType.valueOf(sqlType).getName)
  }
  answer
}
{code}
 

Oracle JDBC drivers (of all versions) are a rare exception: only a String value can 
be extracted:

 

(in oracle.jdbc.driver.RowidAccessor, decompiled bytecode)
{code:java}
...
String getString(int var1) throws SQLException {
  return this.isNull(var1) ? null : this.rowData.getString(this.getOffset(var1),
    this.getLength(var1), this.statement.connection.conversion.getCharacterSet((short)1));
}

Object getObject(int var1) throws SQLException {
  return this.getROWID(var1);
}
...
{code}
 

This caused an exception to be thrown when importing datasets from an Oracle 
DB, as reported in 
[https://stackoverflow.com/questions/52244492/spark-jdbc-dataframereader-fails-to-read-oracle-table-with-datatype-as-rowid]:
{code:java}
18/09/08 11:38:17 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 5.0 (TID 23, gbrdsr02985.intranet.barcapint.com, executor 21): java.sql.SQLException: Invalid column type: getLong not implemented for class oracle.jdbc.driver.T4CRowidAccessor
  at oracle.jdbc.driver.GeneratedAccessor.getLong(GeneratedAccessor.java:440)
  at oracle.jdbc.driver.GeneratedStatement.getLong(GeneratedStatement.java:228)
  at oracle.jdbc.driver.GeneratedScrollableResultSet.getLong(GeneratedScrollableResultSet.java:620)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:365)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:364)
{code}
 

Therefore, the default SQL type => Catalyst type conversion rule should be 
overridden in OracleDialect. Specifically, the following rule should be added:
{code:java}
case Types.ROWID => Some(StringType)
{code}
 

  was:
Most JDBC drivers use long SQL type for dataset row ID:

 

(in org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils)

 
{code:java}
private def getCatalystType(
 sqlType: Int,
 precision: Int,
 scale: Int,
 signed: Boolean): DataType = {
 val answer = sqlType match {
 // scalastyle:off
 ...
 case java.sql.Types.ROWID => LongType
...
 case _ =>
 throw new SQLException("Unrecognized SQL type " + sqlType)
 // scalastyle:on
 }
if (answer == null)
{ throw new SQLException("Unsupported type " + 
JDBCType.valueOf(sqlType).getName) }
answer
{code}
 

 

Oracle JDBC drivers (of all versions) are rare exception, only String value can 
be extracted:

 

(in oracle.jdbc.driver.RowidAccessor, decompiled bytecode)
{code:java}
...
String getString(int var1) throws SQLException
{ return this.isNull(var1) ? null : 
this.rowData.getString(this.getOffset(var1), this.getLength(var1), 
this.statement.connection.conversion.getCharacterSet((short)1)); }
Object getObject(int var1) throws SQLException
{ return this.getROWID(var1); }
...
{code}
 

 

 

This caused an exception to be thrown when importing datasets from an Oracle 
DB, as reported in 
[https://stackoverflow.com/questions/52244492/spark-jdbc-dataframereader-fails-to-read-oracle-table-with-datatype-as-rowid:]
{code:java}
 
 {{18/09/08 11:38:17 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 5.0 
(TID 23, gbrdsr02985.intranet.barcapint.com, executor 21): 
java.sql.SQLException: Invalid column type: getLong not implemented for class 
oracle.jdbc.driver.T4CRowidAccessor at 
oracle.jdbc.driver.GeneratedAccessor.getLong(GeneratedAccessor.java:440)
 at oracle.jdbc.driver.GeneratedStatement.getLong(GeneratedStatement.java:228)
 at 
oracle.jdbc.driver.GeneratedScrollableResultSet.getLong(GeneratedScrollableResultSet.java:620)
 at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:365)
 at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:364)}}
 
{code}
 

Therefore, the default SQL type => Catalyst type conversion rule should be 
overriden in OracleDialect. Specifically, the following rule should be added:

 
{code:java}
case Types.ROWID => Some(StringType)
{code}
 


> In OracleDialect, "RowID" SQL type should be

[jira] [Updated] (SPARK-32992) In OracleDialect, "RowID" SQL type should be converted into "String" Catalyst type

2020-09-24 Thread Peng Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated SPARK-32992:
---
Description: 
Most JDBC drivers use a long SQL type for the dataset row ID:

 

(in org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils)

 
{code:java}
private def getCatalystType(
    sqlType: Int,
    precision: Int,
    scale: Int,
    signed: Boolean): DataType = {
  val answer = sqlType match {
    // scalastyle:off
    ...
    case java.sql.Types.ROWID => LongType
    ...
    case _ =>
      throw new SQLException("Unrecognized SQL type " + sqlType)
    // scalastyle:on
  }
  if (answer == null) {
    throw new SQLException("Unsupported type " + JDBCType.valueOf(sqlType).getName)
  }
  answer
}
{code}
 

 

Oracle JDBC drivers (of all versions) are a rare exception: only a String value can 
be extracted:

 

(in oracle.jdbc.driver.RowidAccessor, decompiled bytecode)
{code:java}
...
String getString(int var1) throws SQLException {
  return this.isNull(var1) ? null : this.rowData.getString(this.getOffset(var1),
    this.getLength(var1), this.statement.connection.conversion.getCharacterSet((short)1));
}

Object getObject(int var1) throws SQLException {
  return this.getROWID(var1);
}
...
{code}
 

 

 

This caused an exception to be thrown when importing datasets from an Oracle 
DB, as reported in 
[https://stackoverflow.com/questions/52244492/spark-jdbc-dataframereader-fails-to-read-oracle-table-with-datatype-as-rowid]:
{code:java}
18/09/08 11:38:17 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 5.0 (TID 23, gbrdsr02985.intranet.barcapint.com, executor 21): java.sql.SQLException: Invalid column type: getLong not implemented for class oracle.jdbc.driver.T4CRowidAccessor
  at oracle.jdbc.driver.GeneratedAccessor.getLong(GeneratedAccessor.java:440)
  at oracle.jdbc.driver.GeneratedStatement.getLong(GeneratedStatement.java:228)
  at oracle.jdbc.driver.GeneratedScrollableResultSet.getLong(GeneratedScrollableResultSet.java:620)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:365)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:364)
{code}
 

Therefore, the default SQL type => Catalyst type conversion rule should be 
overridden in OracleDialect. Specifically, the following rule should be added:

 
{code:java}
case Types.ROWID => Some(StringType)
{code}
 

  was:
Most JDBC drivers use long SQL type for dataset row ID:

 

(in org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils)

 

```scala

private def getCatalystType(
 sqlType: Int,
 precision: Int,
 scale: Int,
 signed: Boolean): DataType = {
 val answer = sqlType match {
 // scalastyle:off
...
 {color:#de350b}case java.sql.Types.ROWID => LongType{color}

...
 case _ =>
 throw new SQLException("Unrecognized SQL type " + sqlType)
 // scalastyle:on
 }

 if (answer == null) {
 throw new SQLException("Unsupported type " + JDBCType.valueOf(sqlType).getName)
 }
 answer
}

```

 

Oracle JDBC drivers (of all versions) are rare exception, only String value can 
be extracted:

 

(in oracle.jdbc.driver.RowidAccessor, decompiled bytecode)

 

```java

...

String getString(int var1) throws SQLException {
 return this.isNull(var1) ? null : this.rowData.getString(this.getOffset(var1), 
this.getLength(var1), 
this.statement.connection.conversion.getCharacterSet((short)1));
}

Object getObject(int var1) throws SQLException {
 return this.getROWID(var1);
}

...

```

 

This caused the common exception to be reported in 
[https://stackoverflow.com/questions/52244492/spark-jdbc-dataframereader-fails-to-read-oracle-table-with-datatype-as-rowid:]

 

```
 
{{18/09/08 11:38:17 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 5.0 
(TID 23, gbrdsr02985.intranet.barcapint.com, executor 21): 
java.sql.SQLException: Invalid column type: getLong not implemented for class 
oracle.jdbc.driver.T4CRowidAccessorat 
oracle.jdbc.driver.GeneratedAccessor.getLong(GeneratedAccessor.java:440)
at 
oracle.jdbc.driver.GeneratedStatement.getLong(GeneratedStatement.java:228)
at 
oracle.jdbc.driver.GeneratedScrollableResultSet.getLong(GeneratedScrollableResultSet.java:620)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:365)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:364)}}

```

 

Therefore, the default SQL type => Catalyst type conversion rule should be 
overriden in OracleDialect. Specifically, the following rule should be added:

 

```
case Types.ROWID => Some(StringType)

```


> In OracleDialect, "RowID" SQL type should be converted i

[jira] [Created] (SPARK-32992) In OracleDialect, "RowID" SQL type should be converted into "String" Catalyst type

2020-09-24 Thread Peng Cheng (Jira)
Peng Cheng created SPARK-32992:
--

 Summary: In OracleDialect, "RowID" SQL type should be converted 
into "String" Catalyst type
 Key: SPARK-32992
 URL: https://issues.apache.org/jira/browse/SPARK-32992
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.7, 3.1.0
Reporter: Peng Cheng


Most JDBC drivers use a long SQL type for the dataset row ID:

 

(in org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils)

 

```scala

private def getCatalystType(
 sqlType: Int,
 precision: Int,
 scale: Int,
 signed: Boolean): DataType = {
 val answer = sqlType match {
 // scalastyle:off
...
 {color:#de350b}case java.sql.Types.ROWID => LongType{color}

...
 case _ =>
 throw new SQLException("Unrecognized SQL type " + sqlType)
 // scalastyle:on
 }

 if (answer == null) {
 throw new SQLException("Unsupported type " + JDBCType.valueOf(sqlType).getName)
 }
 answer
}

```

 

Oracle JDBC drivers (of all versions) are a rare exception: only a String value can 
be extracted:

 

(in oracle.jdbc.driver.RowidAccessor, decompiled bytecode)

 

```java

...

String getString(int var1) throws SQLException {
 return this.isNull(var1) ? null : this.rowData.getString(this.getOffset(var1), 
this.getLength(var1), 
this.statement.connection.conversion.getCharacterSet((short)1));
}

Object getObject(int var1) throws SQLException {
 return this.getROWID(var1);
}

...

```

 

This caused the common exception reported in 
[https://stackoverflow.com/questions/52244492/spark-jdbc-dataframereader-fails-to-read-oracle-table-with-datatype-as-rowid]:

 

```
18/09/08 11:38:17 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 5.0 (TID 23, gbrdsr02985.intranet.barcapint.com, executor 21): java.sql.SQLException: Invalid column type: getLong not implemented for class oracle.jdbc.driver.T4CRowidAccessor
  at oracle.jdbc.driver.GeneratedAccessor.getLong(GeneratedAccessor.java:440)
  at oracle.jdbc.driver.GeneratedStatement.getLong(GeneratedStatement.java:228)
  at oracle.jdbc.driver.GeneratedScrollableResultSet.getLong(GeneratedScrollableResultSet.java:620)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:365)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:364)
```

 

Therefore, the default SQL type => Catalyst type conversion rule should be 
overridden in OracleDialect. Specifically, the following rule should be added:

 

```
case Types.ROWID => Some(StringType)

```
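
Until such a rule lands in the built-in OracleDialect, the snippet below is a minimal user-side sketch (not part of the original report) of the same idea: registering a custom JdbcDialect so that Oracle ROWID columns are read as strings. The dialect name and the URL check are illustrative assumptions.

```scala
import java.sql.Types

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}

// Hypothetical workaround dialect: map java.sql.Types.ROWID to StringType so that
// JdbcUtils builds a String getter instead of calling getLong on the ROWID column.
object OracleRowIdDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    if (sqlType == Types.ROWID) Some(StringType) else None
}

// Register before creating the JDBC DataFrame; registered dialects are consulted
// before the built-in ones for URLs they can handle.
JdbcDialects.registerDialect(OracleRowIdDialect)
```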



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27025) Speed up toLocalIterator

2019-11-27 Thread Peng Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983635#comment-16983635
 ] 

Peng Cheng commented on SPARK-27025:


Am I too late for this issue?

I submitted SPARK-29852. Do you think it is a viable solution, [~srowen]?

> Speed up toLocalIterator
> 
>
> Key: SPARK-27025
> URL: https://issues.apache.org/jira/browse/SPARK-27025
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.3.3
>Reporter: Erik van Oosten
>Priority: Major
>
> Method {{toLocalIterator}} fetches the partitions to the driver one by one. 
> However, as far as I can see, any required computation for the 
> yet-to-be-fetched-partitions is not kicked off until it is fetched. 
> Effectively only one partition is being computed at the same time. 
> Desired behavior: immediately start calculation of all partitions while 
> retaining the download-a-partition at a time behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29852) Implement parallel preemptive RDD.toLocalIterator and Dataset.toLocalIterator

2019-11-11 Thread Peng Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated SPARK-29852:
---
Issue Type: Improvement  (was: New Feature)

> Implement parallel preemptive RDD.toLocalIterator and Dataset.toLocalIterator
> -
>
> Key: SPARK-29852
> URL: https://issues.apache.org/jira/browse/SPARK-29852
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Peng Cheng
>Priority: Major
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Both RDD and Dataset APIs have 2 methods of collecting data from executors to 
> driver:
>  
>  # .collect() setup multiple threads in a job and dump all data from executor 
> into drivers memory. This is great if data on driver needs to be accessible 
> ASAP, but not as efficient if access to partitions can only happen 
> sequentially, and outright risky if driver doesn't have enough memory to hold 
> all data.
> - the solution for issue SPARK-25224 partially alleviate this by delaying 
> deserialisation of data in InternalRow format, such that only the much 
> smaller serialised data needs to be entirely hold by driver memory. This 
> solution does not abide O(1) memory consumption, thus does not scale to 
> arbitrarily large dataset
>  # .toLocalIterator() fetch one partition in 1 job at a time, and fetching of 
> the next partition does not start until sequential access to previous 
> partition has concluded. This action abides O(1) memory consumption and is 
> great if access to data is sequential and significantly slower than the speed 
> where partitions can be shipped from a single executor, with 1 thread. It 
> becomes inefficient when the sequential access to data has to wait for a 
> relatively long time for the shipping of the next partition
> The proposed solution is a crossover between two existing implementations: a 
> concurrent subroutine that is both CPU and memory bounded. The solution 
> allocate a fixed sized resource pool (by default = number of available CPU 
> cores) that serves the shipping of partitions concurrently, and block 
> sequential access to partitions' data until shipping is finished (which 
> usually happens without blocking for partitionID >=2 due to the fact that 
> shipping start much earlier and preemptively). Tenants of the resource pool 
> can be GC'ed and evicted once sequential access to it's data has finished, 
> which allows more partitions to be fetched much earlier than they are 
> accessed. The maximum memory consumption is O(m * n), where m is the 
> predefined concurrency and n is the size of the largest partition.
> The following scala code snippet demonstrates a simple implementation:
>  
> (requires scala 2.11 + and ScalaTests)
>  
> {code:java}
> package org.apache.spark.spike
> import java.util.concurrent.ArrayBlockingQueue
> import org.apache.spark.rdd.RDD
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.{FutureAction, SparkContext}
> import org.scalatest.FunSpec
> import scala.concurrent.Future
> import scala.language.implicitConversions
> import scala.reflect.ClassTag
> import scala.util.{Failure, Success, Try}
> class ToLocalIteratorPreemptivelySpike extends FunSpec {
>   import ToLocalIteratorPreemptivelySpike._
>   lazy val sc: SparkContext = 
> SparkSession.builder().master("local[*]").getOrCreate().sparkContext
>   it("can be much faster than toLocalIterator") {
> val max = 80
> val delay = 100
> val slowRDD = sc.parallelize(1 to max, 8).map { v =>
>   Thread.sleep(delay)
>   v
> }
> val (r1, t1) = timed {
>   slowRDD.toLocalIterator.toList
> }
> val capacity = 4
> val (r2, t2) = timed {
>   slowRDD.toLocalIteratorPreemptively(capacity).toList
> }
> assert(r1 == r2)
> println(s"linear: $t1, preemptive: $t2")
> assert(t1 > t2 * 2)
> assert(t2 > max * delay / capacity)
>   }
> }
> object ToLocalIteratorPreemptivelySpike {
>   case class PartitionExecution[T: ClassTag](
>   @transient self: RDD[T],
>   id: Int
>   ) {
> def eager: this.type = {
>   AsArray.future
>   this
> }
> case object AsArray {
>   @transient lazy val future: FutureAction[Array[T]] = {
> var result: Array[T] = null
> val future = self.context.submitJob[T, Array[T], Array[T]](
>   self,
>   _.toArray,
>   Seq(id), { (_, data) =>
> result = data
>   },
>   result
> )
> future
>   }
>   @transient lazy val now: Array[T] = future.get()
> }
>   }
>   implicit class RDDFunctions[T: ClassTag](self: RDD[T]) {
> import scala.concurrent.ExecutionContext.Implicits.global
> def _toLocalIteratorPreemptively(cap

[jira] [Created] (SPARK-29852) Implement parallel preemptive RDD.toLocalIterator and Dataset.toLocalIterator

2019-11-11 Thread Peng Cheng (Jira)
Peng Cheng created SPARK-29852:
--

 Summary: Implement parallel preemptive RDD.toLocalIterator and 
Dataset.toLocalIterator
 Key: SPARK-29852
 URL: https://issues.apache.org/jira/browse/SPARK-29852
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, SQL
Affects Versions: 2.4.4, 3.0.0
Reporter: Peng Cheng


Both the RDD and Dataset APIs have 2 methods of collecting data from executors to 
the driver:

 
 # .collect() sets up multiple threads in a job and dumps all data from the executors 
into the driver's memory. This is great if the data on the driver needs to be accessible 
ASAP, but not as efficient if partitions can only be accessed 
sequentially, and outright risky if the driver doesn't have enough memory to hold 
all the data.

- The solution for SPARK-25224 partially alleviates this by delaying 
deserialisation of data in InternalRow format, so that only the much smaller 
serialised data needs to be held entirely in driver memory. This solution does 
not keep memory consumption at O(1), and thus does not scale to arbitrarily large 
datasets.
 # .toLocalIterator() fetches one partition in one job at a time, and fetching of 
the next partition does not start until sequential access to the previous partition 
has concluded. This keeps memory consumption at O(1) and is great if 
access to the data is sequential and significantly slower than the rate at which 
partitions can be shipped from a single executor with one thread. It becomes 
inefficient when sequential access to the data has to wait a relatively 
long time for the next partition to be shipped.

The proposed solution is a crossover between the two existing implementations: a 
concurrent subroutine that is both CPU and memory bounded. It 
allocates a fixed-size resource pool (by default equal to the number of available CPU 
cores) that serves the shipping of partitions concurrently, and blocks 
sequential access to a partition's data until its shipping has finished (which usually 
happens without blocking for partitionID >= 2, since shipping 
starts much earlier and preemptively). Tenants of the resource pool can be GC'ed 
and evicted once sequential access to their data has finished, which allows more 
partitions to be fetched much earlier than they are accessed. The maximum 
memory consumption is O(m * n), where m is the predefined concurrency and n is 
the size of the largest partition.

The following Scala code snippet demonstrates a simple implementation (requires Scala 2.11+ and ScalaTest):

{code:java}
package org.apache.spark.spike

import java.util.concurrent.ArrayBlockingQueue

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.{FutureAction, SparkContext}
import org.scalatest.FunSpec

import scala.concurrent.Future
import scala.language.implicitConversions
import scala.reflect.ClassTag
import scala.util.{Failure, Success, Try}

class ToLocalIteratorPreemptivelySpike extends FunSpec {

  import ToLocalIteratorPreemptivelySpike._

  lazy val sc: SparkContext = 
SparkSession.builder().master("local[*]").getOrCreate().sparkContext

  it("can be much faster than toLocalIterator") {

val max = 80
val delay = 100

val slowRDD = sc.parallelize(1 to max, 8).map { v =>
  Thread.sleep(delay)
  v
}

val (r1, t1) = timed {
  slowRDD.toLocalIterator.toList
}

val capacity = 4
val (r2, t2) = timed {
  slowRDD.toLocalIteratorPreemptively(capacity).toList
}

assert(r1 == r2)
println(s"linear: $t1, preemptive: $t2")
assert(t1 > t2 * 2)
assert(t2 > max * delay / capacity)
  }
}

object ToLocalIteratorPreemptivelySpike {

  case class PartitionExecution[T: ClassTag](
  @transient self: RDD[T],
  id: Int
  ) {

def eager: this.type = {
  AsArray.future
  this
}

case object AsArray {

  @transient lazy val future: FutureAction[Array[T]] = {
var result: Array[T] = null

val future = self.context.submitJob[T, Array[T], Array[T]](
  self,
  _.toArray,
  Seq(id), { (_, data) =>
result = data
  },
  result
)

future
  }

  @transient lazy val now: Array[T] = future.get()
}
  }

  implicit class RDDFunctions[T: ClassTag](self: RDD[T]) {

import scala.concurrent.ExecutionContext.Implicits.global

def _toLocalIteratorPreemptively(capacity: Int): Iterator[Array[T]] = {
  val executions = self.partitions.indices.map { ii =>
PartitionExecution(self, ii)
  }

  val buffer = new ArrayBlockingQueue[Try[PartitionExecution[T]]](capacity)

  Future {
executions.foreach { exe =>
  buffer.put(Success(exe)) // may be blocking due to capacity
  exe.eager // non-blocking
}
  }.onFailure {
case e: Throwable =>
  buffer.put(Failure(e))
  }

  self

[jira] [Commented] (SPARK-6473) Launcher lib shouldn't try to figure out Scala version when not in dev mode

2017-12-05 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16279355#comment-16279355
 ] 

Peng Cheng commented on SPARK-6473:
---

Looks like this issue reappeared at some point:

getScalaVersion() is called anyway, even when not in dev/test mode:

https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java#L196

The issue was suppressed simply because SPARK_SCALA_VERSION is always set by the shell 
script.

Should we add a test to ensure that this never happens?

> Launcher lib shouldn't try to figure out Scala version when not in dev mode
> ---
>
> Key: SPARK-6473
> URL: https://issues.apache.org/jira/browse/SPARK-6473
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.4.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 1.4.0
>
>
> Thanks to [~nravi] for pointing this out.
> The launcher library currently always tries to figure out what's the build's 
> scala version, even when it's not needed. That code is only used when setting 
> some dev options, and relies on the layout of the build directories, so it 
> doesn't work with the directory layout created by make-distribution.sh.
> Right now this works on Linux because bin/load-spark-env.sh sets the Scala 
> version explicitly, but it would break the distribution on Windows, for 
> example.
> Fix is pretty straight-forward.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors

2016-03-01 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174442#comment-15174442
 ] 

Peng Cheng commented on SPARK-7481:
---

+1 Me four

> Add Hadoop 2.6+ profile to pull in object store FS accessors
> 
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.3.1
>Reporter: Steve Loughran
>
> To keep the s3n classpath right, to add s3a, swift & azure, the dependencies 
> of spark in a 2.6+ profile need to add the relevant object store packages 
> (hadoop-aws, hadoop-openstack, hadoop-azure)
> this adds more stuff to the client bundle, but will mean a single spark 
> package can talk to all of the stores.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds unserializable objects into connection properties

2015-12-28 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073036#comment-15073036
 ] 

Peng Cheng commented on SPARK-10625:


I think the pull request has been merged. Can it be closed now?

> Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds 
> unserializable objects into connection properties
> --
>
> Key: SPARK-10625
> URL: https://issues.apache.org/jira/browse/SPARK-10625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
> Environment: Ubuntu 14.04
>Reporter: Peng Cheng
>  Labels: jdbc, spark, sparksql
>
> Some JDBC drivers (e.g. SAP HANA) tries to optimize connection pooling by 
> adding new objects into the connection properties, which is then reused by 
> Spark to be deployed to workers. When some of these new objects are unable to 
> be serializable it will trigger an org.apache.spark.SparkException: Task not 
> serializable. The following test code snippet demonstrate this problem by 
> using a modified H2 driver:
>   test("INSERT to JDBC Datasource with UnserializableH2Driver") {
> object UnserializableH2Driver extends org.h2.Driver {
>   override def connect(url: String, info: Properties): Connection = {
> val result = super.connect(url, info)
> info.put("unserializableDriver", this)
> result
>   }
>   override def getParentLogger: Logger = ???
> }
> import scala.collection.JavaConversions._
> val oldDrivers = 
> DriverManager.getDrivers.filter(_.acceptsURL("jdbc:h2:")).toSeq
> oldDrivers.foreach{
>   DriverManager.deregisterDriver
> }
> DriverManager.registerDriver(UnserializableH2Driver)
> sql("INSERT INTO TABLE PEOPLE1 SELECT * FROM PEOPLE")
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).count)
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", 
> properties).collect()(0).length)
> DriverManager.deregisterDriver(UnserializableH2Driver)
> oldDrivers.foreach{
>   DriverManager.registerDriver
> }
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds unserializable objects into connection properties

2015-12-15 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059390#comment-15059390
 ] 

Peng Cheng commented on SPARK-10625:


https://github.com/apache/spark/pull/8785

This one :)
The others have already been withdrawn; forgive me for not knowing the rules for pull 
requests.

> Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds 
> unserializable objects into connection properties
> --
>
> Key: SPARK-10625
> URL: https://issues.apache.org/jira/browse/SPARK-10625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
> Environment: Ubuntu 14.04
>Reporter: Peng Cheng
>  Labels: jdbc, spark, sparksql
>
> Some JDBC drivers (e.g. SAP HANA) tries to optimize connection pooling by 
> adding new objects into the connection properties, which is then reused by 
> Spark to be deployed to workers. When some of these new objects are unable to 
> be serializable it will trigger an org.apache.spark.SparkException: Task not 
> serializable. The following test code snippet demonstrate this problem by 
> using a modified H2 driver:
>   test("INSERT to JDBC Datasource with UnserializableH2Driver") {
> object UnserializableH2Driver extends org.h2.Driver {
>   override def connect(url: String, info: Properties): Connection = {
> val result = super.connect(url, info)
> info.put("unserializableDriver", this)
> result
>   }
>   override def getParentLogger: Logger = ???
> }
> import scala.collection.JavaConversions._
> val oldDrivers = 
> DriverManager.getDrivers.filter(_.acceptsURL("jdbc:h2:")).toSeq
> oldDrivers.foreach{
>   DriverManager.deregisterDriver
> }
> DriverManager.registerDriver(UnserializableH2Driver)
> sql("INSERT INTO TABLE PEOPLE1 SELECT * FROM PEOPLE")
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).count)
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", 
> properties).collect()(0).length)
> DriverManager.deregisterDriver(UnserializableH2Driver)
> oldDrivers.foreach{
>   DriverManager.registerDriver
> }
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds unserializable objects into connection properties

2015-11-27 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030259#comment-15030259
 ] 

Peng Cheng commented on SPARK-10625:


squash rebased on master

> Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds 
> unserializable objects into connection properties
> --
>
> Key: SPARK-10625
> URL: https://issues.apache.org/jira/browse/SPARK-10625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
> Environment: Ubuntu 14.04
>Reporter: Peng Cheng
>  Labels: jdbc, spark, sparksql
>
> Some JDBC drivers (e.g. SAP HANA) tries to optimize connection pooling by 
> adding new objects into the connection properties, which is then reused by 
> Spark to be deployed to workers. When some of these new objects are unable to 
> be serializable it will trigger an org.apache.spark.SparkException: Task not 
> serializable. The following test code snippet demonstrate this problem by 
> using a modified H2 driver:
>   test("INSERT to JDBC Datasource with UnserializableH2Driver") {
> object UnserializableH2Driver extends org.h2.Driver {
>   override def connect(url: String, info: Properties): Connection = {
> val result = super.connect(url, info)
> info.put("unserializableDriver", this)
> result
>   }
>   override def getParentLogger: Logger = ???
> }
> import scala.collection.JavaConversions._
> val oldDrivers = 
> DriverManager.getDrivers.filter(_.acceptsURL("jdbc:h2:")).toSeq
> oldDrivers.foreach{
>   DriverManager.deregisterDriver
> }
> DriverManager.registerDriver(UnserializableH2Driver)
> sql("INSERT INTO TABLE PEOPLE1 SELECT * FROM PEOPLE")
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).count)
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", 
> properties).collect()(0).length)
> DriverManager.deregisterDriver(UnserializableH2Driver)
> oldDrivers.foreach{
>   DriverManager.registerDriver
> }
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds unserializable objects into connection properties

2015-11-27 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030257#comment-15030257
 ] 

Peng Cheng commented on SPARK-10625:


Rebased on 1.6 branch

> Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds 
> unserializable objects into connection properties
> --
>
> Key: SPARK-10625
> URL: https://issues.apache.org/jira/browse/SPARK-10625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
> Environment: Ubuntu 14.04
>Reporter: Peng Cheng
>  Labels: jdbc, spark, sparksql
>
> Some JDBC drivers (e.g. SAP HANA) tries to optimize connection pooling by 
> adding new objects into the connection properties, which is then reused by 
> Spark to be deployed to workers. When some of these new objects are unable to 
> be serializable it will trigger an org.apache.spark.SparkException: Task not 
> serializable. The following test code snippet demonstrate this problem by 
> using a modified H2 driver:
>   test("INSERT to JDBC Datasource with UnserializableH2Driver") {
> object UnserializableH2Driver extends org.h2.Driver {
>   override def connect(url: String, info: Properties): Connection = {
> val result = super.connect(url, info)
> info.put("unserializableDriver", this)
> result
>   }
>   override def getParentLogger: Logger = ???
> }
> import scala.collection.JavaConversions._
> val oldDrivers = 
> DriverManager.getDrivers.filter(_.acceptsURL("jdbc:h2:")).toSeq
> oldDrivers.foreach{
>   DriverManager.deregisterDriver
> }
> DriverManager.registerDriver(UnserializableH2Driver)
> sql("INSERT INTO TABLE PEOPLE1 SELECT * FROM PEOPLE")
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).count)
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", 
> properties).collect()(0).length)
> DriverManager.deregisterDriver(UnserializableH2Driver)
> oldDrivers.foreach{
>   DriverManager.registerDriver
> }
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds unserializable objects into connection properties

2015-11-27 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030256#comment-15030256
 ] 

Peng Cheng commented on SPARK-10625:


rebased on 1.5 branch

> Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds 
> unserializable objects into connection properties
> --
>
> Key: SPARK-10625
> URL: https://issues.apache.org/jira/browse/SPARK-10625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
> Environment: Ubuntu 14.04
>Reporter: Peng Cheng
>  Labels: jdbc, spark, sparksql
>
> Some JDBC drivers (e.g. SAP HANA) tries to optimize connection pooling by 
> adding new objects into the connection properties, which is then reused by 
> Spark to be deployed to workers. When some of these new objects are unable to 
> be serializable it will trigger an org.apache.spark.SparkException: Task not 
> serializable. The following test code snippet demonstrate this problem by 
> using a modified H2 driver:
>   test("INSERT to JDBC Datasource with UnserializableH2Driver") {
> object UnserializableH2Driver extends org.h2.Driver {
>   override def connect(url: String, info: Properties): Connection = {
> val result = super.connect(url, info)
> info.put("unserializableDriver", this)
> result
>   }
>   override def getParentLogger: Logger = ???
> }
> import scala.collection.JavaConversions._
> val oldDrivers = 
> DriverManager.getDrivers.filter(_.acceptsURL("jdbc:h2:")).toSeq
> oldDrivers.foreach{
>   DriverManager.deregisterDriver
> }
> DriverManager.registerDriver(UnserializableH2Driver)
> sql("INSERT INTO TABLE PEOPLE1 SELECT * FROM PEOPLE")
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).count)
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", 
> properties).collect()(0).length)
> DriverManager.deregisterDriver(UnserializableH2Driver)
> oldDrivers.foreach{
>   DriverManager.registerDriver
> }
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds unserializable objects into connection properties

2015-11-27 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030087#comment-15030087
 ] 

Peng Cheng commented on SPARK-10625:


Yes, it was 2 months behind the master branch. I've just rebased on master and 
patched all unit test errors (a dependency used by the new JDBCRDD now 
validates whether all connection properties can be cast into String, so the deep 
copy is now used almost everywhere to correct some particularly undisciplined 
JDBC driver implementations). Is it mergeable now?
If you like, I can rebase on the 1.6 branch and issue a pull request shortly.
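
For illustration only (not the actual patch), here is a minimal sketch of the deep-copy idea mentioned above: before the connection properties are captured in the task closure, keep only the String-valued entries so that driver-injected, unserializable objects never reach the executors. The helper name is hypothetical.

{code:java}
import java.util.Properties

import scala.collection.JavaConverters._

// Hypothetical helper: copy only the String-valued entries of the connection
// properties. Objects that some drivers (e.g. SAP HANA) inject for pooling stay
// behind on the driver, so the copy remains serializable when shipped to executors.
def asConnectionProperties(props: Properties): Properties = {
  val copied = new Properties()
  props.stringPropertyNames().asScala.foreach { key =>
    copied.setProperty(key, props.getProperty(key))
  }
  copied
}
{code}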

> Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds 
> unserializable objects into connection properties
> --
>
> Key: SPARK-10625
> URL: https://issues.apache.org/jira/browse/SPARK-10625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
> Environment: Ubuntu 14.04
>Reporter: Peng Cheng
>  Labels: jdbc, spark, sparksql
>
> Some JDBC drivers (e.g. SAP HANA) tries to optimize connection pooling by 
> adding new objects into the connection properties, which is then reused by 
> Spark to be deployed to workers. When some of these new objects are unable to 
> be serializable it will trigger an org.apache.spark.SparkException: Task not 
> serializable. The following test code snippet demonstrate this problem by 
> using a modified H2 driver:
>   test("INSERT to JDBC Datasource with UnserializableH2Driver") {
> object UnserializableH2Driver extends org.h2.Driver {
>   override def connect(url: String, info: Properties): Connection = {
> val result = super.connect(url, info)
> info.put("unserializableDriver", this)
> result
>   }
>   override def getParentLogger: Logger = ???
> }
> import scala.collection.JavaConversions._
> val oldDrivers = 
> DriverManager.getDrivers.filter(_.acceptsURL("jdbc:h2:")).toSeq
> oldDrivers.foreach{
>   DriverManager.deregisterDriver
> }
> DriverManager.registerDriver(UnserializableH2Driver)
> sql("INSERT INTO TABLE PEOPLE1 SELECT * FROM PEOPLE")
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).count)
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", 
> properties).collect()(0).length)
> DriverManager.deregisterDriver(UnserializableH2Driver)
> oldDrivers.foreach{
>   DriverManager.registerDriver
> }
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds unserializable objects into connection properties

2015-11-26 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029361#comment-15029361
 ] 

Peng Cheng commented on SPARK-10625:


Hi committers, after the 1.5.2 release this is still not merged, and the bugfix is now 
blocking us from contributing other features (namely JDBC dialects for several 
databases). Can you tell me what should be done next to make it into the 
next release? (Resolve conflicts? Add more tests? It should be easy.) Thanks a lot, 
pals -- Yours, Peng

> Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds 
> unserializable objects into connection properties
> --
>
> Key: SPARK-10625
> URL: https://issues.apache.org/jira/browse/SPARK-10625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
> Environment: Ubuntu 14.04
>Reporter: Peng Cheng
>  Labels: jdbc, spark, sparksql
>
> Some JDBC drivers (e.g. SAP HANA) tries to optimize connection pooling by 
> adding new objects into the connection properties, which is then reused by 
> Spark to be deployed to workers. When some of these new objects are unable to 
> be serializable it will trigger an org.apache.spark.SparkException: Task not 
> serializable. The following test code snippet demonstrate this problem by 
> using a modified H2 driver:
>   test("INSERT to JDBC Datasource with UnserializableH2Driver") {
> object UnserializableH2Driver extends org.h2.Driver {
>   override def connect(url: String, info: Properties): Connection = {
> val result = super.connect(url, info)
> info.put("unserializableDriver", this)
> result
>   }
>   override def getParentLogger: Logger = ???
> }
> import scala.collection.JavaConversions._
> val oldDrivers = 
> DriverManager.getDrivers.filter(_.acceptsURL("jdbc:h2:")).toSeq
> oldDrivers.foreach{
>   DriverManager.deregisterDriver
> }
> DriverManager.registerDriver(UnserializableH2Driver)
> sql("INSERT INTO TABLE PEOPLE1 SELECT * FROM PEOPLE")
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).count)
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", 
> properties).collect()(0).length)
> DriverManager.deregisterDriver(UnserializableH2Driver)
> oldDrivers.foreach{
>   DriverManager.registerDriver
> }
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer

2015-10-09 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950901#comment-14950901
 ] 

Peng Cheng commented on SPARK-7708:
---

Also, this problem doesn't only affect the closure serializer: if a class contains a 
SerializableWritable[org.apache.hadoop.conf.Configuration], it will also throw 
the same deserialization exception.

> Incorrect task serialization with Kryo closure serializer
> -
>
> Key: SPARK-7708
> URL: https://issues.apache.org/jira/browse/SPARK-7708
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.2
>Reporter: Akshat Aranya
>
> I've been investigating the use of Kryo for closure serialization with Spark 
> 1.2, and it seems like I've hit upon a bug:
> When a task is serialized before scheduling, the following log message is 
> generated:
> [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, 
> , PROCESS_LOCAL, 302 bytes)
> This message comes from TaskSetManager which serializes the task using the 
> closure serializer.  Before the message is sent out, the TaskDescription 
> (which included the original task as a byte array), is serialized again into 
> a byte array with the closure serializer.  I added a log message for this in 
> CoarseGrainedSchedulerBackend, which produces the following output:
> [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132
> The serialized size of TaskDescription (132 bytes) turns out to be _smaller_ 
> than serialized task that it contains (302 bytes). This implies that 
> TaskDescription.buffer is not getting serialized correctly.
> On the executor side, the deserialization produces a null value for 
> TaskDescription.buffer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer

2015-10-09 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950886#comment-14950886
 ] 

Peng Cheng commented on SPARK-7708:
---

This problem is affecting me, and so far there is no way to tell whether it will 
appear (sometimes SerializableWritable can be serialized smoothly). Does 
anybody know how to work around it?

> Incorrect task serialization with Kryo closure serializer
> -
>
> Key: SPARK-7708
> URL: https://issues.apache.org/jira/browse/SPARK-7708
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.2
>Reporter: Akshat Aranya
>
> I've been investigating the use of Kryo for closure serialization with Spark 
> 1.2, and it seems like I've hit upon a bug:
> When a task is serialized before scheduling, the following log message is 
> generated:
> [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, 
> , PROCESS_LOCAL, 302 bytes)
> This message comes from TaskSetManager which serializes the task using the 
> closure serializer.  Before the message is sent out, the TaskDescription 
> (which included the original task as a byte array), is serialized again into 
> a byte array with the closure serializer.  I added a log message for this in 
> CoarseGrainedSchedulerBackend, which produces the following output:
> [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132
> The serialized size of TaskDescription (132 bytes) turns out to be _smaller_ 
> than serialized task that it contains (302 bytes). This implies that 
> TaskDescription.buffer is not getting serialized correctly.
> On the executor side, the deserialization produces a null value for 
> TaskDescription.buffer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds unserializable objects into connection properties

2015-10-02 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14941775#comment-14941775
 ] 

Peng Cheng commented on SPARK-10625:


The patch can be merged immediately; can someone verify and merge it?

> Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds 
> unserializable objects into connection properties
> --
>
> Key: SPARK-10625
> URL: https://issues.apache.org/jira/browse/SPARK-10625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
> Environment: Ubuntu 14.04
>Reporter: Peng Cheng
>  Labels: jdbc, spark, sparksql
>
> Some JDBC drivers (e.g. SAP HANA) tries to optimize connection pooling by 
> adding new objects into the connection properties, which is then reused by 
> Spark to be deployed to workers. When some of these new objects are unable to 
> be serializable it will trigger an org.apache.spark.SparkException: Task not 
> serializable. The following test code snippet demonstrate this problem by 
> using a modified H2 driver:
>   test("INSERT to JDBC Datasource with UnserializableH2Driver") {
> object UnserializableH2Driver extends org.h2.Driver {
>   override def connect(url: String, info: Properties): Connection = {
> val result = super.connect(url, info)
> info.put("unserializableDriver", this)
> result
>   }
>   override def getParentLogger: Logger = ???
> }
> import scala.collection.JavaConversions._
> val oldDrivers = 
> DriverManager.getDrivers.filter(_.acceptsURL("jdbc:h2:")).toSeq
> oldDrivers.foreach{
>   DriverManager.deregisterDriver
> }
> DriverManager.registerDriver(UnserializableH2Driver)
> sql("INSERT INTO TABLE PEOPLE1 SELECT * FROM PEOPLE")
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).count)
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", 
> properties).collect()(0).length)
> DriverManager.deregisterDriver(UnserializableH2Driver)
> oldDrivers.foreach{
>   DriverManager.registerDriver
> }
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds unserializable objects into connection properties

2015-09-21 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900980#comment-14900980
 ] 

Peng Cheng commented on SPARK-10625:


Exactly as you described: the driver can always be initialized from the username, 
password and URL; the other objects are for resource pooling or for saving 
object-creation costs.
I'm not sure if this applies to all drivers, but so far I haven't seen a counter-example.

> Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds 
> unserializable objects into connection properties
> --
>
> Key: SPARK-10625
> URL: https://issues.apache.org/jira/browse/SPARK-10625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
> Environment: Ubuntu 14.04
>Reporter: Peng Cheng
>  Labels: jdbc, spark, sparksql
>
> Some JDBC drivers (e.g. SAP HANA) tries to optimize connection pooling by 
> adding new objects into the connection properties, which is then reused by 
> Spark to be deployed to workers. When some of these new objects are unable to 
> be serializable it will trigger an org.apache.spark.SparkException: Task not 
> serializable. The following test code snippet demonstrate this problem by 
> using a modified H2 driver:
>   test("INSERT to JDBC Datasource with UnserializableH2Driver") {
> object UnserializableH2Driver extends org.h2.Driver {
>   override def connect(url: String, info: Properties): Connection = {
> val result = super.connect(url, info)
> info.put("unserializableDriver", this)
> result
>   }
>   override def getParentLogger: Logger = ???
> }
> import scala.collection.JavaConversions._
> val oldDrivers = 
> DriverManager.getDrivers.filter(_.acceptsURL("jdbc:h2:")).toSeq
> oldDrivers.foreach{
>   DriverManager.deregisterDriver
> }
> DriverManager.registerDriver(UnserializableH2Driver)
> sql("INSERT INTO TABLE PEOPLE1 SELECT * FROM PEOPLE")
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).count)
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", 
> properties).collect()(0).length)
> DriverManager.deregisterDriver(UnserializableH2Driver)
> oldDrivers.foreach{
>   DriverManager.registerDriver
> }
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds unserializable objects into connection properties

2015-09-16 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791252#comment-14791252
 ] 

Peng Cheng commented on SPARK-10625:


A pull request has been sent that contains 2 extra unit tests and a simple fix:
https://github.com/apache/spark/pull/8785

Can you help me validate it and merge it into 1.5.1?

> Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds 
> unserializable objects into connection properties
> --
>
> Key: SPARK-10625
> URL: https://issues.apache.org/jira/browse/SPARK-10625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
> Environment: Ubuntu 14.04
>Reporter: Peng Cheng
>  Labels: jdbc, spark, sparksql
>
> Some JDBC drivers (e.g. SAP HANA) try to optimize connection pooling by 
> adding new objects into the connection properties, which are then reused by 
> Spark and shipped to workers. When some of these new objects are not 
> serializable, this triggers an org.apache.spark.SparkException: Task not 
> serializable. The following test code snippet demonstrates the problem using 
> a modified H2 driver:
>   test("INSERT to JDBC Datasource with UnserializableH2Driver") {
> object UnserializableH2Driver extends org.h2.Driver {
>   override def connect(url: String, info: Properties): Connection = {
> val result = super.connect(url, info)
> info.put("unserializableDriver", this)
> result
>   }
>   override def getParentLogger: Logger = ???
> }
> import scala.collection.JavaConversions._
> val oldDrivers = 
> DriverManager.getDrivers.filter(_.acceptsURL("jdbc:h2:")).toSeq
> oldDrivers.foreach{
>   DriverManager.deregisterDriver
> }
> DriverManager.registerDriver(UnserializableH2Driver)
> sql("INSERT INTO TABLE PEOPLE1 SELECT * FROM PEOPLE")
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).count)
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", 
> properties).collect()(0).length)
> DriverManager.deregisterDriver(UnserializableH2Driver)
> oldDrivers.foreach{
>   DriverManager.registerDriver
> }
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds unserializable objects into connection properties

2015-09-15 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746450#comment-14746450
 ] 

Peng Cheng commented on SPARK-10625:


Branch: https://github.com/Schedule1/spark/tree/SPARK-10625
I will submit a pull request after I've fixed all the tests.

> Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds 
> unserializable objects into connection properties
> --
>
> Key: SPARK-10625
> URL: https://issues.apache.org/jira/browse/SPARK-10625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
> Environment: Ubuntu 14.04
>Reporter: Peng Cheng
>  Labels: jdbc, spark, sparksql
>
> Some JDBC drivers (e.g. SAP HANA) try to optimize connection pooling by 
> adding new objects into the connection properties, which are then reused by 
> Spark and shipped to workers. When some of these new objects are not 
> serializable, this triggers an org.apache.spark.SparkException: Task not 
> serializable. The following test code snippet demonstrates the problem using 
> a modified H2 driver:
>   test("INSERT to JDBC Datasource with UnserializableH2Driver") {
> object UnserializableH2Driver extends org.h2.Driver {
>   override def connect(url: String, info: Properties): Connection = {
> val result = super.connect(url, info)
> info.put("unserializableDriver", this)
> result
>   }
>   override def getParentLogger: Logger = ???
> }
> import scala.collection.JavaConversions._
> val oldDrivers = 
> DriverManager.getDrivers.filter(_.acceptsURL("jdbc:h2:")).toSeq
> oldDrivers.foreach{
>   DriverManager.deregisterDriver
> }
> DriverManager.registerDriver(UnserializableH2Driver)
> sql("INSERT INTO TABLE PEOPLE1 SELECT * FROM PEOPLE")
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).count)
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", 
> properties).collect()(0).length)
> DriverManager.deregisterDriver(UnserializableH2Driver)
> oldDrivers.foreach{
>   DriverManager.registerDriver
> }
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds unserializable objects into connection properties

2015-09-15 Thread Peng Cheng (JIRA)
Peng Cheng created SPARK-10625:
--

 Summary: Spark SQL JDBC read/write is unable to handle JDBC 
Drivers that adds unserializable objects into connection properties
 Key: SPARK-10625
 URL: https://issues.apache.org/jira/browse/SPARK-10625
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0, 1.4.1
 Environment: Ubuntu 14.04
Reporter: Peng Cheng


Some JDBC drivers (e.g. SAP HANA) try to optimize connection pooling by adding 
new objects into the connection properties, which are then reused by Spark and 
shipped to workers. When some of these new objects are not serializable, this 
triggers an org.apache.spark.SparkException: Task not serializable. The 
following test code snippet demonstrates the problem using a modified H2 driver:

  test("INSERT to JDBC Datasource with UnserializableH2Driver") {

object UnserializableH2Driver extends org.h2.Driver {

  override def connect(url: String, info: Properties): Connection = {

val result = super.connect(url, info)
info.put("unserializableDriver", this)
result
  }

  override def getParentLogger: Logger = ???
}

import scala.collection.JavaConversions._

val oldDrivers = 
DriverManager.getDrivers.filter(_.acceptsURL("jdbc:h2:")).toSeq
oldDrivers.foreach{
  DriverManager.deregisterDriver
}
DriverManager.registerDriver(UnserializableH2Driver)

sql("INSERT INTO TABLE PEOPLE1 SELECT * FROM PEOPLE")
assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).count)
assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", 
properties).collect()(0).length)

DriverManager.deregisterDriver(UnserializableH2Driver)
oldDrivers.foreach{
  DriverManager.registerDriver
}
  }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors

2015-08-19 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703585#comment-14703585
 ] 

Peng Cheng commented on SPARK-7481:
---

Thanks a lot! A long run to the end.

> Add Hadoop 2.6+ profile to pull in object store FS accessors
> 
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.3.1
>Reporter: Steve Loughran
>
> To keep the s3n classpath right and to add s3a, Swift & Azure, the 
> dependencies of Spark in a 2.6+ profile need to include the relevant object 
> store packages (hadoop-aws, hadoop-openstack, hadoop-azure).
> This adds more to the client bundle, but means a single Spark package can 
> talk to all of the stores.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7442) Spark 1.3.1 / Hadoop 2.6 prebuilt pacakge has broken S3 filesystem access

2015-06-11 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582965#comment-14582965
 ] 

Peng Cheng commented on SPARK-7442:
---

Still not fixed in 1.4.0 ...
Reverting to Hadoop 2.4 until this is resolved.

> Spark 1.3.1 / Hadoop 2.6 prebuilt pacakge has broken S3 filesystem access
> -
>
> Key: SPARK-7442
> URL: https://issues.apache.org/jira/browse/SPARK-7442
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.3.1
> Environment: OS X
>Reporter: Nicholas Chammas
>
> # Download Spark 1.3.1 pre-built for Hadoop 2.6 from the [Spark downloads 
> page|http://spark.apache.org/downloads.html].
> # Add {{localhost}} to your {{slaves}} file and {{start-all.sh}}
> # Fire up PySpark and try reading from S3 with something like this:
> {code}sc.textFile('s3n://bucket/file_*').count(){code}
> # You will get an error like this:
> {code}py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : java.io.IOException: No FileSystem for scheme: s3n{code}
> {{file:///...}} works. Spark 1.3.1 prebuilt for Hadoop 2.4 works. Spark 1.3.0 
> works.
> It's just the combination of Spark 1.3.1 prebuilt for Hadoop 2.6 accessing S3 
> that doesn't work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7442) Spark 1.3.1 / Hadoop 2.6 prebuilt pacakge has broken S3 filesystem access

2015-05-24 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557820#comment-14557820
 ] 

Peng Cheng edited comment on SPARK-7442 at 5/24/15 6:55 PM:


Adding the jar won't solve the problem:
you also need to set the following parameters:

  --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem
  --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem

But in my Hadoop 2.6 environment the fs implementation in the added jar is 
ignored by the worker's classloader for an unknown reason; see 
http://stackoverflow.com/questions/30426245/apache-spark-classloader-cannot-find-classdef-in-the-jar


was (Author: peng):
Adding the jar won't solve the problem:
you also need to set the following parameters:

  --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem
  --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem

But in my Hadoop 2.6 environment the added jar is ignored by the worker's 
classloader for an unknown reason; see 
http://stackoverflow.com/questions/30426245/apache-spark-classloader-cannot-find-classdef-in-the-jar

> Spark 1.3.1 / Hadoop 2.6 prebuilt pacakge has broken S3 filesystem access
> -
>
> Key: SPARK-7442
> URL: https://issues.apache.org/jira/browse/SPARK-7442
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.3.1
> Environment: OS X
>Reporter: Nicholas Chammas
>
> # Download Spark 1.3.1 pre-built for Hadoop 2.6 from the [Spark downloads 
> page|http://spark.apache.org/downloads.html].
> # Add {{localhost}} to your {{slaves}} file and {{start-all.sh}}
> # Fire up PySpark and try reading from S3 with something like this:
> {code}sc.textFile('s3n://bucket/file_*').count(){code}
> # You will get an error like this:
> {code}py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : java.io.IOException: No FileSystem for scheme: s3n{code}
> {{file:///...}} works. Spark 1.3.1 prebuilt for Hadoop 2.4 works. Spark 1.3.0 
> works.
> It's just the combination of Spark 1.3.1 prebuilt for Hadoop 2.6 accessing S3 
> that doesn't work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7442) Spark 1.3.1 / Hadoop 2.6 prebuilt pacakge has broken S3 filesystem access

2015-05-24 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557820#comment-14557820
 ] 

Peng Cheng commented on SPARK-7442:
---

Adding the jar won't solve the problem:
you also need to set the following parameters:

  --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem
  --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem

But in my Hadoop 2.6 environment the added jar is ignored by the worker's 
classloader for an unknown reason; see 
http://stackoverflow.com/questions/30426245/apache-spark-classloader-cannot-find-classdef-in-the-jar
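The same settings can also be applied programmatically before the SparkContext 
is created. A minimal sketch (assuming a plain Scala application rather than 
spark-submit flags; the app name is made up):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// The spark.hadoop.* prefix copies these entries into the Hadoop Configuration
// used on the workers, mirroring the --conf flags above.
val conf = new SparkConf()
  .setAppName("s3-read-example") // hypothetical name
  .set("spark.hadoop.fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
  .set("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3.S3FileSystem")
val sc = new SparkContext(conf)
{code}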

> Spark 1.3.1 / Hadoop 2.6 prebuilt pacakge has broken S3 filesystem access
> -
>
> Key: SPARK-7442
> URL: https://issues.apache.org/jira/browse/SPARK-7442
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.3.1
> Environment: OS X
>Reporter: Nicholas Chammas
>
> # Download Spark 1.3.1 pre-built for Hadoop 2.6 from the [Spark downloads 
> page|http://spark.apache.org/downloads.html].
> # Add {{localhost}} to your {{slaves}} file and {{start-all.sh}}
> # Fire up PySpark and try reading from S3 with something like this:
> {code}sc.textFile('s3n://bucket/file_*').count(){code}
> # You will get an error like this:
> {code}py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : java.io.IOException: No FileSystem for scheme: s3n{code}
> {{file:///...}} works. Spark 1.3.1 prebuilt for Hadoop 2.4 works. Spark 1.3.0 
> works.
> It's just the combination of Spark 1.3.1 prebuilt for Hadoop 2.6 accessing S3 
> that doesn't work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors

2015-05-23 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557549#comment-14557549
 ] 

Peng Cheng commented on SPARK-7481:
---

I've tried to do it but ran into a lot of headaches, as the AWS toolkit uses 
an outdated Jackson library.
That said, this feature is indeed blocking me from upgrading to Hadoop 2.6, so 
I guess it's important.
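In the meantime, a rough user-side workaround is to pull the connectors into 
the application build directly instead of waiting for a Spark profile. An 
illustrative sbt sketch (artifact versions are assumptions; hadoop-azure only 
exists from Hadoop 2.7.0 onwards):

{code}
// Illustrative only: add the object-store connectors to the application
// so the s3a/swift/wasb FileSystem classes end up on the classpath.
libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-aws"       % "2.7.1", // s3a / s3n
  "org.apache.hadoop" % "hadoop-openstack" % "2.7.1", // swift
  "org.apache.hadoop" % "hadoop-azure"     % "2.7.1"  // wasb (2.7.0+)
)
{code}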

> Add Hadoop 2.6+ profile to pull in object store FS accessors
> 
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.3.1
>Reporter: Steve Loughran
>
> To keep the s3n classpath right and to add s3a, Swift & Azure, the 
> dependencies of Spark in a 2.6+ profile need to include the relevant object 
> store packages (hadoop-aws, hadoop-openstack, hadoop-azure).
> This adds more to the client bundle, but means a single Spark package can 
> talk to all of the stores.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7043) KryoSerializer cannot be used with REPL to interpret code in which case class definition and its shipping are in the same line

2015-04-22 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507318#comment-14507318
 ] 

Peng Cheng commented on SPARK-7043:
---

Sorry, didn't read all the issues. Some of them may have the same cause.

> KryoSerializer cannot be used with REPL to interpret code in which case class 
> definition and its shipping are in the same line
> --
>
> Key: SPARK-7043
> URL: https://issues.apache.org/jira/browse/SPARK-7043
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.3.1
> Environment: Ubuntu 14.04, no hadoop
>Reporter: Peng Cheng
>Priority: Minor
>  Labels: classloader, kryo
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> When Spark-shell is deployed with the 
> "spark.serializer=org.apache.spark.serializer.KryoSerializer" option, it 
> cannot execute the following code (on a single line):
> case class Foo(i: Int);val ret = sc.parallelize((1 to 100).map(Foo), 10).collect()
> This problem does not occur with the JavaSerializer, or when the code is 
> split into 2 lines. The only plausible explanation is that KryoSerializer is 
> using a ClassLoader that is not registered as a subsidiary ClassLoader of the 
> one in the REPL.
> A "dirty" fix would be to just break the input at semicolons, but it's better 
> to fix the ClassLoader to avoid other liabilities.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7043) KryoSerializer cannot be used with REPL to interpret code in which case class definition and its shipping are in the same line

2015-04-21 Thread Peng Cheng (JIRA)
Peng Cheng created SPARK-7043:
-

 Summary: KryoSerializer cannot be used with REPL to interpret code 
in which case class definition and its shipping are in the same line
 Key: SPARK-7043
 URL: https://issues.apache.org/jira/browse/SPARK-7043
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.3.1
 Environment: Ubuntu 14.04, no hadoop
Reporter: Peng Cheng


When Spark-shell is deployed with the 
"spark.serializer=org.apache.spark.serializer.KryoSerializer" option, it cannot 
execute the following code (on a single line):

case class Foo(i: Int);val ret = sc.parallelize((1 to 100).map(Foo), 10).collect()

This problem does not occur with the JavaSerializer, or when the code is split 
into 2 lines. The only plausible explanation is that KryoSerializer is using a 
ClassLoader that is not registered as a subsidiary ClassLoader of the one in 
the REPL.

A "dirty" fix would be to just break the input at semicolons, but it's better 
to fix the ClassLoader to avoid other liabilities.
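For reference, a sketch of the two-line form that does work in the same 
Kryo-enabled shell (same code, only the statement boundary differs):

{code}
// Works: define the case class in its own REPL statement first...
case class Foo(i: Int)
// ...then ship its instances in a separate statement.
val ret = sc.parallelize((1 to 100).map(Foo), 10).collect()
{code}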



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4590) Early investigation of parameter server

2015-03-27 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385029#comment-14385029
 ] 

Peng Cheng commented on SPARK-4590:
---

Hi Reza, that's great news. Can someone point us to the thread referring to 
`IndexedRDD`?

> Early investigation of parameter server
> ---
>
> Key: SPARK-4590
> URL: https://issues.apache.org/jira/browse/SPARK-4590
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Reza Zadeh
>
> In the current implementation of GLM solvers, we save intermediate models 
> on the driver node and update them through broadcast and aggregation. Even 
> with torrent broadcast and tree aggregation added in 1.1, it is hard to go 
> beyond ~10 million features. This JIRA is for investigating the parameter 
> server approach, including algorithm, infrastructure, and dependencies.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact

2015-03-05 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349760#comment-14349760
 ] 

Peng Cheng commented on SPARK-4925:
---

Me too, even though the patch is already applied in 1.2.1.
We just need the last step.

> Publish Spark SQL hive-thriftserver maven artifact 
> ---
>
> Key: SPARK-4925
> URL: https://issues.apache.org/jira/browse/SPARK-4925
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 1.2.0
>Reporter: Alex Liu
> Fix For: 1.3.0, 1.2.1
>
>
> The hive-thriftserver maven artifact is needed for integrating Spark SQL with 
> Cassandra.
> Can we publish it to maven?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl

2015-01-03 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263681#comment-14263681
 ] 

Peng Cheng commented on SPARK-4923:
---

You are right; in fact 'Dev's API' simply means the method is susceptible to 
change without deprecation or notice, which the main 3 markings are least 
likely to undergo.
Could you please edit the patch and add more markings?

> Maven build should keep publishing spark-repl
> -
>
> Key: SPARK-4923
> URL: https://issues.apache.org/jira/browse/SPARK-4923
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Affects Versions: 1.2.0
>Reporter: Peng Cheng
>Priority: Critical
>  Labels: shell
> Attachments: 
> SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Spark-repl installation and deployment has been discontinued (see 
> SPARK-3452), but it's in the dependency list of a few projects that extend 
> its initialization process.
> Please remove the 'skip' setting in spark-repl and make it an 'official' API 
> to encourage more platforms to integrate with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl

2014-12-30 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261185#comment-14261185
 ] 

Peng Cheng commented on SPARK-4923:
---

Hey, someone waiting for this was complaining that the version number of 
spark-repl_2.10 on the Typesafe repo is actually 1.2.0-SNAPSHOT, not the 
released 1.2.0.
Was this done on purpose?

> Maven build should keep publishing spark-repl
> -
>
> Key: SPARK-4923
> URL: https://issues.apache.org/jira/browse/SPARK-4923
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Affects Versions: 1.2.0
>Reporter: Peng Cheng
>Priority: Critical
>  Labels: shell
> Attachments: 
> SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Spark-repl installation and deployment has been discontinued (see 
> SPARK-3452), but it's in the dependency list of a few projects that extend 
> its initialization process.
> Please remove the 'skip' setting in spark-repl and make it an 'official' API 
> to encourage more platforms to integrate with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4923) Maven build should keep publishing spark-repl

2014-12-22 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated SPARK-4923:
--
Target Version/s: 1.3.0, 1.2.1  (was: 1.3.0)

> Maven build should keep publishing spark-repl
> -
>
> Key: SPARK-4923
> URL: https://issues.apache.org/jira/browse/SPARK-4923
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Affects Versions: 1.2.0
>Reporter: Peng Cheng
>Priority: Critical
>  Labels: shell
> Attachments: 
> SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Spark-repl installation and deployment has been discontinued (see 
> SPARK-3452), but it's in the dependency list of a few projects that extend 
> its initialization process.
> Please remove the 'skip' setting in spark-repl and make it an 'official' API 
> to encourage more platforms to integrate with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4923) Maven build should keep publishing spark-repl

2014-12-22 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated SPARK-4923:
--
Attachment: SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch

Thank you so much! First patch uploaded.

> Maven build should keep publishing spark-repl
> -
>
> Key: SPARK-4923
> URL: https://issues.apache.org/jira/browse/SPARK-4923
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Affects Versions: 1.2.0
>Reporter: Peng Cheng
>Priority: Critical
>  Labels: shell
> Attachments: 
> SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Spark-repl installation and deployment has been discontinued (see 
> SPARK-3452), but it's in the dependency list of a few projects that extend 
> its initialization process.
> Please remove the 'skip' setting in spark-repl and make it an 'official' API 
> to encourage more platforms to integrate with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl

2014-12-22 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256322#comment-14256322
 ] 

Peng Cheng commented on SPARK-4923:
---

Sorry, my project is https://github.com/tribbloid/ISpark

> Maven build should keep publishing spark-repl
> -
>
> Key: SPARK-4923
> URL: https://issues.apache.org/jira/browse/SPARK-4923
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.2.0
>Reporter: Peng Cheng
>Priority: Critical
>  Labels: shell
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Spark-repl installation and deployment has been discontinued (see 
> SPARK-3452), but it's in the dependency list of a few projects that extend 
> its initialization process.
> Please remove the 'skip' setting in spark-repl and make it an 'official' API 
> to encourage more platforms to integrate with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl

2014-12-22 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256319#comment-14256319
 ] 

Peng Cheng commented on SPARK-4923:
---

Hey Patrick,

The following APIs have been in place since 1.0.0; IMHO they are stable enough 
for daily prototyping (creating case classes used to be defective, but that was 
fixed a long time ago). See the sketch after this list:
SparkILoop.getAddedJars()
$SparkIMain.bind
$SparkIMain.quietBind
$SparkIMain.interpret
(end of list :)
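A rough sketch of the integration pattern these calls enable (the names intp 
and hook are made up, and the bind/interpret signatures are assumed to be the 
ones inherited from the Scala 2.10 IMain, so treat this as an assumption rather 
than a documented API):

{code}
// Sketch only: bind a host-side object into the interpreter, then evaluate
// code that can see it. `intp` stands for an already-initialized SparkIMain.
val hook = new java.util.concurrent.atomic.AtomicInteger(0)
intp.bind("hook", "java.util.concurrent.atomic.AtomicInteger", hook)
intp.interpret("hook.incrementAndGet()")
{code}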

At first I assumed that further development on it had moved to Databricks 
Cloud, but the JIRA ticket was already there in September, so maybe community 
demand for this API really is low.
However, I would still suggest keeping it, and even promoting it to a 
Developer API; this would encourage more projects to integrate in a more 
flexible way, and save prototyping/QA cost by customizing REPL fixtures. People 
will still move to Databricks Cloud, which has far more features than that. 
Many influential projects already depend on the routinely published Scala REPL 
(e.g. Play Framework), and it would be strange for Spark not to do the same.
What do you think?

> Maven build should keep publishing spark-repl
> -
>
> Key: SPARK-4923
> URL: https://issues.apache.org/jira/browse/SPARK-4923
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.2.0
>Reporter: Peng Cheng
>Priority: Critical
>  Labels: shell
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Spark-repl installation and deployment has been discontinued (see 
> SPARK-3452), but it's in the dependency list of a few projects that extend 
> its initialization process.
> Please remove the 'skip' setting in spark-repl and make it an 'official' API 
> to encourage more platforms to integrate with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4923) Maven build should keep publishing spark-repl

2014-12-22 Thread Peng Cheng (JIRA)
Peng Cheng created SPARK-4923:
-

 Summary: Maven build should keep publishing spark-repl
 Key: SPARK-4923
 URL: https://issues.apache.org/jira/browse/SPARK-4923
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.2.0
Reporter: Peng Cheng
Priority: Critical


Spark-repl installation and deployment has been discontinued (see SPARK-3452), 
but it's in the dependency list of a few projects that extend its 
initialization process.
Please remove the 'skip' setting in spark-repl and make it an 'official' API to 
encourage more platforms to integrate with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on

2014-12-22 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256228#comment-14256228
 ] 

Peng Cheng commented on SPARK-3452:
---

IMHO the REPL should keep being published; one of my projects extends its API 
to initialize some third-party components on launch.
This should be made an 'official' API to encourage more platforms to integrate 
with it.

> Maven build should skip publishing artifacts people shouldn't depend on
> ---
>
> Key: SPARK-3452
> URL: https://issues.apache.org/jira/browse/SPARK-3452
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.0, 1.1.0
>Reporter: Patrick Wendell
>Assignee: Prashant Sharma
>Priority: Critical
> Fix For: 1.2.0
>
>
> I think it's easy to do this by just adding a skip configuration somewhere. 
> We shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1270) An optimized gradient descent implementation

2014-10-02 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156964#comment-14156964
 ] 

Peng Cheng commented on SPARK-1270:
---

Yo, any follow-up on this one?
I'm curious about the local-update part, as DistBelief has non-local model 
server shards.

> An optimized gradient descent implementation
> 
>
> Key: SPARK-1270
> URL: https://issues.apache.org/jira/browse/SPARK-1270
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.0.0
>Reporter: Xusen Yin
>  Labels: GradientDescent, MLLib,
> Fix For: 1.0.0
>
>
> The current implementation of GradientDescent is inefficient in some 
> respects, especially on high-latency networks. I propose a new implementation 
> of GradientDescent, which follows a parallelism model called 
> GradientDescentWithLocalUpdate, inspired by Jeff Dean's DistBelief and Eric 
> Xing's SSP. With a few modifications of runMiniBatchSGD, the 
> GradientDescentWithLocalUpdate can outperform the original sequential version 
> by about 4x without sacrificing accuracy, and can be easily adopted by most 
> classification and regression algorithms in MLlib.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2083) Allow local task to retry after failure.

2014-06-09 Thread Peng Cheng (JIRA)
Peng Cheng created SPARK-2083:
-

 Summary: Allow local task to retry after failure.
 Key: SPARK-2083
 URL: https://issues.apache.org/jira/browse/SPARK-2083
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.0.0
Reporter: Peng Cheng
Priority: Trivial


If a job is submitted to run locally using masterURL = "local[X]", Spark will 
not retry a failed task regardless of your "spark.task.maxFailures" setting. 
This design is meant to facilitate debugging and QA of Spark applications where 
all tasks are expected to succeed and yield a result. Unfortunately, it 
prevents a local job from finishing if any of its tasks cannot guarantee a 
result (e.g. one that visits an external resource/API), and retrying inside the 
task is less favoured (e.g. in production the task may need to run on a 
different machine).

The user can still set masterURL = "local[X,Y]" to override this (where Y is 
the local maxFailures), but that form is undocumented and hard to manage; a 
sketch is given below. A quick fix would be to add a new configuration property 
"spark.local.maxFailures" with a default value of 1, so users know exactly what 
to change when reading the documentation.
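A sketch of the existing local[X,Y] override mentioned above (assuming a 
Spark 1.0-style application; the app name is made up):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// "local[4,2]" = 4 local threads, and each task gets up to 2 attempts
// (the Y in local[X,Y]) instead of the local-mode default of 1.
val conf = new SparkConf().setMaster("local[4,2]").setAppName("local-retry-demo")
val sc = new SparkContext(conf)
{code}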




--
This message was sent by Atlassian JIRA
(v6.2#6252)