[jira] [Commented] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2017-04-10 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963815#comment-15963815
 ] 

Wenchen Fan commented on SPARK-12837:
-

[~Tagar] due to some implementation limitations, in Spark SQL we have to 
collect the whole broadcast dataset at the driver side. Driver-free broadcast 
is on the roadmap; for now you can work around this by setting 
`spark.sql.autoBroadcastJoinThreshold` lower.
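
For example, a minimal spark-shell sketch of that workaround (the 10 MB value 
below is only an illustrative choice; setting the threshold to -1 disables 
broadcast joins entirely):

{code}
// Lower the broadcast threshold so large tables are joined with a shuffle
// instead of being collected to the driver and broadcast.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)

// Equivalent call on the SQLContext API used elsewhere in this ticket:
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (10L * 1024 * 1024).toString)
{code}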

> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.0.0
>
>
> Executing a sql statement with a large number of partitions requires a high 
> memory space for the driver even there are no requests to collect data back 
> to the driver.
> Here are steps to re-produce the issue.
> 1. Start spark shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
>  "toto2" ) // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2017-04-10 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963812#comment-15963812
 ] 

Ruslan Dautkhanov commented on SPARK-12837:
---

[~cloud_fan] I didn't realize torrent broadcast could store some blocks on the 
driver.
Should there be a knob to disable or limit storing broadcast data on the driver?
We use Spark dynamic allocation with a minimum of 10 executors, so in our case 
this could potentially be disabled altogether. Thanks.

> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.0.0
>
>
> Executing a sql statement with a large number of partitions requires a high 
> memory space for the driver even there are no requests to collect data back 
> to the driver.
> Here are steps to re-produce the issue.
> 1. Start spark shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
>  "toto2" ) // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17564) Flaky RequestTimeoutIntegrationSuite, furtherRequestsDelay

2017-04-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17564.
-
   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 2.2.0
   2.1.1

> Flaky RequestTimeoutIntegrationSuite, furtherRequestsDelay
> --
>
> Key: SPARK-17564
> URL: https://issues.apache.org/jira/browse/SPARK-17564
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Adam Roberts
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.1.1, 2.2.0
>
>
> Could be related to [SPARK-10680]
> This is the test and one fix would be to increase the timeouts from 1.2 
> seconds to 5 seconds
> {code}
> // The timeout is relative to the LAST request sent, which is kinda weird, 
> but still.
>   // This test also makes sure the timeout works for Fetch requests as well 
> as RPCs.
>   @Test
>   public void furtherRequestsDelay() throws Exception {
> final byte[] response = new byte[16];
> final StreamManager manager = new StreamManager() {
>   @Override
>   public ManagedBuffer getChunk(long streamId, int chunkIndex) {
> Uninterruptibles.sleepUninterruptibly(FOREVER, TimeUnit.MILLISECONDS);
> return new NioManagedBuffer(ByteBuffer.wrap(response));
>   }
> };
> RpcHandler handler = new RpcHandler() {
>   @Override
>   public void receive(
>   TransportClient client,
>   ByteBuffer message,
>   RpcResponseCallback callback) {
> throw new UnsupportedOperationException();
>   }
>   @Override
>   public StreamManager getStreamManager() {
> return manager;
>   }
> };
> TransportContext context = new TransportContext(conf, handler);
> server = context.createServer();
> clientFactory = context.createClientFactory();
> TransportClient client = 
> clientFactory.createClient(TestUtils.getLocalHost(), server.getPort());
> // Send one request, which will eventually fail.
> TestCallback callback0 = new TestCallback();
> client.fetchChunk(0, 0, callback0);
> Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS);
> // Send a second request before the first has failed.
> TestCallback callback1 = new TestCallback();
> client.fetchChunk(0, 1, callback1);
> Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS);
> // not complete yet, but should complete soon
> assertEquals(-1, callback0.successLength);
> assertNull(callback0.failure);
> callback0.latch.await(60, TimeUnit.SECONDS);
> assertTrue(callback0.failure instanceof IOException);
> // failed at same time as previous
> assertTrue(callback1.failure instanceof IOException); // This is where we 
> fail because callback1.failure is null
>   }
> {code}
> If there are better suggestions for improving this test let's take them 
> onboard, I think using 5 sec timeout periods would be a place to start so 
> folks don't need to needlessly triage this failure. Will add a few prints and 
> report back



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3383) DecisionTree aggregate size could be smaller

2017-04-10 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963741#comment-15963741
 ] 

Yan Facai (颜发才) commented on SPARK-3383:


I think the task contains two subtasks:

1. Separate `split` from `bin`: currently, for each categorical feature there 
is 1 bin per split, so for N categories the communication cost is 2^{N-1} - 1 
bins. If we instead only collect stats per category and construct the splits 
afterwards (i.e. 1 bin per category), the communication cost is N bins.

2. As said in the Description, store all but the last bin and also store the 
total statistics for each node. The communication cost is then N - 1 bins.

(A small worked example of these counts is sketched below.)

I have a question: why are unordered features only allowed in multiclass 
classification?
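
A sketch of those counts in Scala (N = 4 is just an illustrative arity):

{code}
// Per-(node, feature) communication cost for a categorical feature with N values.
val n = 4
val binsPerSplit    = math.pow(2, n - 1).toInt - 1  // current scheme: 1 bin per split -> 7
val binsPerCategory = n                             // subtask 1: 1 bin per category   -> 4
val binsAllButLast  = n - 1                         // subtask 2: drop the last bin    -> 3
println(s"per split: $binsPerSplit, per category: $binsPerCategory, all but last: $binsAllButLast")
{code}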


> DecisionTree aggregate size could be smaller
> 
>
> Key: SPARK-3383
> URL: https://issues.apache.org/jira/browse/SPARK-3383
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Storage and communication optimization:
> DecisionTree aggregate statistics could store less data (described below).  
> The savings would be significant for datasets with many low-arity categorical 
> features (binary features, or unordered categorical features).  Savings would 
> be negligible for continuous features.
> DecisionTree stores a vector sufficient statistics for each (node, feature, 
> bin).  We could store 1 fewer bin per (node, feature): For a given (node, 
> feature), if we store these vectors for all but the last bin, and also store 
> the total statistics for each node, then we could compute the statistics for 
> the last bin.  For binary and unordered categorical features, this would cut 
> in half the number of bins to store and communicate.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2017-04-10 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963705#comment-15963705
 ] 

Wenchen Fan edited comment on SPARK-12837 at 4/11/17 1:06 AM:
--

[~Tagar] I think this may be because the size estimation was not accurate. Can 
you try the master branch? It should have been fixed.


was (Author: cloud_fan):
[~Tagar] I think this may be because the size estimation was not accurate, can 
you try master branch? It should be fixed.

> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.0.0
>
>
> Executing a sql statement with a large number of partitions requires a high 
> memory space for the driver even there are no requests to collect data back 
> to the driver.
> Here are steps to re-produce the issue.
> 1. Start spark shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
>  "toto2" ) // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2017-04-10 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963705#comment-15963705
 ] 

Wenchen Fan commented on SPARK-12837:
-

[~Tagar] I think this may be because the size estimation was not accurate, can 
you try master branch? It should be fixed.

> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.0.0
>
>
> Executing a sql statement with a large number of partitions requires a high 
> memory space for the driver even there are no requests to collect data back 
> to the driver.
> Here are steps to re-produce the issue.
> 1. Start spark shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
>  "toto2" ) // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition

2017-04-10 Thread Stephane Maarek (JIRA)
Stephane Maarek created SPARK-20287:
---

 Summary: Kafka Consumer should be able to subscribe to more than 
one topic partition
 Key: SPARK-20287
 URL: https://issues.apache.org/jira/browse/SPARK-20287
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Stephane Maarek


As I understand it, and as it stands, one Kafka Consumer is created for each 
topic partition in the source Kafka topics, and these consumers are cached.

cf 
https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48

In my opinion, that makes the design an anti-pattern for Kafka and highly 
inefficient:
- Each Kafka consumer creates a connection to Kafka.
- Spark doesn't leverage the strength of the Kafka consumer, which is that 
Kafka automatically assigns and balances partitions amongst all the consumers 
that share the same group.id.
- You can still cache your Kafka consumer even if it is assigned multiple 
partitions (see the sketch below).

I'm not sure how that translates to Spark's underlying RDD architecture, but 
from a Kafka standpoint I believe creating one consumer per partition is a big 
overhead, and a risk, as the user may have to increase the 
spark.streaming.kafka.consumer.cache.maxCapacity parameter.

Happy to discuss to understand the rationale.
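
For reference, a minimal sketch of a single vanilla Kafka 0.10 consumer being 
assigned several partitions over one connection (the broker address, topic name 
and group.id are placeholders, and this is the plain Kafka client API rather 
than anything Spark-specific; subscribe() would additionally give the automatic 
group-based balancing mentioned above):

{code}
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // placeholder broker
props.put("group.id", "example-group")             // placeholder group.id
props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")

val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)

// One consumer, one connection, explicitly assigned two partitions of the same topic.
val partitions = Seq(new TopicPartition("example-topic", 0), new TopicPartition("example-topic", 1))
consumer.assign(partitions.asJava)

val records = consumer.poll(1000)   // 0.10 API: timeout in milliseconds
records.asScala.foreach(r => println(s"partition ${r.partition} offset ${r.offset}"))
consumer.close()
{code}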



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18555) na.fill miss up original values in long integers

2017-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963685#comment-15963685
 ] 

Apache Spark commented on SPARK-18555:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/17601

> na.fill miss up original values in long integers
> 
>
> Key: SPARK-18555
> URL: https://issues.apache.org/jira/browse/SPARK-18555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Mahmoud Rawas
>Assignee: Song Jun
>Priority: Critical
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> Manly the issue is clarified in the following example:
> Given a Dataset: 
> scala> data.show
> |  a|  b|
> |  1|  2|
> | -1| -2|
> |9123146099426677101|9123146560113991650|
> theoretically when we call na.fill(0) nothing should change, while the 
> current result is:
> scala> data.na.fill(0).show
> |  a|  b|
> |  1|  2|
> | -1| -2|
> |9123146099426676736|9123146560113991680|



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18555) na.fill miss up original values in long integers

2017-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963684#comment-15963684
 ] 

Apache Spark commented on SPARK-18555:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/17600

> na.fill miss up original values in long integers
> 
>
> Key: SPARK-18555
> URL: https://issues.apache.org/jira/browse/SPARK-18555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Mahmoud Rawas
>Assignee: Song Jun
>Priority: Critical
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> Manly the issue is clarified in the following example:
> Given a Dataset: 
> scala> data.show
> |  a|  b|
> |  1|  2|
> | -1| -2|
> |9123146099426677101|9123146560113991650|
> theoretically when we call na.fill(0) nothing should change, while the 
> current result is:
> scala> data.na.fill(0).show
> |  a|  b|
> |  1|  2|
> | -1| -2|
> |9123146099426676736|9123146560113991680|



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20270) na.fill will change the values in long or integer when the default value is in double

2017-04-10 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-20270:

Fix Version/s: 2.1.1
   2.0.3

> na.fill will change the values in long or integer when the default value is 
> in double
> -
>
> Key: SPARK-20270
> URL: https://issues.apache.org/jira/browse/SPARK-20270
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Critical
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> This bug was partially addressed in SPARK-18555, but the root cause isn't 
> completely solved. This bug is pretty critical since it changes the member id 
> in Long in our application if the member id can not be represented by Double 
> losslessly when the member id is very big. 
> Here is an example how this happens, with
> {code}
>   Seq[(java.lang.Long, java.lang.Double)]((null, 3.14), 
> (9123146099426677101L, null),
> (9123146560113991650L, 1.6), (null, null)).toDF("a", 
> "b").na.fill(0.2),
> {code}
> the logical plan will be
> {code}
> == Analyzed Logical Plan ==
> a: bigint, b: double
> Project [cast(coalesce(cast(a#232L as double), cast(0.2 as double)) as 
> bigint) AS a#240L, cast(coalesce(nanvl(b#233, cast(null as double)), 0.2) as 
> double) AS b#241]
> +- Project [_1#229L AS a#232L, _2#230 AS b#233]
>+- LocalRelation [_1#229L, _2#230]
> {code}.
> Note that even the value is not null, Spark will cast the Long into Double 
> first. Then if it's not null, Spark will cast it back to Long which results 
> in losing precision. 
> The behavior should be that the original value should not be changed if it's 
> not null, but Spark will change the value which is wrong.
> With the PR, the logical plan will be 
> {code}
> == Analyzed Logical Plan ==
> a: bigint, b: double
> Project [coalesce(a#232L, cast(0.2 as bigint)) AS a#240L, 
> coalesce(nanvl(b#233, cast(null as double)), cast(0.2 as double)) AS b#241]
> +- Project [_1#229L AS a#232L, _2#230 AS b#233]
>+- LocalRelation [_1#229L, _2#230]
> {code}
> which behaves correctly without changing the original Long values.
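
As an illustration of the precision loss described above, the round trip 
through Double can be reproduced directly in a Scala shell with the sample 
value from SPARK-18555 (a sketch, not part of the original report):

{code}
// Casting a large Long to Double and back drops the low bits, which is exactly
// the corruption shown in the na.fill examples above.
val original: Long     = 9123146099426677101L
val roundTripped: Long = original.toDouble.toLong
println(original)       // 9123146099426677101
println(roundTripped)   // 9123146099426676736
{code}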



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18555) na.fill miss up original values in long integers

2017-04-10 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-18555:

Fix Version/s: 2.1.1
   2.0.3
  Component/s: SQL

> na.fill miss up original values in long integers
> 
>
> Key: SPARK-18555
> URL: https://issues.apache.org/jira/browse/SPARK-18555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Mahmoud Rawas
>Assignee: Song Jun
>Priority: Critical
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> Manly the issue is clarified in the following example:
> Given a Dataset: 
> scala> data.show
> |  a|  b|
> |  1|  2|
> | -1| -2|
> |9123146099426677101|9123146560113991650|
> theoretically when we call na.fill(0) nothing should change, while the 
> current result is:
> scala> data.na.fill(0).show
> |  a|  b|
> |  1|  2|
> | -1| -2|
> |9123146099426676736|9123146560113991680|



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17564) Flaky RequestTimeoutIntegrationSuite, furtherRequestsDelay

2017-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17564:


Assignee: Apache Spark

> Flaky RequestTimeoutIntegrationSuite, furtherRequestsDelay
> --
>
> Key: SPARK-17564
> URL: https://issues.apache.org/jira/browse/SPARK-17564
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Adam Roberts
>Assignee: Apache Spark
>Priority: Minor
>
> Could be related to [SPARK-10680]
> This is the test and one fix would be to increase the timeouts from 1.2 
> seconds to 5 seconds
> {code}
> // The timeout is relative to the LAST request sent, which is kinda weird, 
> but still.
>   // This test also makes sure the timeout works for Fetch requests as well 
> as RPCs.
>   @Test
>   public void furtherRequestsDelay() throws Exception {
> final byte[] response = new byte[16];
> final StreamManager manager = new StreamManager() {
>   @Override
>   public ManagedBuffer getChunk(long streamId, int chunkIndex) {
> Uninterruptibles.sleepUninterruptibly(FOREVER, TimeUnit.MILLISECONDS);
> return new NioManagedBuffer(ByteBuffer.wrap(response));
>   }
> };
> RpcHandler handler = new RpcHandler() {
>   @Override
>   public void receive(
>   TransportClient client,
>   ByteBuffer message,
>   RpcResponseCallback callback) {
> throw new UnsupportedOperationException();
>   }
>   @Override
>   public StreamManager getStreamManager() {
> return manager;
>   }
> };
> TransportContext context = new TransportContext(conf, handler);
> server = context.createServer();
> clientFactory = context.createClientFactory();
> TransportClient client = 
> clientFactory.createClient(TestUtils.getLocalHost(), server.getPort());
> // Send one request, which will eventually fail.
> TestCallback callback0 = new TestCallback();
> client.fetchChunk(0, 0, callback0);
> Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS);
> // Send a second request before the first has failed.
> TestCallback callback1 = new TestCallback();
> client.fetchChunk(0, 1, callback1);
> Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS);
> // not complete yet, but should complete soon
> assertEquals(-1, callback0.successLength);
> assertNull(callback0.failure);
> callback0.latch.await(60, TimeUnit.SECONDS);
> assertTrue(callback0.failure instanceof IOException);
> // failed at same time as previous
> assertTrue(callback1.failure instanceof IOException); // This is where we 
> fail because callback1.failure is null
>   }
> {code}
> If there are better suggestions for improving this test let's take them 
> onboard, I think using 5 sec timeout periods would be a place to start so 
> folks don't need to needlessly triage this failure. Will add a few prints and 
> report back



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17564) Flaky RequestTimeoutIntegrationSuite, furtherRequestsDelay

2017-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17564:


Assignee: (was: Apache Spark)

> Flaky RequestTimeoutIntegrationSuite, furtherRequestsDelay
> --
>
> Key: SPARK-17564
> URL: https://issues.apache.org/jira/browse/SPARK-17564
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Adam Roberts
>Priority: Minor
>
> Could be related to [SPARK-10680]
> This is the test and one fix would be to increase the timeouts from 1.2 
> seconds to 5 seconds
> {code}
> // The timeout is relative to the LAST request sent, which is kinda weird, 
> but still.
>   // This test also makes sure the timeout works for Fetch requests as well 
> as RPCs.
>   @Test
>   public void furtherRequestsDelay() throws Exception {
> final byte[] response = new byte[16];
> final StreamManager manager = new StreamManager() {
>   @Override
>   public ManagedBuffer getChunk(long streamId, int chunkIndex) {
> Uninterruptibles.sleepUninterruptibly(FOREVER, TimeUnit.MILLISECONDS);
> return new NioManagedBuffer(ByteBuffer.wrap(response));
>   }
> };
> RpcHandler handler = new RpcHandler() {
>   @Override
>   public void receive(
>   TransportClient client,
>   ByteBuffer message,
>   RpcResponseCallback callback) {
> throw new UnsupportedOperationException();
>   }
>   @Override
>   public StreamManager getStreamManager() {
> return manager;
>   }
> };
> TransportContext context = new TransportContext(conf, handler);
> server = context.createServer();
> clientFactory = context.createClientFactory();
> TransportClient client = 
> clientFactory.createClient(TestUtils.getLocalHost(), server.getPort());
> // Send one request, which will eventually fail.
> TestCallback callback0 = new TestCallback();
> client.fetchChunk(0, 0, callback0);
> Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS);
> // Send a second request before the first has failed.
> TestCallback callback1 = new TestCallback();
> client.fetchChunk(0, 1, callback1);
> Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS);
> // not complete yet, but should complete soon
> assertEquals(-1, callback0.successLength);
> assertNull(callback0.failure);
> callback0.latch.await(60, TimeUnit.SECONDS);
> assertTrue(callback0.failure instanceof IOException);
> // failed at same time as previous
> assertTrue(callback1.failure instanceof IOException); // This is where we 
> fail because callback1.failure is null
>   }
> {code}
> If there are better suggestions for improving this test let's take them 
> onboard, I think using 5 sec timeout periods would be a place to start so 
> folks don't need to needlessly triage this failure. Will add a few prints and 
> report back



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17564) Flaky RequestTimeoutIntegrationSuite, furtherRequestsDelay

2017-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963663#comment-15963663
 ] 

Apache Spark commented on SPARK-17564:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/17599

> Flaky RequestTimeoutIntegrationSuite, furtherRequestsDelay
> --
>
> Key: SPARK-17564
> URL: https://issues.apache.org/jira/browse/SPARK-17564
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Adam Roberts
>Priority: Minor
>
> Could be related to [SPARK-10680]
> This is the test and one fix would be to increase the timeouts from 1.2 
> seconds to 5 seconds
> {code}
> // The timeout is relative to the LAST request sent, which is kinda weird, 
> but still.
>   // This test also makes sure the timeout works for Fetch requests as well 
> as RPCs.
>   @Test
>   public void furtherRequestsDelay() throws Exception {
> final byte[] response = new byte[16];
> final StreamManager manager = new StreamManager() {
>   @Override
>   public ManagedBuffer getChunk(long streamId, int chunkIndex) {
> Uninterruptibles.sleepUninterruptibly(FOREVER, TimeUnit.MILLISECONDS);
> return new NioManagedBuffer(ByteBuffer.wrap(response));
>   }
> };
> RpcHandler handler = new RpcHandler() {
>   @Override
>   public void receive(
>   TransportClient client,
>   ByteBuffer message,
>   RpcResponseCallback callback) {
> throw new UnsupportedOperationException();
>   }
>   @Override
>   public StreamManager getStreamManager() {
> return manager;
>   }
> };
> TransportContext context = new TransportContext(conf, handler);
> server = context.createServer();
> clientFactory = context.createClientFactory();
> TransportClient client = 
> clientFactory.createClient(TestUtils.getLocalHost(), server.getPort());
> // Send one request, which will eventually fail.
> TestCallback callback0 = new TestCallback();
> client.fetchChunk(0, 0, callback0);
> Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS);
> // Send a second request before the first has failed.
> TestCallback callback1 = new TestCallback();
> client.fetchChunk(0, 1, callback1);
> Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS);
> // not complete yet, but should complete soon
> assertEquals(-1, callback0.successLength);
> assertNull(callback0.failure);
> callback0.latch.await(60, TimeUnit.SECONDS);
> assertTrue(callback0.failure instanceof IOException);
> // failed at same time as previous
> assertTrue(callback1.failure instanceof IOException); // This is where we 
> fail because callback1.failure is null
>   }
> {code}
> If there are better suggestions for improving this test let's take them 
> onboard, I think using 5 sec timeout periods would be a place to start so 
> folks don't need to needlessly triage this failure. Will add a few prints and 
> report back



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18084) write.partitionBy() does not recognize nested columns that select() can access

2017-04-10 Thread Rupesh Mane (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963621#comment-15963621
 ] 

Rupesh Mane commented on SPARK-18084:
-

Any update on when this will be fixed?

> write.partitionBy() does not recognize nested columns that select() can access
> --
>
> Key: SPARK-18084
> URL: https://issues.apache.org/jira/browse/SPARK-18084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Here's a simple repro in the PySpark shell:
> {code}
> from pyspark.sql import Row
> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))])
> df = spark.createDataFrame(rdd)
> df.printSchema()
> df.select('a.b').show()  # works
> df.write.partitionBy('a.b').text('/tmp/test')  # doesn't work
> {code}
> Here's what I see when I run this:
> {code}
> >>> from pyspark.sql import Row
> >>> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))])
> >>> df = spark.createDataFrame(rdd)
> >>> df.printSchema()
> root
>  |-- a: struct (nullable = true)
>  ||-- b: long (nullable = true)
> >>> df.show()
> +---+
> |  a|
> +---+
> |[5]|
> +---+
> >>> df.select('a.b').show()
> +---+
> |  b|
> +---+
> |  5|
> +---+
> >>> df.write.partitionBy('a.b').text('/tmp/test')
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
>  line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o233.text.
> : org.apache.spark.sql.AnalysisException: Partition column a.b not found in 
> schema 
> StructType(StructField(a,StructType(StructField(b,LongType,true)),true));
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:366)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.partitionColumnsSchema(PartitioningUtils.scala:366)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.validatePartitionColumn(PartitioningUtils.scala:349)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:458)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
>   at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:534)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/readwriter.py",
>  line 656, in text
> 

[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos

2017-04-10 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963601#comment-15963601
 ] 

Michael Gummelt commented on SPARK-16742:
-

bq. So, assuming that Mesos is configured properly, then it should be OK for 
Spark code to distribute user credentials.

Right.  It's just a matter of the cluster admin syncing Mesos credentials and 
kerberos credentials properly.  In summary, it's simpler in YARN because YARN 
is Kerberos-aware, whereas Mesos isn't.

bq. That sounds like you might need the current code that distributes keytabs 
and logs in the cluster to make even client mode work in this setup.

Since client mode requires network access to the Mesos master, we generally 
assume that the user is on the same network as their datacenter, and can thus 
kinit against the KDC.


> Kerberos support for Spark on Mesos
> ---
>
> Key: SPARK-16742
> URL: https://issues.apache.org/jira/browse/SPARK-16742
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Reporter: Michael Gummelt
>
> We at Mesosphere have written Kerberos support for Spark on Mesos.  We'll be 
> contributing it to Apache Spark soon.
> Mesosphere design doc: 
> https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6
> Mesosphere code: 
> https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos

2017-04-10 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963592#comment-15963592
 ] 

Marcelo Vanzin commented on SPARK-16742:


bq. It authenticates the Mesos principal, and this principal is allowed to 
launch processes only as certain Linux users. It's up to the cluster admin to 
set up this mapping appropriately.

Ok, that sounds similar then. Basically, you *can* set up Mesos so that it can 
do this securely, which was my initial question. (Being able to set things up 
in an insecure way is not actually that interesting; I just wanted to make sure 
there *is* a way to set things up securely.)

So, assuming that Mesos is configured properly, then it should be OK for Spark 
code to distribute user credentials.

bq. I actually said a "user might not be kinit'd". They may, however, have 
access to the keytab.

That sounds like you might need the current code that distributes keytabs and 
logs in the cluster to make even client mode work in this setup.

> Kerberos support for Spark on Mesos
> ---
>
> Key: SPARK-16742
> URL: https://issues.apache.org/jira/browse/SPARK-16742
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Reporter: Michael Gummelt
>
> We at Mesosphere have written Kerberos support for Spark on Mesos.  We'll be 
> contributing it to Apache Spark soon.
> Mesosphere design doc: 
> https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6
> Mesosphere code: 
> https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos

2017-04-10 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963583#comment-15963583
 ] 

Michael Gummelt commented on SPARK-16742:
-

bq. That sounds problematic. The way YARN works is that it actually 
authenticates the user. Are you saying that Mesos doesn't do user 
authentication?

AFAICT, YARN doesn't authenticate the Linux user.  The KDC authenticates the 
Kerberos principal, and YARN maps this principal to a Linux user via 
{{hadoop.security.auth_to_local}}.  So if a user authenticated to the KDC via a 
principal "Joe", and the {{auth_to_local}} rule maps "Joe" to "root", then 
"Joe" can launch processes as "root", even though he never provided "root" 
credentials.  It's up to the cluster administrator to properly set up this 
Kerberos -> Linux mapping.

It's a similar story with Mesos.  Mesos doesn't authenticate the Linux user.  
It authenticates the Mesos principal, and this principal is allowed to launch 
processes only as certain Linux users.  It's up to the cluster admin to set up 
this mapping appropriately.

The big difference is that, by default, YARN will map the Kerberos principal to 
the Linux user with the same name, so there's no problem, whereas Mesos will 
let the driver launch executors as any Linux user that its Mesos principal is 
allowed to launch as.  So it's up to the admin to only provide users with 
consistent Mesos and Kerberos credentials.

bq. Are you saying that for YARN or Mesos? When YARN runs in Kerberos mode, 
Kerberos dictates the user.

I'm talking about YARN.  See the above comment.  If {{auth_to_local}} is used 
like I think it is, then that's what ultimately determines the Linux user, not 
just Kerberos.

bq.  The use case you mention ("user starting an application in cluster mode 
with no kerberos credentials") sounds actually worrying

I actually said a "user might not be kinit'd".  They may, however, have access 
to the keytab.  But since they're not on the same network as the KDC, they 
can't authenticate directly.  But they do have the creds.


> Kerberos support for Spark on Mesos
> ---
>
> Key: SPARK-16742
> URL: https://issues.apache.org/jira/browse/SPARK-16742
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Reporter: Michael Gummelt
>
> We at Mesosphere have written Kerberos support for Spark on Mesos.  We'll be 
> contributing it to Apache Spark soon.
> Mesosphere design doc: 
> https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6
> Mesosphere code: 
> https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20284) Make SerializationStream and DeserializationStream extend Closeable

2017-04-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20284:
--
Priority: Trivial  (was: Minor)

It might be fine to add, but does it help anything? try-with-resources is only 
for Java.

> Make SerializationStream and DeserializationStream extend Closeable
> ---
>
> Key: SPARK-20284
> URL: https://issues.apache.org/jira/browse/SPARK-20284
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Sergei Lebedev
>Priority: Trivial
>
> Both {{SerializationStream}} and {{DeserializationStream}} implement 
> {{close}} but do not extend {{Closeable}}. As a result, these streams cannot 
> be used in try-with-resources.
> Was this intentional or rather nobody ever needed that?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20286) dynamicAllocation.executorIdleTimeout is ignored after unpersist

2017-04-10 Thread Miguel Pérez (JIRA)
Miguel Pérez created SPARK-20286:


 Summary: dynamicAllocation.executorIdleTimeout is ignored after 
unpersist
 Key: SPARK-20286
 URL: https://issues.apache.org/jira/browse/SPARK-20286
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.1
Reporter: Miguel Pérez


With dynamic allocation enabled, it seems that executors whose cached data has 
been unpersisted are still being killed according to the 
{{dynamicAllocation.cachedExecutorIdleTimeout}} configuration, instead of 
{{dynamicAllocation.executorIdleTimeout}}. Assuming the default configuration 
({{dynamicAllocation.cachedExecutorIdleTimeout = Infinity}}), an executor with 
unpersisted data won't be released until the job ends.

*How to reproduce* (see the sketch below)
- Set different values for {{dynamicAllocation.executorIdleTimeout}} and 
{{dynamicAllocation.cachedExecutorIdleTimeout}}
- Load a file into an RDD and persist it
- Execute an action on the RDD (like a count) so some executors are activated.
- When the action has finished, unpersist the RDD.
- The application UI correctly removes the persisted data from the *Storage* 
tab, but if you look at the *Executors* tab, you will find that the executors 
remain *active* until {{dynamicAllocation.cachedExecutorIdleTimeout}} is 
reached.
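
A minimal spark-shell sketch of those steps (the timeout values and the input 
path are placeholders; the config keys are the standard 
{{spark.dynamicAllocation.*}} settings):

{code}
// Launch the shell with dynamic allocation and deliberately different idle timeouts:
//   bin/spark-shell \
//     --conf spark.dynamicAllocation.enabled=true \
//     --conf spark.shuffle.service.enabled=true \
//     --conf spark.dynamicAllocation.executorIdleTimeout=60s \
//     --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=3600s

val rdd = sc.textFile("hdfs:///tmp/some-large-file").persist()  // load and persist
rdd.count()      // action: executors are allocated and blocks are cached
rdd.unpersist()  // expectation: idle executors should now follow executorIdleTimeout
// Observed: executors stay alive until cachedExecutorIdleTimeout is reached instead.
{code}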



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos

2017-04-10 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963559#comment-15963559
 ] 

Marcelo Vanzin commented on SPARK-16742:


bq. But in Spark, this isn't currently derived from the Kerberos principal. 
It's configured by the user. 

That sounds problematic. The way YARN works is that it actually authenticates 
the user. Are you saying that Mesos doesn't do user authentication?

The overarching point I'm trying to make with my comments is that for kerberos 
support to be properly secure, the cluster manager needs to be secure. That 
means running applications from different users in a way that doesn't allow 
them to hack each other. YARN does that by doing authentication when users 
request applications to run, and by running the containers as the requested 
user. The exact way in which YARN achieves that seems kinda tangential to the 
actual question, which is:

What is the story for Mesos?

Basically, the way in which you support Kerberos will depend on how your 
cluster manager does security. If Mesos behaves more like Spark Standalone than 
it does like YARN, then any solution that requires distributing user 
credentials is a non-starter, because it just becomes a security liability.

bq. It would be a vulnerability, for example, if the Linux user for the 
executors is simply derived from that of the driver, because two human users 
running as the same Linux user, but logged in via different Kerberos 
principals, would be able to see each others' tokens.

Are you saying that for YARN or Mesos? When YARN runs in Kerberos mode, 
Kerberos dictates the user. That's how the user is authenticating to YARN. 
There's a requirement that an OS user exists matching that particular user, but 
that's just a configuration detail. The security comes from the fact that the 
user is authenticating to the KDC.

bq. You're right that we could implement cluster mode in some form, but I'd 
rather keep the initial PR small. I hope that's acceptable.

The main point I'm trying to convey here is that running things in client and 
cluster mode should be exactly the same from the point of view of distributing 
tokens. The use case you mention ("user starting an application in cluster mode 
with no kerberos credentials") sounds actually worrying, because what's 
authenticating the user?

> Kerberos support for Spark on Mesos
> ---
>
> Key: SPARK-16742
> URL: https://issues.apache.org/jira/browse/SPARK-16742
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Reporter: Michael Gummelt
>
> We at Mesosphere have written Kerberos support for Spark on Mesos.  We'll be 
> contributing it to Apache Spark soon.
> Mesosphere design doc: 
> https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6
> Mesosphere code: 
> https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2017-04-10 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963510#comment-15963510
 ] 

Ruslan Dautkhanov edited comment on SPARK-12837 at 4/10/17 9:29 PM:


It might be a bug in broadcast join.

Following Spark 2 query fails with 
{quote}Total size of serialized results of 128 tasks (1026.2 MB) is bigger than 
spark.driver.maxResultSize (1024.0 MB){quote}
when we set
{code}
sqlc.setConf("spark.sql.autoBroadcastJoinThreshold", 500 *1024*1024)   # 500 mb
{code}

{noformat}
SELECT  . . . 
m.year, 
m.quarter, 
t.individ, 
t.hh_key
FROMmv_dev.mv_raw_all_20170314 m, 
disc_dv.tsp_dv_02122017 t
where m.psn = t.person_seq_no 
limit 10
{noformat}

when we drop down to 400Mb
{code}
sqlc.setConf("spark.sql.autoBroadcastJoinThreshold", 400 *1024*1024)   # 400 mb
{code}
, this error does not show up.


was (Author: tagar):
It might be a bug in broadcast join.

Following Spark 2 query fails with 
Total size of serialized results of 128 tasks (1026.2 MB) is bigger than 
spark.driver.maxResultSize (1024.0 MB)
when we set
{code}
sqlc.setConf("spark.sql.autoBroadcastJoinThreshold", 500 *1024*1024)   # 500 mb
{code}

{noformat}
SELECT  . . . 
m.year, 
m.quarter, 
t.individ, 
t.hh_key
FROMmv_dev.mv_raw_all_20170314 m, 
disc_dv.tsp_dv_02122017 t
where m.psn = t.person_seq_no 
limit 10
{noformat}

when we drop down to 400Mb
{code}
sqlc.setConf("spark.sql.autoBroadcastJoinThreshold", 400 *1024*1024)   # 400 mb
{code}
, this error does not show up.

> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.0.0
>
>
> Executing a sql statement with a large number of partitions requires a high 
> memory space for the driver even there are no requests to collect data back 
> to the driver.
> Here are steps to re-produce the issue.
> 1. Start spark shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
>  "toto2" ) // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2017-04-10 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963510#comment-15963510
 ] 

Ruslan Dautkhanov commented on SPARK-12837:
---

It might be a bug in broadcast join.

Following Spark 2 query fails with 
Total size of serialized results of 128 tasks (1026.2 MB) is bigger than 
spark.driver.maxResultSize (1024.0 MB)
when we set
{code}
sqlc.setConf("spark.sql.autoBroadcastJoinThreshold", 500 *1024*1024)   # 500 mb
{code}

{noformat}
SELECT  . . . 
m.year, 
m.quarter, 
t.individ, 
t.hh_key
FROMmv_dev.mv_raw_all_20170314 m, 
disc_dv.tsp_dv_02122017 t
where m.psn = t.person_seq_no 
limit 10
{noformat}

when we drop down to 400Mb
{code}
sqlc.setConf("spark.sql.autoBroadcastJoinThreshold", 400 *1024*1024)   # 400 mb
{code}
, this error does not show up.

> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.0.0
>
>
> Executing a sql statement with a large number of partitions requires a high 
> memory space for the driver even there are no requests to collect data back 
> to the driver.
> Here are steps to re-produce the issue.
> 1. Start spark shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
>  "toto2" ) // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20283) Add preOptimizationBatches

2017-04-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-20283.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Add preOptimizationBatches
> --
>
> Key: SPARK-20283
> URL: https://issues.apache.org/jira/browse/SPARK-20283
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.2.0
>
>
> We currently have postHocOptimizationBatches, but not preOptimizationBatches. 
> This patch adds preOptimizationBatches so the optimizer debugging extensions 
> are symmetric.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20282) Flaky test: org.apache.spark.sql.streaming/StreamingQuerySuite/OneTime_trigger__commit_log__and_exception

2017-04-10 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-20282.
--
   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 2.2.0

> Flaky test: 
> org.apache.spark.sql.streaming/StreamingQuerySuite/OneTime_trigger__commit_log__and_exception
> -
>
> Key: SPARK-20282
> URL: https://issues.apache.org/jira/browse/SPARK-20282
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming, Tests
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>  Labels: flaky-test
> Fix For: 2.2.0
>
>
> I saw the following failure several times:
> {code}
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
> Assert on query failed: 
> == Progress ==
>AssertOnQuery(, )
>StopStream
>AddData to MemoryStream[value#30891]: 1,2
>
> StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@35cdc93a,Map())
>CheckAnswer: [6],[3]
>StopStream
> => AssertOnQuery(, )
>AssertOnQuery(, )
>StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@cdb247d,Map())
>CheckAnswer: [6],[3]
>StopStream
>AddData to MemoryStream[value#30891]: 3
>
> StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@55394e4d,Map())
>CheckLastBatch: [2]
>StopStream
>AddData to MemoryStream[value#30891]: 0
>
> StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@749aa997,Map())
>ExpectFailure[org.apache.spark.SparkException, isFatalError: false]
>AssertOnQuery(, )
>AssertOnQuery(, incorrect start offset or end offset on 
> exception)
> == Stream ==
> Output Mode: Append
> Stream state: not started
> Thread state: dead
> == Sink ==
> 0: [6] [3]
> == Plan ==
>  
>  
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
>   at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
>   at 
> org.apache.spark.sql.streaming.StreamTest$class.failTest$1(StreamTest.scala:347)
>   at 
> org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:318)
>   at 
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:483)
>   at 
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:357)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.sql.streaming.StreamTest$class.liftedTree1$1(StreamTest.scala:357)
>   at 
> org.apache.spark.sql.streaming.StreamTest$class.testStream(StreamTest.scala:356)
>   at 
> org.apache.spark.sql.streaming.StreamingQuerySuite.testStream(StreamingQuerySuite.scala:41)
>   at 
> org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply$mcV$sp(StreamingQuerySuite.scala:166)
>   at 
> org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161)
>   at 
> org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161)
>   at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply$mcV$sp(SQLTestUtils.scala:268)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> 

[jira] [Updated] (SPARK-20285) Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup

2017-04-10 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20285:
-
Affects Version/s: 2.1.1
   2.0.3

> Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup
> 
>
> Key: SPARK-20285
> URL: https://issues.apache.org/jira/browse/SPARK-20285
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> Saw the following failure locally:
> {code}
> Traceback (most recent call last):
>   File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, 
> in test_cogroup
> self._test_func(input, func, expected, sort=True, input2=input2)
>   File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, 
> in _test_func
> self.assertEqual(expected, result)
> AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != []
> First list contains 3 additional elements.
> First extra element 0:
> [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))]
> + []
> - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))],
> -  [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))],
> -  [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]]
> {code}
> It also happened on Jenkins: 
> http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20285) Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup

2017-04-10 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20285:
-
Affects Version/s: (was: 2.1.1)
   (was: 2.0.3)
   (was: 2.2.0)
   2.1.0

> Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup
> 
>
> Key: SPARK-20285
> URL: https://issues.apache.org/jira/browse/SPARK-20285
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> Saw the following failure locally:
> {code}
> Traceback (most recent call last):
>   File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, 
> in test_cogroup
> self._test_func(input, func, expected, sort=True, input2=input2)
>   File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, 
> in _test_func
> self.assertEqual(expected, result)
> AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != []
> First list contains 3 additional elements.
> First extra element 0:
> [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))]
> + []
> - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))],
> -  [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))],
> -  [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]]
> {code}
> It also happened on Jenkins: 
> http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20285) Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup

2017-04-10 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-20285.
--
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1
   2.0.3

> Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup
> 
>
> Key: SPARK-20285
> URL: https://issues.apache.org/jira/browse/SPARK-20285
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> Saw the following failure locally:
> {code}
> Traceback (most recent call last):
>   File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, 
> in test_cogroup
> self._test_func(input, func, expected, sort=True, input2=input2)
>   File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, 
> in _test_func
> self.assertEqual(expected, result)
> AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != []
> First list contains 3 additional elements.
> First extra element 0:
> [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))]
> + []
> - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))],
> -  [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))],
> -  [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]]
> {code}
> It also happened on Jenkins: 
> http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos

2017-04-10 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963469#comment-15963469
 ] 

Michael Gummelt commented on SPARK-16742:
-

[~jerryshao] Great! The current RPC used in Mesos is very simple.  The executor 
just periodically requests the latest credentials from the driver, which uses 
the keytab to renew them periodically.  We can swap in a different mechanism 
once that exists.

I left a comment on your design doc.
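
For illustration only, a toy sketch of the polling pattern described above; the names ({{CredentialEndpoint}}, {{CredentialPoller}}, {{intervalSec}}) are hypothetical and not the actual Spark or Mesosphere classes:
{code}
import java.util.concurrent.{Executors, TimeUnit}

// Hypothetical driver-side endpoint the executor can ask for fresh tokens.
trait CredentialEndpoint { def latestTokens(): Array[Byte] }

// Executor-side poller: fetch the latest serialized tokens on a fixed interval.
class CredentialPoller(driver: CredentialEndpoint, intervalSec: Long) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  @volatile private var tokens: Array[Byte] = Array.empty

  def start(): Unit = scheduler.scheduleAtFixedRate(new Runnable {
    override def run(): Unit = tokens = driver.latestTokens()
  }, 0L, intervalSec, TimeUnit.SECONDS)

  def currentTokens: Array[Byte] = tokens
  def stop(): Unit = scheduler.shutdown()
}
{code}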

> Kerberos support for Spark on Mesos
> ---
>
> Key: SPARK-16742
> URL: https://issues.apache.org/jira/browse/SPARK-16742
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Reporter: Michael Gummelt
>
> We at Mesosphere have written Kerberos support for Spark on Mesos.  We'll be 
> contributing it to Apache Spark soon.
> Mesosphere design doc: 
> https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6
> Mesosphere code: 
> https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16742) Kerberos support for Spark on Mesos

2017-04-10 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963446#comment-15963446
 ] 

Michael Gummelt edited comment on SPARK-16742 at 4/10/17 8:35 PM:
--

[~vanzin]

bq. The most basic feature needed for any kerberos-related work is user 
isolation (different users cannot mess with each others' processes). I was 
under the impression that Mesos supported that.

Mesos of course supports configuring the Linux user that a process runs as.  But 
in Spark, this isn't currently derived from the Kerberos principal.  It's 
configured by the user.  The scheduler's *Mesos* principal, along with ACLs 
configured in Mesos, is what determines which Linux users are allowed.  That's 
why I was asking about {{hadoop.security.auth_to_local}}, to understand how 
YARN determines what Linux user to run executors as.  It would be a 
vulnerability, for example, if the Linux user for the executors is simply 
derived from that of the driver, because two human users running as the same 
Linux user, but logged in via different Kerberos principals, would be able to 
see each other's tokens.

bq. I don't know where this notion that cluster mode requires you to distribute 
keytabs comes from

As you said, it's mostly the renewal use case that requires distributing the 
keytab, but that's not all.  In many Mesos setups, and certainly in DC/OS, the 
submitting user might not already be kinit'd.  They may be running from outside 
the datacenter entirely, without network access to the KDC.

You're right that we could implement cluster mode in some form, but I'd rather 
keep the initial PR small.  I hope that's acceptable.


was (Author: mgummelt):
[~vanzin]

bq. The most basic feature needed for any kerberos-related work is user 
isolation (different users cannot mess with each others' processes). I was 
under the impression that Mesos supported that.

Mesos of course supports configuring the Linux user that a process runs as.  But 
in Spark, this isn't currently derived from the Kerberos principal.  It's 
configured by the user.  The scheduler's *Mesos* principal, along with ACLs 
configured in Mesos, is what determines which Linux users are allowed.  That's 
why I was asking about {{hadoop.security.auth_to_local}}, to understand how 
YARN determines what Linux user to run executors as.  It would be a 
vulnerability, for example, if the Linux user for the executors is simply 
derived from that of the driver, because two human users running as the same 
Linux user, but logged in via different Kerberos principals, would be able to 
see each other's tokens.

bq. I don't know where this notion that cluster mode requires you to distribute 
keytabs comes from

As you said, it's mostly the renewal use case that requires distributing the 
keytab, but that's not all.  In many Mesos setups, and certainly in DC/OS, the 
submitting user might not already be kinit'd.  They may be running from outside 
the datacenter entirely, without network access to the KDC.

> Kerberos support for Spark on Mesos
> ---
>
> Key: SPARK-16742
> URL: https://issues.apache.org/jira/browse/SPARK-16742
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Reporter: Michael Gummelt
>
> We at Mesosphere have written Kerberos support for Spark on Mesos.  We'll be 
> contributing it to Apache Spark soon.
> Mesosphere design doc: 
> https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6
> Mesosphere code: 
> https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16742) Kerberos support for Spark on Mesos

2017-04-10 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963446#comment-15963446
 ] 

Michael Gummelt edited comment on SPARK-16742 at 4/10/17 8:20 PM:
--

[~vanzin]

bq. The most basic feature needed for any kerberos-related work is user 
isolation (different users cannot mess with each others' processes). I was 
under the impression that Mesos supported that.

Mesos of course supports configuring the Linux user that a process runs as.  But 
in Spark, this isn't currently derived from the Kerberos principal.  It's 
configured by the user.  The scheduler's *Mesos* principal, along with ACLs 
configured in Mesos, is what determines which Linux users are allowed.  That's 
why I was asking about {{hadoop.security.auth_to_local}}, to understand how 
YARN determines what Linux user to run executors as.  It would be a 
vulnerability, for example, if the Linux user for the executors is simply 
derived from that of the driver, because two human users running as the same 
Linux user, but logged in via different Kerberos principals, would be able to 
see each other's tokens.

bq. I don't know where this notion that cluster mode requires you to distribute 
keytabs comes from

As you said, it's mostly the renewal use case that requires distributing the 
keytab, but that's not all.  In many Mesos setups, and certainly in DC/OS, the 
submitting user might not already be kinit'd.  They may be running from outside 
the datacenter entirely, without network access to the KDC.


was (Author: mgummelt):
[~vanzin]

bq. The most basic feature needed for any kerberos-related work is user 
isolation (different users cannot mess with each others' processes). I was 
under the impression that Mesos supported that.

Mesos of course supports configuring the Linux user that a process runs as.  But 
in Spark, this isn't currently derived from the Kerberos principal.  It's 
configured by the user, and the *Mesos* principal of the scheduler, along with 
ACLs configured in Mesos, is what determines which Linux users are allowed.  
That's why I was asking about {{hadoop.security.auth_to_local}}, to understand 
how YARN determines what Linux user to run executors as.  It would be a 
vulnerability, for example, if the Linux user for the executors is simply 
derived from that of the driver, because two human users running as the same 
Linux user, but logged in via different Kerberos principals, would be able to 
see each other's tokens.

bq. I don't know where this notion that cluster mode requires you to distribute 
keytabs comes from

As you said, it's mostly the renewal use case that requires distributing the 
keytab, but that's not all.  In many Mesos setups, and certainly in DC/OS, the 
submitting user might not already be kinit'd.  They may be running from outside 
the datacenter entirely, without network access to the KDC.

> Kerberos support for Spark on Mesos
> ---
>
> Key: SPARK-16742
> URL: https://issues.apache.org/jira/browse/SPARK-16742
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Reporter: Michael Gummelt
>
> We at Mesosphere have written Kerberos support for Spark on Mesos.  We'll be 
> contributing it to Apache Spark soon.
> Mesosphere design doc: 
> https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6
> Mesosphere code: 
> https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos

2017-04-10 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963446#comment-15963446
 ] 

Michael Gummelt commented on SPARK-16742:
-

bq. The most basic feature needed for any kerberos-related work is user 
isolation (different users cannot mess with each others' processes). I was 
under the impression that Mesos supported that.

Mesos of course supports configuring the Linux user that a process runs as.  But 
in Spark, this isn't currently derived from the Kerberos principal.  It's 
configured by the user, and the *Mesos* principal of the scheduler, along with 
ACLs configured in Mesos, is what determines which Linux users are allowed.  
That's why I was asking about {{hadoop.security.auth_to_local}}, to understand 
how YARN determines what Linux user to run executors as.  It would be a 
vulnerability, for example, if the Linux user for the executors is simply 
derived from that of the driver, because two human users running as the same 
Linux user, but logged in via different Kerberos principals, would be able to 
see each other's tokens.

bq. I don't know where this notion that cluster mode requires you to distribute 
keytabs comes from

As you said, it's mostly the renewal use case that requires distributing the 
keytab, but that's not all.  In many Mesos setups, and certainly in DC/OS, the 
submitting user might not already be kinit'd.  They may be running from outside 
the datacenter entirely, without network access to the KDC.

> Kerberos support for Spark on Mesos
> ---
>
> Key: SPARK-16742
> URL: https://issues.apache.org/jira/browse/SPARK-16742
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Reporter: Michael Gummelt
>
> We at Mesosphere have written Kerberos support for Spark on Mesos.  We'll be 
> contributing it to Apache Spark soon.
> Mesosphere design doc: 
> https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6
> Mesosphere code: 
> https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16742) Kerberos support for Spark on Mesos

2017-04-10 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963446#comment-15963446
 ] 

Michael Gummelt edited comment on SPARK-16742 at 4/10/17 8:18 PM:
--

[~vanzin]

bq. The most basic feature needed for any kerberos-related work is user 
isolation (different users cannot mess with each others' processes). I was 
under the impression that Mesos supported that.

Mesos of course supports configuring the Linux user that a process runs as.  But 
in Spark, this isn't currently derived from the Kerberos principal.  It's 
configured by the user, and the *Mesos* principal of the scheduler, along with 
ACLs configured in Mesos, is what determines which Linux users are allowed.  
That's why I was asking about {{hadoop.security.auth_to_local}}, to understand 
how YARN determines what Linux user to run executors as.  It would be a 
vulnerability, for example, if the Linux user for the executors is simply 
derived from that of the driver, because two human users running as the same 
Linux user, but logged in via different Kerberos principals, would be able to 
see each other's tokens.

bq. I don't know where this notion that cluster mode requires you to distribute 
keytabs comes from

As you said, it's mostly the renewal use case that requires distributing the 
keytab, but that's not all.  In many Mesos setups, and certainly in DC/OS, the 
submitting user might not already be kinit'd.  They may be running from outside 
the datacenter entirely, without network access to the KDC.


was (Author: mgummelt):
bq. The most basic feature needed for any kerberos-related work is user 
isolation (different users cannot mess with each others' processes). I was 
under the impression that Mesos supported that.

Mesos of course supports configuring the Linux user that a process runs as.  But 
in Spark, this isn't currently derived from the Kerberos principal.  It's 
configured by the user, and the *Mesos* principal of the scheduler, along with 
ACLs configured in Mesos, is what determines which Linux users are allowed.  
That's why I was asking about {{hadoop.security.auth_to_local}}, to understand 
how YARN determines what Linux user to run executors as.  It would be a 
vulnerability, for example, if the Linux user for the executors is simply 
derived from that of the driver, because two human users running as the same 
Linux user, but logged in via different Kerberos principals, would be able to 
see each other's tokens.

bq. I don't know where this notion that cluster mode requires you to distribute 
keytabs comes from

As you said, it's mostly the renewal use case that requires distributing the 
keytab, but that's not all.  In many Mesos setups, and certainly in DC/OS, the 
submitting user might not already be kinit'd.  They may be running from outside 
the datacenter entirely, without network access to the KDC.

> Kerberos support for Spark on Mesos
> ---
>
> Key: SPARK-16742
> URL: https://issues.apache.org/jira/browse/SPARK-16742
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Reporter: Michael Gummelt
>
> We at Mesosphere have written Kerberos support for Spark on Mesos.  We'll be 
> contributing it to Apache Spark soon.
> Mesosphere design doc: 
> https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6
> Mesosphere code: 
> https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20280) SharedInMemoryCache Weigher integer overflow

2017-04-10 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-20280.
---
   Resolution: Fixed
 Assignee: Bogdan Raducanu
Fix Version/s: 2.2.0
   2.1.1

> SharedInMemoryCache Weigher integer overflow
> 
>
> Key: SPARK-20280
> URL: https://issues.apache.org/jira/browse/SPARK-20280
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Bogdan Raducanu
>Assignee: Bogdan Raducanu
> Fix For: 2.1.1, 2.2.0
>
>
> in FileStatusCache.scala:
> {code}
> .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] {
>   override def weigh(key: (ClientId, Path), value: Array[FileStatus]): 
> Int = {
> (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt
>   }})
> {code}
> Weigher.weigh returns Int but the size of an Array[FileStatus] could be 
> bigger than Int.MaxValue. Then, a negative value is returned, leading to this 
> exception:
> {code}
> * [info]   java.lang.IllegalStateException: Weights must be non-negative
> * [info]   at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:149)
> * [info]   at 
> com.google.common.cache.LocalCache$Segment.setValue(LocalCache.java:2223)
> * [info]   at 
> com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2944)
> * [info]   at com.google.common.cache.LocalCache.put(LocalCache.java:4212)
> * [info]   at 
> com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
> * [info]   at 
> org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:131)
> {code}
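
One illustrative way to keep the weight non-negative (a sketch mirroring the quoted snippet, not necessarily the change that was merged) is to clamp the estimate before converting it to Int:
{code}
// Sketch: clamp the estimated size so the Int weight cannot overflow.
.weigher(new Weigher[(ClientId, Path), Array[FileStatus]] {
  override def weigh(key: (ClientId, Path), value: Array[FileStatus]): Int = {
    val estimated = SizeEstimator.estimate(key) + SizeEstimator.estimate(value)
    math.min(estimated, Int.MaxValue.toLong).toInt
  }})
{code}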



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20285) Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup

2017-04-10 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963422#comment-15963422
 ] 

Shixiong Zhu commented on SPARK-20285:
--

https://github.com/apache/spark/pull/17597

> Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup
> 
>
> Key: SPARK-20285
> URL: https://issues.apache.org/jira/browse/SPARK-20285
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> Saw the following failure locally:
> {code}
> Traceback (most recent call last):
>   File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, 
> in test_cogroup
> self._test_func(input, func, expected, sort=True, input2=input2)
>   File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, 
> in _test_func
> self.assertEqual(expected, result)
> AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != []
> First list contains 3 additional elements.
> First extra element 0:
> [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))]
> + []
> - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))],
> -  [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))],
> -  [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]]
> {code}
> It also happened on Jenkins: 
> http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-20285) Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup

2017-04-10 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20285:
-
Comment: was deleted

(was: https://github.com/apache/spark/pull/17597)

> Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup
> 
>
> Key: SPARK-20285
> URL: https://issues.apache.org/jira/browse/SPARK-20285
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> Saw the following failure locally:
> {code}
> Traceback (most recent call last):
>   File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, 
> in test_cogroup
> self._test_func(input, func, expected, sort=True, input2=input2)
>   File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, 
> in _test_func
> self.assertEqual(expected, result)
> AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != []
> First list contains 3 additional elements.
> First extra element 0:
> [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))]
> + []
> - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))],
> -  [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))],
> -  [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]]
> {code}
> It also happened on Jenkins: 
> http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20284) Make SerializationStream and DeserializationStream extend Closeable

2017-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20284:


Assignee: (was: Apache Spark)

> Make SerializationStream and DeserializationStream extend Closeable
> ---
>
> Key: SPARK-20284
> URL: https://issues.apache.org/jira/browse/SPARK-20284
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Sergei Lebedev
>Priority: Minor
>
> Both {{SerializationStream}} and {{DeserializationStream}} implement 
> {{close}} but do not extend {{Closeable}}. As a result, these streams cannot 
> be used in try-with-resources.
> Was this intentional or rather nobody ever needed that?
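
For illustration, a minimal Scala sketch of what extending {{Closeable}} enables; the class and helper names here are hypothetical, not the actual Spark types:
{code}
import java.io.Closeable

// Hypothetical stream that extends Closeable in addition to defining close().
class MySerializationStream extends Closeable {
  def writeObject[T](t: T): Unit = { /* write t to an underlying stream */ }
  override def close(): Unit = { /* release the underlying stream */ }
}

// Loan-pattern helper (the Scala analogue of Java's try-with-resources):
// works for any Closeable, so the stream is closed even if the body throws.
def withResource[R <: Closeable, A](resource: R)(body: R => A): A =
  try body(resource) finally resource.close()

withResource(new MySerializationStream) { s => s.writeObject("payload") }
{code}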



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20284) Make SerializationStream and DeserializationStream extend Closeable

2017-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963419#comment-15963419
 ] 

Apache Spark commented on SPARK-20284:
--

User 'superbobry' has created a pull request for this issue:
https://github.com/apache/spark/pull/17598

> Make SerializationStream and DeserializationStream extend Closeable
> ---
>
> Key: SPARK-20284
> URL: https://issues.apache.org/jira/browse/SPARK-20284
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Sergei Lebedev
>Priority: Minor
>
> Both {{SerializationStream}} and {{DeserializationStream}} implement 
> {{close}} but do not extend {{Closeable}}. As a result, these streams cannot 
> be used in try-with-resources.
> Was this intentional or rather nobody ever needed that?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20284) Make SerializationStream and DeserializationStream extend Closeable

2017-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20284:


Assignee: Apache Spark

> Make SerializationStream and DeserializationStream extend Closeable
> ---
>
> Key: SPARK-20284
> URL: https://issues.apache.org/jira/browse/SPARK-20284
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Sergei Lebedev
>Assignee: Apache Spark
>Priority: Minor
>
> Both {{SerializationStream}} and {{DeserializationStream}} implement 
> {{close}} but do not extend {{Closeable}}. As a result, these streams cannot 
> be used in try-with-resources.
> Was this intentional or rather nobody ever needed that?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20285) Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup

2017-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20285:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup
> 
>
> Key: SPARK-20285
> URL: https://issues.apache.org/jira/browse/SPARK-20285
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> Saw the following failure locally:
> {code}
> Traceback (most recent call last):
>   File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, 
> in test_cogroup
> self._test_func(input, func, expected, sort=True, input2=input2)
>   File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, 
> in _test_func
> self.assertEqual(expected, result)
> AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != []
> First list contains 3 additional elements.
> First extra element 0:
> [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))]
> + []
> - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))],
> -  [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))],
> -  [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]]
> {code}
> It also happened on Jenkins: 
> http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20285) Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup

2017-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963412#comment-15963412
 ] 

Apache Spark commented on SPARK-20285:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/17597

> Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup
> 
>
> Key: SPARK-20285
> URL: https://issues.apache.org/jira/browse/SPARK-20285
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> Saw the following failure locally:
> {code}
> Traceback (most recent call last):
>   File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, 
> in test_cogroup
> self._test_func(input, func, expected, sort=True, input2=input2)
>   File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, 
> in _test_func
> self.assertEqual(expected, result)
> AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != []
> First list contains 3 additional elements.
> First extra element 0:
> [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))]
> + []
> - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))],
> -  [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))],
> -  [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]]
> {code}
> It also happened on Jenkins: 
> http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20285) Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup

2017-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20285:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup
> 
>
> Key: SPARK-20285
> URL: https://issues.apache.org/jira/browse/SPARK-20285
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Minor
>
> Saw the following failure locally:
> {code}
> Traceback (most recent call last):
>   File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, 
> in test_cogroup
> self._test_func(input, func, expected, sort=True, input2=input2)
>   File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, 
> in _test_func
> self.assertEqual(expected, result)
> AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != []
> First list contains 3 additional elements.
> First extra element 0:
> [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))]
> + []
> - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))],
> -  [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))],
> -  [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]]
> {code}
> It also happened on Jenkins: 
> http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20285) Flaky test:

2017-04-10 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-20285:


 Summary: Flaky test: 
 Key: SPARK-20285
 URL: https://issues.apache.org/jira/browse/SPARK-20285
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.2.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
Priority: Minor


Saw the following failure locally:

{code}
Traceback (most recent call last):
  File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, 
in test_cogroup
self._test_func(input, func, expected, sort=True, input2=input2)
  File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, 
in _test_func
self.assertEqual(expected, result)
AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != []

First list contains 3 additional elements.
First extra element 0:
[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))]

+ []
- [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))],
-  [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))],
-  [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]]
{code}

It also happened on Jenkins: 
http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20285) Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup

2017-04-10 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20285:
-
Summary: Flaky test: 
pyspark.streaming.tests.BasicOperationTests.test_cogroup  (was: Flaky test: )

> Flaky test: pyspark.streaming.tests.BasicOperationTests.test_cogroup
> 
>
> Key: SPARK-20285
> URL: https://issues.apache.org/jira/browse/SPARK-20285
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> Saw the following failure locally:
> {code}
> Traceback (most recent call last):
>   File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, 
> in test_cogroup
> self._test_func(input, func, expected, sort=True, input2=input2)
>   File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, 
> in _test_func
> self.assertEqual(expected, result)
> AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != []
> First list contains 3 additional elements.
> First extra element 0:
> [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))]
> + []
> - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))],
> -  [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))],
> -  [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]]
> {code}
> It also happened on Jenkins: 
> http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20156) Java String toLowerCase "Turkish locale bug" causes Spark problems

2017-04-10 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-20156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963406#comment-15963406
 ] 

Serkan Taş commented on SPARK-20156:


Thank you Sean, it was more than I expected, I think.

> Java String toLowerCase "Turkish locale bug" causes Spark problems
> --
>
> Key: SPARK-20156
> URL: https://issues.apache.org/jira/browse/SPARK-20156
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.1.0
> Environment: Ubuntu 16.04
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
>Reporter: Serkan Taş
>Assignee: Sean Owen
> Fix For: 2.2.0
>
> Attachments: sprk_shell.txt
>
>
> If the regional setting of the operating system is Turkish, the famous Java 
> locale problem occurs (https://jira.atlassian.com/browse/CONF-5931 or 
> https://issues.apache.org/jira/browse/AVRO-1493). 
> e.g : 
> "SERDEINFO" lowers to "serdeınfo"
> "uniquetable" uppers to "UNİQUETABLE"
> work around : 
> add -Duser.country=US -Duser.language=en to the end of the line 
> SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dscala.usejavacp=true"
> in spark-shell.sh
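
A minimal sketch that reproduces the behaviour described above at the JVM level, independent of Spark:
{code}
import java.util.Locale

// Under a Turkish locale, 'I' lower-cases to the dotless 'ı', so identifier
// comparisons such as "SERDEINFO".toLowerCase no longer round-trip.
val turkish = new Locale("tr", "TR")
println("SERDEINFO".toLowerCase(turkish))      // serdeınfo
println("SERDEINFO".toLowerCase(Locale.ROOT))  // serdeinfo
{code}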



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2017-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963402#comment-15963402
 ] 

Apache Spark commented on SPARK-12837:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17596

> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.0.0
>
>
> Executing a sql statement with a large number of partitions requires a high 
> memory space for the driver even there are no requests to collect data back 
> to the driver.
> Here are steps to re-produce the issue.
> 1. Start spark shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
>  "toto2" ) // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2017-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12837:


Assignee: Apache Spark  (was: Wenchen Fan)

> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>Assignee: Apache Spark
>Priority: Critical
> Fix For: 2.0.0
>
>
> Executing a sql statement with a large number of partitions requires a high 
> memory space for the driver even there are no requests to collect data back 
> to the driver.
> Here are steps to re-produce the issue.
> 1. Start spark shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
>  "toto2" ) // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2017-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12837:


Assignee: Wenchen Fan  (was: Apache Spark)

> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.0.0
>
>
> Executing a sql statement with a large number of partitions requires a high 
> memory space for the driver even there are no requests to collect data back 
> to the driver.
> Here are steps to re-produce the issue.
> 1. Start spark shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
>  "toto2" ) // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20284) Make SerializationStream and DeserializationStream extend Closeable

2017-04-10 Thread Sergei Lebedev (JIRA)
Sergei Lebedev created SPARK-20284:
--

 Summary: Make SerializationStream and DeserializationStream extend 
Closeable
 Key: SPARK-20284
 URL: https://issues.apache.org/jira/browse/SPARK-20284
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.0, 1.6.3
Reporter: Sergei Lebedev
Priority: Minor


Both {{SerializationStream}} and {{DeserializationStream}} implement {{close}} 
but do not extend {{Closeable}}. As a result, these streams cannot be used in 
try-with-resources.

Was this intentional or rather nobody ever needed that?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20156) Java String toLowerCase "Turkish locale bug" causes Spark problems

2017-04-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20156.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17527
[https://github.com/apache/spark/pull/17527]

> Java String toLowerCase "Turkish locale bug" causes Spark problems
> --
>
> Key: SPARK-20156
> URL: https://issues.apache.org/jira/browse/SPARK-20156
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.1.0
> Environment: Ubuntu 16.04
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
>Reporter: Serkan Taş
>Assignee: Sean Owen
> Fix For: 2.2.0
>
> Attachments: sprk_shell.txt
>
>
> If the regional setting of the operating system is Turkish, the famous Java 
> locale problem occurs (https://jira.atlassian.com/browse/CONF-5931 or 
> https://issues.apache.org/jira/browse/AVRO-1493). 
> e.g : 
> "SERDEINFO" lowers to "serdeınfo"
> "uniquetable" uppers to "UNİQUETABLE"
> work around : 
> add -Duser.country=US -Duser.language=en to the end of the line 
> SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dscala.usejavacp=true"
> in spark-shell.sh



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19352) Sorting issues on relatively big datasets

2017-04-10 Thread Charles Pritchard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963367#comment-15963367
 ] 

Charles Pritchard commented on SPARK-19352:
---

Does this fix the issue in SPARK-18934 ?

> Sorting issues on relatively big datasets
> -
>
> Key: SPARK-19352
> URL: https://issues.apache.org/jira/browse/SPARK-19352
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: Spark version 2.1.0
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_102
> macOS 10.12.3
>Reporter: Ivan Gozali
>
> _More details, including the script to generate the synthetic dataset 
> (requires pandas and numpy) are in this GitHub gist._
> https://gist.github.com/igozali/d327a85646abe7ab10c2ae479bed431f
> Given a relatively large synthetic time series dataset of various users 
> (4.1GB), when attempting to:
> * partition this dataset by user ID
> * sort the time series data for each user by timestamp
> * write each partition to a single CSV file
> then some files are unsorted in a very specific manner. In one of the 
> supposedly sorted files, the rows looked as follows:
> {code}
> 2014-01-01T00:00:00.000-08:00,-0.07,0.39,-0.39
> 2014-12-31T02:07:30.000-08:00,0.34,-0.62,-0.22
> 2014-01-01T00:00:05.000-08:00,-0.07,-0.52,0.47
> 2014-12-31T02:07:35.000-08:00,-0.15,-0.13,-0.14
> 2014-01-01T00:00:10.000-08:00,-1.31,-1.17,2.24
> 2014-12-31T02:07:40.000-08:00,-1.28,0.88,-0.43
> {code}
> The above is attempted using the following Scala/Spark code:
> {code}
> val inpth = "/tmp/gen_data_3cols_small"
> spark
> .read
> .option("inferSchema", "true")
> .option("header", "true")
> .csv(inpth)
> .repartition($"userId")
> .sortWithinPartitions("timestamp")
> .write
> .partitionBy("userId")
> .option("header", "true")
> .csv(inpth + "_sorted")
> {code}
> This issue is not seen when using a smaller sized dataset by making the time 
> span smaller (354MB, with the same number of columns).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19352) Sorting issues on relatively big datasets

2017-04-10 Thread Charles Pritchard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963365#comment-15963365
 ] 

Charles Pritchard commented on SPARK-19352:
---

[~cloud_fan] Yes, Hive relies on sorting optimizations for running map-side 
joins. DISTRIBUTE BY and SORT BY can be used to manually output data into 
single sorted files per partition.
Hive will ensure sorting when running INSERT OVERWRITE statements, when a table 
is created with PARTITIONED BY... CLUSTERED BY... SORTED BY ... INTO 1 BUCKETS.

Spark also reads the Hive metastore to detect when files are already sorted, 
and runs optimizations.
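
As a rough sketch only (the temp view name is made up, and this is not verified against the bug reported below), the DISTRIBUTE BY / SORT BY approach mentioned above can be expressed through Spark SQL on the same data:
{code}
// Sketch: shuffle by userId and sort within each shuffle partition via SQL,
// then write partitioned output. Note this does not by itself guarantee a
// single output file per userId.
spark.read.option("header", "true").option("inferSchema", "true")
  .csv("/tmp/gen_data_3cols_small")
  .createOrReplaceTempView("events")

spark.sql("SELECT * FROM events DISTRIBUTE BY userId SORT BY timestamp")
  .write
  .partitionBy("userId")
  .option("header", "true")
  .csv("/tmp/gen_data_3cols_small_sorted")
{code}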

> Sorting issues on relatively big datasets
> -
>
> Key: SPARK-19352
> URL: https://issues.apache.org/jira/browse/SPARK-19352
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: Spark version 2.1.0
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_102
> macOS 10.12.3
>Reporter: Ivan Gozali
>
> _More details, including the script to generate the synthetic dataset 
> (requires pandas and numpy) are in this GitHub gist._
> https://gist.github.com/igozali/d327a85646abe7ab10c2ae479bed431f
> Given a relatively large synthetic time series dataset of various users 
> (4.1GB), when attempting to:
> * partition this dataset by user ID
> * sort the time series data for each user by timestamp
> * write each partition to a single CSV file
> then some files are unsorted in a very specific manner. In one of the 
> supposedly sorted files, the rows looked as follows:
> {code}
> 2014-01-01T00:00:00.000-08:00,-0.07,0.39,-0.39
> 2014-12-31T02:07:30.000-08:00,0.34,-0.62,-0.22
> 2014-01-01T00:00:05.000-08:00,-0.07,-0.52,0.47
> 2014-12-31T02:07:35.000-08:00,-0.15,-0.13,-0.14
> 2014-01-01T00:00:10.000-08:00,-1.31,-1.17,2.24
> 2014-12-31T02:07:40.000-08:00,-1.28,0.88,-0.43
> {code}
> The above is attempted using the following Scala/Spark code:
> {code}
> val inpth = "/tmp/gen_data_3cols_small"
> spark
> .read
> .option("inferSchema", "true")
> .option("header", "true")
> .csv(inpth)
> .repartition($"userId")
> .sortWithinPartitions("timestamp")
> .write
> .partitionBy("userId")
> .option("header", "true")
> .csv(inpth + "_sorted")
> {code}
> This issue is not seen with a smaller dataset produced by shrinking the time 
> span (354MB, with the same number of columns).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19067) mapGroupsWithState - arbitrary stateful operations with Structured Streaming (similar to DStream.mapWithState)

2017-04-10 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963353#comment-15963353
 ] 

Michael Armbrust commented on SPARK-19067:
--

No, this will be available in Spark 2.2.0

> mapGroupsWithState - arbitrary stateful operations with Structured Streaming 
> (similar to DStream.mapWithState)
> --
>
> Key: SPARK-19067
> URL: https://issues.apache.org/jira/browse/SPARK-19067
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Tathagata Das
>Priority: Critical
> Fix For: 2.2.0
>
>
> Right now the only way to do stateful operations is with Aggregator or 
> UDAF.  However, this does not give users control over emission or expiration of 
> state, making it hard to implement things like sessionization.  We should add 
> a more general construct (probably similar to {{DStream.mapWithState}}) to 
> structured streaming. Here is the design. 
> *Requirements*
> - Users should be able to specify a function that can do the following
> - Access the input row corresponding to a key
> - Access the previous state corresponding to a key
> - Optionally, update or remove the state
> - Output any number of new rows (or none at all)
> *Proposed API*
> {code}
> //  New methods on KeyValueGroupedDataset 
> class KeyValueGroupedDataset[K, V] {
>   // Scala friendly
>   def mapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], GroupState[S]) => U)
>   def flatMapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], GroupState[S]) => Iterator[U])
>   // Java friendly
>   def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
>   def flatMapGroupsWithState[S, U](func: FlatMapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
> }
> // --- New Java-friendly function classes --- 
> public interface MapGroupsWithStateFunction<K, V, S, R> extends Serializable {
>   R call(K key, Iterator<V> values, GroupState<S> state) throws Exception;
> }
> public interface FlatMapGroupsWithStateFunction<K, V, S, R> extends Serializable {
>   Iterator<R> call(K key, Iterator<V> values, GroupState<S> state) throws Exception;
> }
> // -- Wrapper class for state data -- 
> trait GroupState[S] {
>   def exists(): Boolean
>   def get(): S                 // throws an exception if the state does not exist
>   def getOption(): Option[S]
>   def update(newState: S): Unit
>   def remove(): Unit           // exists() will be false after this
> }
> {code}
> Key Semantics of the State class
> - The state can be null.
> - If state.remove() is called, then state.exists() will return false, and 
> getOption will return None.
> - After state.update(newState) is called, state.exists() will 
> return true, and getOption will return Some(...). 
> - None of the operations are thread-safe. This is to avoid memory barriers.
> *Usage*
> {code}
> val stateFunc = (word: String, words: Iterator[String], runningCount: GroupState[Long]) => {
>   val newCount = words.size + runningCount.getOption.getOrElse(0L)
>   runningCount.update(newCount)
>   (word, newCount)
> }
> dataset                                                 // type is Dataset[String]
>   .groupByKey[String](w => w)                           // generates KeyValueGroupedDataset[String, String]
>   .mapGroupsWithState[Long, (String, Long)](stateFunc)  // returns Dataset[(String, Long)]
> {code}
> *Future Directions*
> - Timeout-based state expiration (for state that has not received data for a while) - 
> Done
> - General expression-based expiration - TODO. Any real use cases that cannot 
> be done with timeouts?
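For illustration, a minimal sessionization-style sketch against the Scala API 
proposed above, assuming the released 2.2 signatures match the proposal; the 
{{Event}} and {{SessionInfo}} case classes and the {{events}} Dataset are 
hypothetical stand-ins.

{code}
import org.apache.spark.sql.streaming.GroupState
import spark.implicits._   // assumes a spark-shell style SparkSession named `spark`

case class Event(userId: String, timestamp: Long)
case class SessionInfo(numEvents: Int, lastSeen: Long)

// Fold each user's incoming events into the running state and emit a summary row.
val sessionFunc = (userId: String, events: Iterator[Event], state: GroupState[SessionInfo]) => {
  val old = state.getOption.getOrElse(SessionInfo(0, 0L))
  val batch = events.toSeq
  val updated = SessionInfo(old.numEvents + batch.size,
    (old.lastSeen +: batch.map(_.timestamp)).max)
  state.update(updated)
  (userId, updated.numEvents, updated.lastSeen)
}

events                                   // Dataset[Event]
  .groupByKey(_.userId)                  // KeyValueGroupedDataset[String, Event]
  .mapGroupsWithState(sessionFunc)       // Dataset[(String, Int, Long)]
{code}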



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20283) Add preOptimizationBatches

2017-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963351#comment-15963351
 ] 

Apache Spark commented on SPARK-20283:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/17595

> Add preOptimizationBatches
> --
>
> Key: SPARK-20283
> URL: https://issues.apache.org/jira/browse/SPARK-20283
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> We currently have postHocOptimizationBatches, but not preOptimizationBatches. 
> Let's add it so it is symmetric.
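Not the new hook itself, but for context, a hedged sketch of the existing public 
way to slot extra rules into the optimizer in Spark 2.x 
({{spark.experimental.extraOptimizations}}); the rule below is a no-op used purely 
for plan debugging.

{code}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Logs every plan that reaches the extra-optimizations batch and returns it unchanged.
object LogPlanRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    logInfo(s"Plan entering extra optimizations:\n$plan")
    plan
  }
}

spark.experimental.extraOptimizations ++= Seq(LogPlanRule)
{code}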



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20283) Add preOptimizationBatches

2017-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20283:


Assignee: Reynold Xin  (was: Apache Spark)

> Add preOptimizationBatches
> --
>
> Key: SPARK-20283
> URL: https://issues.apache.org/jira/browse/SPARK-20283
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> We currently have postHocOptimizationBatches, but not preOptimizationBatches. 
> This patch adds preOptimizationBatches so the optimizer debugging extensions 
> are symmetric.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20283) Add preOptimizationBatches

2017-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20283:


Assignee: Apache Spark  (was: Reynold Xin)

> Add preOptimizationBatches
> --
>
> Key: SPARK-20283
> URL: https://issues.apache.org/jira/browse/SPARK-20283
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> We currently have postHocOptimizationBatches, but not preOptimizationBatches. 
> Let's add it so it is symmetric.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20283) Add preOptimizationBatches

2017-04-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-20283:

Description: 
We currently have postHocOptimizationBatches, but not preOptimizationBatches. 
This patch adds preOptimizationBatches so the optimizer debugging extensions 
are symmetric.



  was:
We currently have postHocOptimizationBatches, but not preOptimizationBatches. 
Let's add it so it is symmetric.



> Add preOptimizationBatches
> --
>
> Key: SPARK-20283
> URL: https://issues.apache.org/jira/browse/SPARK-20283
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> We currently have postHocOptimizationBatches, but not preOptimizationBatches. 
> This patch adds preOptimizationBatches so the optimizer debugging extensions 
> are symmetric.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20283) Add preOptimizationBatches

2017-04-10 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-20283:
---

 Summary: Add preOptimizationBatches
 Key: SPARK-20283
 URL: https://issues.apache.org/jira/browse/SPARK-20283
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.1.0
Reporter: Reynold Xin
Assignee: Reynold Xin


We currently have postHocOptimizationBatches, but not preOptimizationBatches. 
Let's add it so it is symmetric.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20282) Flaky test: org.apache.spark.sql.streaming/StreamingQuerySuite/OneTime_trigger__commit_log__and_exception

2017-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20282:


Assignee: (was: Apache Spark)

> Flaky test: 
> org.apache.spark.sql.streaming/StreamingQuerySuite/OneTime_trigger__commit_log__and_exception
> -
>
> Key: SPARK-20282
> URL: https://issues.apache.org/jira/browse/SPARK-20282
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming, Tests
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Priority: Minor
>  Labels: flaky-test
>
> I saw the following failure several times:
> {code}
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
> Assert on query failed: 
> == Progress ==
>AssertOnQuery(, )
>StopStream
>AddData to MemoryStream[value#30891]: 1,2
>
> StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@35cdc93a,Map())
>CheckAnswer: [6],[3]
>StopStream
> => AssertOnQuery(, )
>AssertOnQuery(, )
>StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@cdb247d,Map())
>CheckAnswer: [6],[3]
>StopStream
>AddData to MemoryStream[value#30891]: 3
>
> StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@55394e4d,Map())
>CheckLastBatch: [2]
>StopStream
>AddData to MemoryStream[value#30891]: 0
>
> StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@749aa997,Map())
>ExpectFailure[org.apache.spark.SparkException, isFatalError: false]
>AssertOnQuery(, )
>AssertOnQuery(, incorrect start offset or end offset on 
> exception)
> == Stream ==
> Output Mode: Append
> Stream state: not started
> Thread state: dead
> == Sink ==
> 0: [6] [3]
> == Plan ==
>  
>  
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
>   at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
>   at 
> org.apache.spark.sql.streaming.StreamTest$class.failTest$1(StreamTest.scala:347)
>   at 
> org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:318)
>   at 
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:483)
>   at 
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:357)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.sql.streaming.StreamTest$class.liftedTree1$1(StreamTest.scala:357)
>   at 
> org.apache.spark.sql.streaming.StreamTest$class.testStream(StreamTest.scala:356)
>   at 
> org.apache.spark.sql.streaming.StreamingQuerySuite.testStream(StreamingQuerySuite.scala:41)
>   at 
> org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply$mcV$sp(StreamingQuerySuite.scala:166)
>   at 
> org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161)
>   at 
> org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161)
>   at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply$mcV$sp(SQLTestUtils.scala:268)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.sql.streaming.StreamingQuerySuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StreamingQuerySuite.scala:41)
>

[jira] [Assigned] (SPARK-20282) Flaky test: org.apache.spark.sql.streaming/StreamingQuerySuite/OneTime_trigger__commit_log__and_exception

2017-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20282:


Assignee: Apache Spark

> Flaky test: 
> org.apache.spark.sql.streaming/StreamingQuerySuite/OneTime_trigger__commit_log__and_exception
> -
>
> Key: SPARK-20282
> URL: https://issues.apache.org/jira/browse/SPARK-20282
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming, Tests
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Minor
>  Labels: flaky-test
>
> I saw the following failure several times:
> {code}
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
> Assert on query failed: 
> == Progress ==
>AssertOnQuery(, )
>StopStream
>AddData to MemoryStream[value#30891]: 1,2
>
> StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@35cdc93a,Map())
>CheckAnswer: [6],[3]
>StopStream
> => AssertOnQuery(, )
>AssertOnQuery(, )
>StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@cdb247d,Map())
>CheckAnswer: [6],[3]
>StopStream
>AddData to MemoryStream[value#30891]: 3
>
> StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@55394e4d,Map())
>CheckLastBatch: [2]
>StopStream
>AddData to MemoryStream[value#30891]: 0
>
> StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@749aa997,Map())
>ExpectFailure[org.apache.spark.SparkException, isFatalError: false]
>AssertOnQuery(, )
>AssertOnQuery(, incorrect start offset or end offset on 
> exception)
> == Stream ==
> Output Mode: Append
> Stream state: not started
> Thread state: dead
> == Sink ==
> 0: [6] [3]
> == Plan ==
>  
>  
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
>   at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
>   at 
> org.apache.spark.sql.streaming.StreamTest$class.failTest$1(StreamTest.scala:347)
>   at 
> org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:318)
>   at 
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:483)
>   at 
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:357)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.sql.streaming.StreamTest$class.liftedTree1$1(StreamTest.scala:357)
>   at 
> org.apache.spark.sql.streaming.StreamTest$class.testStream(StreamTest.scala:356)
>   at 
> org.apache.spark.sql.streaming.StreamingQuerySuite.testStream(StreamingQuerySuite.scala:41)
>   at 
> org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply$mcV$sp(StreamingQuerySuite.scala:166)
>   at 
> org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161)
>   at 
> org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161)
>   at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply$mcV$sp(SQLTestUtils.scala:268)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> 

[jira] [Commented] (SPARK-20282) Flaky test: org.apache.spark.sql.streaming/StreamingQuerySuite/OneTime_trigger__commit_log__and_exception

2017-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963341#comment-15963341
 ] 

Apache Spark commented on SPARK-20282:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/17594

> Flaky test: 
> org.apache.spark.sql.streaming/StreamingQuerySuite/OneTime_trigger__commit_log__and_exception
> -
>
> Key: SPARK-20282
> URL: https://issues.apache.org/jira/browse/SPARK-20282
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming, Tests
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Priority: Minor
>  Labels: flaky-test
>
> I saw the following failure several times:
> {code}
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
> Assert on query failed: 
> == Progress ==
>AssertOnQuery(, )
>StopStream
>AddData to MemoryStream[value#30891]: 1,2
>
> StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@35cdc93a,Map())
>CheckAnswer: [6],[3]
>StopStream
> => AssertOnQuery(, )
>AssertOnQuery(, )
>StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@cdb247d,Map())
>CheckAnswer: [6],[3]
>StopStream
>AddData to MemoryStream[value#30891]: 3
>
> StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@55394e4d,Map())
>CheckLastBatch: [2]
>StopStream
>AddData to MemoryStream[value#30891]: 0
>
> StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@749aa997,Map())
>ExpectFailure[org.apache.spark.SparkException, isFatalError: false]
>AssertOnQuery(, )
>AssertOnQuery(, incorrect start offset or end offset on 
> exception)
> == Stream ==
> Output Mode: Append
> Stream state: not started
> Thread state: dead
> == Sink ==
> 0: [6] [3]
> == Plan ==
>  
>  
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
>   at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
>   at 
> org.apache.spark.sql.streaming.StreamTest$class.failTest$1(StreamTest.scala:347)
>   at 
> org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:318)
>   at 
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:483)
>   at 
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:357)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.sql.streaming.StreamTest$class.liftedTree1$1(StreamTest.scala:357)
>   at 
> org.apache.spark.sql.streaming.StreamTest$class.testStream(StreamTest.scala:356)
>   at 
> org.apache.spark.sql.streaming.StreamingQuerySuite.testStream(StreamingQuerySuite.scala:41)
>   at 
> org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply$mcV$sp(StreamingQuerySuite.scala:166)
>   at 
> org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161)
>   at 
> org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161)
>   at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply$mcV$sp(SQLTestUtils.scala:268)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> 

[jira] [Updated] (SPARK-20282) Flaky test: org.apache.spark.sql.streaming/StreamingQuerySuite/OneTime_trigger__commit_log__and_exception

2017-04-10 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20282:
-
Issue Type: Test  (was: Bug)

> Flaky test: 
> org.apache.spark.sql.streaming/StreamingQuerySuite/OneTime_trigger__commit_log__and_exception
> -
>
> Key: SPARK-20282
> URL: https://issues.apache.org/jira/browse/SPARK-20282
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming, Tests
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Priority: Minor
>  Labels: flaky-test
>
> I saw the following failure several times:
> {code}
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
> Assert on query failed: 
> == Progress ==
>AssertOnQuery(, )
>StopStream
>AddData to MemoryStream[value#30891]: 1,2
>
> StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@35cdc93a,Map())
>CheckAnswer: [6],[3]
>StopStream
> => AssertOnQuery(, )
>AssertOnQuery(, )
>StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@cdb247d,Map())
>CheckAnswer: [6],[3]
>StopStream
>AddData to MemoryStream[value#30891]: 3
>
> StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@55394e4d,Map())
>CheckLastBatch: [2]
>StopStream
>AddData to MemoryStream[value#30891]: 0
>
> StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@749aa997,Map())
>ExpectFailure[org.apache.spark.SparkException, isFatalError: false]
>AssertOnQuery(, )
>AssertOnQuery(, incorrect start offset or end offset on 
> exception)
> == Stream ==
> Output Mode: Append
> Stream state: not started
> Thread state: dead
> == Sink ==
> 0: [6] [3]
> == Plan ==
>  
>  
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
>   at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
>   at 
> org.apache.spark.sql.streaming.StreamTest$class.failTest$1(StreamTest.scala:347)
>   at 
> org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:318)
>   at 
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:483)
>   at 
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:357)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.sql.streaming.StreamTest$class.liftedTree1$1(StreamTest.scala:357)
>   at 
> org.apache.spark.sql.streaming.StreamTest$class.testStream(StreamTest.scala:356)
>   at 
> org.apache.spark.sql.streaming.StreamingQuerySuite.testStream(StreamingQuerySuite.scala:41)
>   at 
> org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply$mcV$sp(StreamingQuerySuite.scala:166)
>   at 
> org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161)
>   at 
> org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161)
>   at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply$mcV$sp(SQLTestUtils.scala:268)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.sql.streaming.StreamingQuerySuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StreamingQuerySuite.scala:41)
>   at 
> 

[jira] [Created] (SPARK-20282) Flaky test: org.apache.spark.sql.streaming/StreamingQuerySuite/OneTime_trigger__commit_log__and_exception

2017-04-10 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-20282:


 Summary: Flaky test: 
org.apache.spark.sql.streaming/StreamingQuerySuite/OneTime_trigger__commit_log__and_exception
 Key: SPARK-20282
 URL: https://issues.apache.org/jira/browse/SPARK-20282
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming, Tests
Affects Versions: 2.2.0
Reporter: Shixiong Zhu
Priority: Minor


I saw the following failure several times:
{code}
sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
Assert on query failed: 

== Progress ==
   AssertOnQuery(, )
   StopStream
   AddData to MemoryStream[value#30891]: 1,2
   StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@35cdc93a,Map())
   CheckAnswer: [6],[3]
   StopStream
=> AssertOnQuery(, )
   AssertOnQuery(, )
   StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@cdb247d,Map())
   CheckAnswer: [6],[3]
   StopStream
   AddData to MemoryStream[value#30891]: 3
   StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@55394e4d,Map())
   CheckLastBatch: [2]
   StopStream
   AddData to MemoryStream[value#30891]: 0
   StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@749aa997,Map())
   ExpectFailure[org.apache.spark.SparkException, isFatalError: false]
   AssertOnQuery(, )
   AssertOnQuery(, incorrect start offset or end offset on exception)

== Stream ==
Output Mode: Append
Stream state: not started
Thread state: dead


== Sink ==
0: [6] [3]


== Plan ==

 
 
at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
at 
org.apache.spark.sql.streaming.StreamTest$class.failTest$1(StreamTest.scala:347)
at 
org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:318)
at 
org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:483)
at 
org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:357)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.sql.streaming.StreamTest$class.liftedTree1$1(StreamTest.scala:357)
at 
org.apache.spark.sql.streaming.StreamTest$class.testStream(StreamTest.scala:356)
at 
org.apache.spark.sql.streaming.StreamingQuerySuite.testStream(StreamingQuerySuite.scala:41)
at 
org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply$mcV$sp(StreamingQuerySuite.scala:166)
at 
org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161)
at 
org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161)
at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42)
at 
org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply$mcV$sp(SQLTestUtils.scala:268)
at 
org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268)
at 
org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
at 
org.apache.spark.sql.streaming.StreamingQuerySuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StreamingQuerySuite.scala:41)
at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
at 
org.apache.spark.sql.streaming.StreamingQuerySuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingQuerySuite.scala:41)
at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
at 
org.apache.spark.sql.streaming.StreamingQuerySuite.runTest(StreamingQuerySuite.scala:41)
at 

[jira] [Comment Edited] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2017-04-10 Thread Ryan Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963304#comment-15963304
 ] 

Ryan Williams edited comment on SPARK-650 at 4/10/17 6:42 PM:
--

Both suggested workarounds here are lacking or broken / actively harmful, 
afaict, and the use case is real and valid.

The ADAM project struggled for >2 years with this problem:

- [a 3rd-party {{OutputFormat}} required this field to be 
set|https://github.com/HadoopGenomics/Hadoop-BAM/blob/eb688fb90c60e8c956f9d1e4793fea01e3164056/src/main/java/org/seqdoop/hadoop_bam/KeyIgnoringAnySAMOutputFormat.java#L93]
- the value of the field is computed on the driver, and needs to somehow be 
sent to and set in each executor JVM.

h3. {{mapPartitions}} hack

[Some attempts to set the field via a dummy {{mapPartitions}} 
job|https://github.com/hammerlab/adam/blob/b87bfb72c7411b5ea088b12334aa1b548102eb4b/adam-core/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDFunctions.scala#L134-L146]
 actually added [pernicious, non-deterministic 
bugs|https://github.com/bigdatagenomics/adam/issues/676#issuecomment-219347677].

In general Spark seems to provide no guarantees that ≥1 tasks will get 
scheduled on each executor in such a situation:

- in the above, node locality resulted in some executors being missed
- dynamic-allocation also offers chances for executors to come online later and 
never be initialized

h3. object/singleton initialization

How can one use singleton initialization to pass an object from the driver to 
each executor? Maybe I've missed this in the discussion above.

In the end, ADAM decided to write the object to a file and route that file's 
path to the {{OutputFormat}} via a hadoop configuration value, which is pretty 
inelegant.

h4. Another use case

I have another need for this atm where regular lazy-object-initialization is 
also insufficient: [due to a rough-edge in Scala programs' classloader 
configuration, {{FileSystemProvider}}'s in user JARs are not loaded 
properly|https://github.com/scala/bug/issues/10247].

[A workaround discussed in the 1st post on that issue fixes the 
problem|https://github.com/hammerlab/spark-commands/blob/1.0.3/src/main/scala/org/hammerlab/commands/FileSystems.scala#L8-L20],
 but needs to be run before {{FileSystemProvider.installedProviders}} is first 
called on the JVM, which can be triggered by numerous {{java.nio.file}} 
operations.

I don't see a clear way to work in code that will always lazily call my 
{{FileSystems.load}} function on each executor, let alone ensure that it 
happens before any code in the JAR calls e.g. {{Paths.get}}.


was (Author: rdub):
Both suggested workarounds here are lacking or broken / actively harmful, 
afaict, and the use case is real and valid.

The ADAM project struggled for >2 years with this problem:

- [a 3rd-party {{OutputFormat}} required this field to be 
set|https://github.com/HadoopGenomics/Hadoop-BAM/blob/eb688fb90c60e8c956f9d1e4793fea01e3164056/src/main/java/org/seqdoop/hadoop_bam/KeyIgnoringAnySAMOutputFormat.java#L93]
- the value of the field is computed on the driver, and needs to somehow be 
sent to and set in each executor JVM.

h3. {{mapPartitions}} hack

[Some attempts to set the field via a dummy {{mapPartitions}} 
job|https://github.com/hammerlab/adam/blob/b87bfb72c7411b5ea088b12334aa1b548102eb4b/adam-core/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDFunctions.scala#L134-L146]
 actually added [pernicious, non-deterministic 
bugs|https://github.com/bigdatagenomics/adam/issues/676#issuecomment-219347677].

In general Spark seems to provide no guarantees that ≥1 tasks will get 
scheduled on each executor in such a situation:

- in the above, node locality resulted in some executors being missed
- dynamic-allocation also offers chances for executors to come online later and 
never be initialized

h3. object/singleton initialization

How can one use singleton initialization to pass an object from the driver to 
each executor? Maybe I've missed this in the discussion above.

In the end, ADAM decided to write the object to a file and route that file's 
path to the {{OutputFormat}} via a hadoop configuration value, which is pretty 
inelegant.

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org

[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2017-04-10 Thread Ryan Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963304#comment-15963304
 ] 

Ryan Williams commented on SPARK-650:
-

Both suggested workarounds here are lacking or broken / actively harmful, 
afaict, and the use case is real and valid.

The ADAM project struggled for >2 years with this problem:

- [a 3rd-party {{OutputFormat}} required this field to be 
set|https://github.com/HadoopGenomics/Hadoop-BAM/blob/eb688fb90c60e8c956f9d1e4793fea01e3164056/src/main/java/org/seqdoop/hadoop_bam/KeyIgnoringAnySAMOutputFormat.java#L93]
- the value of the field is computed on the driver, and needs to somehow be 
sent to and set in each executor JVM.

h3. {{mapPartitions}} hack

[Some attempts to set the field via a dummy {{mapPartitions}} 
job|https://github.com/hammerlab/adam/blob/b87bfb72c7411b5ea088b12334aa1b548102eb4b/adam-core/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDFunctions.scala#L134-L146]
 actually added [pernicious, non-deterministic 
bugs|https://github.com/bigdatagenomics/adam/issues/676#issuecomment-219347677].

In general Spark seems to provide no guarantees that ≥1 tasks will get 
scheduled on each executor in such a situation:

- in the above, node locality resulted in some executors being missed
- dynamic-allocation also offers chances for executors to come online later and 
never be initialized

h3. object/singleton initialization

How can one use singleton initialization to pass an object from the driver to 
each executor? Maybe I've missed this in the discussion above.

In the end, ADAM decided to write the object to a file and route that file's 
path to the {{OutputFormat}} via a hadoop configuration value, which is pretty 
inelegant.
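For concreteness, a hedged sketch of the pattern being criticized above: a 
per-executor-JVM singleton guarding one-time setup, driven by a best-effort dummy 
job. The {{System.setProperty}} call is a stand-in for the real side effect (e.g. 
setting the static field a third-party OutputFormat reads), and as noted, Spark 
does not guarantee that every executor will run one of these tasks.

{code}
// Hypothetical per-JVM one-time initialization, guarded by a singleton.
object ExecutorInit {
  @volatile private var done = false
  def ensure(header: String): Unit = synchronized {
    if (!done) {
      System.setProperty("example.sam.header", header)  // stand-in side effect
      done = true
    }
  }
}

val headerBc = sc.broadcast("header computed on the driver")  // hypothetical value

// Best-effort "touch every executor" job; with locality preferences or dynamic
// allocation, some executors may never run one of these tasks.
sc.parallelize(1 to sc.defaultParallelism * 4, sc.defaultParallelism * 4)
  .foreachPartition { _ => ExecutorInit.ensure(headerBc.value) }
{code}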

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos

2017-04-10 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963136#comment-15963136
 ] 

Marcelo Vanzin commented on SPARK-16742:


bq. The problem is then that a kerberos-authenticated user submitting their job 
would be unaware that their credentials are being leaked to other users.

That's the gist of it, yes. But note that it isn't restricted to files. If all 
the user processes are running as the same user, one can just dump the other's 
heap, or connect using JVMTI, and get the credentials. Same problem.

The most basic feature needed for any kerberos-related work is user isolation 
(different users cannot mess with each others' processes). I was under the 
impression that Mesos supported that.

bq. I'm assuming that hadoop.security.auth_to_local is what maps the Kerberos 
user to the Unix user...

I'm not exactly familiar with all the YARN settings but yes, the result you get 
is that the submitting user runs YARN containers as their own user (not as some 
generic, shared user). Without that, you shouldn't even bother thinking about 
inserting Kerberos in the picture, IMO.

bq. We avoid the shared-file problem for keytabs entirely

See my first comment above, that's not enough.

bq. We're probably going to punt on cluster mode for now

You don't need to punt on cluster mode. I don't know where this notion that 
cluster mode requires you to distribute keytabs comes from; Spark works just 
fine in YARN cluster mode without distributing keytabs. All you need to 
distribute are delegation tokens. Keytabs aren't even necessary to log in and 
submit the app at all (you can use passwords with kinit, after all).

The only thing distributing keytabs buys you is running applications for longer 
than the delegation tokens' max lifetime (normally 7 days by default).

bq. If you see any blockers

Lack of user isolation is always a blocker; without that there's no way to 
prevent one user from seeing another's credentials. But I've asked this in the 
past and the answer I got is that Mesos supports it...

> Kerberos support for Spark on Mesos
> ---
>
> Key: SPARK-16742
> URL: https://issues.apache.org/jira/browse/SPARK-16742
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Reporter: Michael Gummelt
>
> We at Mesosphere have written Kerberos support for Spark on Mesos.  We'll be 
> contributing it to Apache Spark soon.
> Mesosphere design doc: 
> https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6
> Mesosphere code: 
> https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20273) Disallow Non-deterministic Filter push-down into Join Conditions

2017-04-10 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20273.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Disallow Non-deterministic Filter push-down into Join Conditions
> 
>
> Key: SPARK-20273
> URL: https://issues.apache.org/jira/browse/SPARK-20273
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> {noformat}
> sql("SELECT t1.b, rand(0) as r FROM cachedData, cachedData t1 GROUP BY t1.b 
> having r > 0.5").show()
> {noformat}
> We will get the following error:
> {noformat}
> Job aborted due to stage failure: Task 1 in stage 4.0 failed 1 times, most 
> recent failure: Lost task 1.0 in stage 4.0 (TID 8, localhost, executor 
> driver): java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
> {noformat}
> Filters could be pushed down to the join conditions by the optimizer rule 
> {{PushPredicateThroughJoin}}. However, the analyzer blocks users from adding 
> non-deterministic conditions (for details, see the PR 
> https://github.com/apache/spark/pull/7535). 
> We should not push down non-deterministic conditions; otherwise, we should 
> allow users to do it by explicitly initializing the non-deterministic 
> expressions.
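As a hedged illustration of that last point (not the patch itself): one way for a 
user to make the non-deterministic value explicit is to materialize it as a named 
column in a subquery and filter on the result, so the predicate no longer contains 
the non-deterministic expression directly. {{cachedData}} is the temp view from the 
example above; this rewrite has not been verified against the affected versions.

{code}
// Sketch only: compute rand(0) once as a column, then filter on that column.
sql("""
  SELECT b, r FROM (
    SELECT t1.b, rand(0) AS r
    FROM cachedData, cachedData t1
    GROUP BY t1.b
  ) tmp
  WHERE r > 0.5
""").show()
{code}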



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19518) IGNORE NULLS in first_value / last_value should be supported in SQL statements

2017-04-10 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-19518.
---
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 2.2.0

> IGNORE NULLS in first_value / last_value should be supported in SQL statements
> --
>
> Key: SPARK-19518
> URL: https://issues.apache.org/jira/browse/SPARK-19518
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ferenc Erdelyi
>Assignee: Hyukjin Kwon
> Fix For: 2.2.0
>
>
> https://issues.apache.org/jira/browse/SPARK-13049 was implemented in Spark2, 
> however it does not work in SQL statements as it is not implemented in Hive 
> yet: https://issues.apache.org/jira/browse/HIVE-11189



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20243) DebugFilesystem.assertNoOpenStreams thread race

2017-04-10 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-20243.
---
   Resolution: Fixed
 Assignee: Bogdan Raducanu
Fix Version/s: 2.2.0

> DebugFilesystem.assertNoOpenStreams thread race
> ---
>
> Key: SPARK-20243
> URL: https://issues.apache.org/jira/browse/SPARK-20243
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>Assignee: Bogdan Raducanu
> Fix For: 2.2.0
>
>
> Introduced by SPARK-19946.
> DebugFilesystem.assertNoOpenStreams gets the size of the openStreams 
> ConcurrentHashMap and then later, if the size was > 0, accesses the first 
> element in openStreams.values. But, the ConcurrentHashMap might be cleared by 
> another thread between getting its size and accessing it, resulting in an 
> exception when trying to call .head on an empty collection.
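A hedged sketch of the race described above (simplified, not the actual 
DebugFilesystem code), together with a snapshot-style check that avoids it:

{code}
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

val openStreams = new ConcurrentHashMap[String, Throwable]()

// Racy: size() and head are two separate reads of a map that another thread
// may clear in between, so head can end up being called on an empty collection.
def assertNoOpenStreamsRacy(): Unit = {
  if (openStreams.size() > 0) {
    throw new IllegalStateException("open stream", openStreams.values().asScala.head)
  }
}

// One way to avoid it: take a single immutable snapshot and work only off that.
def assertNoOpenStreamsSafe(): Unit = {
  val snapshot = openStreams.values().asScala.toList
  if (snapshot.nonEmpty) {
    throw new IllegalStateException("open stream", snapshot.head)
  }
}
{code}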



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20000) Spark Hive tests aborted due to lz4-java on ppc64le

2017-04-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

OK so this is about updating lz4-java in order to add PPC64 support?

> Spark Hive tests aborted due to lz4-java on ppc64le
> ---
>
> Key: SPARK-2
> URL: https://issues.apache.org/jira/browse/SPARK-2
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.2.0
> Environment: Ubuntu 14.04 ppc64le 
> $ java -version
> openjdk version "1.8.0_111"
> OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-3~14.04.1-b14)
> OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
>Reporter: Sonia Garudi
>Priority: Minor
>  Labels: ppc64le
> Attachments: hs_err_pid.log
>
>
> The tests are getting aborted in Spark Hive project with the following error :
> {code:borderStyle=solid}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x3fff94dbf114, pid=6160, tid=0x3fff6efef1a0
> #
> # JRE version: OpenJDK Runtime Environment (8.0_111-b14) (build 
> 1.8.0_111-8u111-b14-3~14.04.1-b14)
> # Java VM: OpenJDK 64-Bit Server VM (25.111-b14 mixed mode linux-ppc64 
> compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x56f114]
> {code}
> In the thread log file, I found the following traces:
> Event: 3669.042 Thread 0x3fff89976800 Exception 'java/lang/NoClassDefFoundError': Could not initialize class 
> net.jpountz.lz4.LZ4JNI (0x00079fcda3b8) thrown at 
> [/build/openjdk-8-fVIxxI/openjdk-8-8u111-b14/src/hotspot/src/share/vm/oops/instanceKlass.cpp,
>  line 890]
> This error is due to lz4-java (version 1.3.0), which doesn't have support 
> for ppc64le. Please find the thread log file attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive

2017-04-10 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962903#comment-15962903
 ] 

Steve Loughran commented on SPARK-20202:


One thing I do recall as trouble here was that Ivy resolution was different 
from Maven's, and fixing up all the transitive dependencies was a trouble spot. 
Patches here need to be tested against both SBT and Maven; as Jenkins only runs 
SBT, the Maven builds will have to be done manually. I don't remember which 
specific dependency was the problem.

> Remove references to org.spark-project.hive
> ---
>
> Key: SPARK-20202
> URL: https://issues.apache.org/jira/browse/SPARK-20202
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.6.4, 2.0.3, 2.1.1
>Reporter: Owen O'Malley
>
> Spark can't continue to depend on their fork of Hive and must move to 
> standard Hive versions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20000) Spark Hive tests aborted due to lz4-java on ppc64le

2017-04-10 Thread Ayappan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962896#comment-15962896
 ] 

Ayappan commented on SPARK-2:
-

https://github.com/lz4/lz4-java/pull/84

> Spark Hive tests aborted due to lz4-java on ppc64le
> ---
>
> Key: SPARK-2
> URL: https://issues.apache.org/jira/browse/SPARK-2
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
> Environment: Ubuntu 14.04 ppc64le 
> $ java -version
> openjdk version "1.8.0_111"
> OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-3~14.04.1-b14)
> OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
>Reporter: Sonia Garudi
>  Labels: ppc64le
> Attachments: hs_err_pid.log
>
>
> The tests are getting aborted in Spark Hive project with the following error :
> {code:borderStyle=solid}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x3fff94dbf114, pid=6160, tid=0x3fff6efef1a0
> #
> # JRE version: OpenJDK Runtime Environment (8.0_111-b14) (build 
> 1.8.0_111-8u111-b14-3~14.04.1-b14)
> # Java VM: OpenJDK 64-Bit Server VM (25.111-b14 mixed mode linux-ppc64 
> compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x56f114]
> {code}
> In the thread log file, I found the following traces:
> Event: 3669.042 Thread 0x3fff89976800 Exception 'java/lang/NoClassDefFoundError': Could not initialize class 
> net.jpountz.lz4.LZ4JNI (0x00079fcda3b8) thrown at 
> [/build/openjdk-8-fVIxxI/openjdk-8-8u111-b14/src/hotspot/src/share/vm/oops/instanceKlass.cpp,
>  line 890]
> This error is due to lz4-java (version 1.3.0), which doesn't have support 
> for ppc64le. Please find the thread log file attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20277) Allow Spark on YARN to be launched with Docker

2017-04-10 Thread Zhankun Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962787#comment-15962787
 ] 

Zhankun Tang commented on SPARK-20277:
--

According to the roadmap, it requires at least Hadoop 2.8 Alpha1. I don't think 
it will be that long a wait.

> Allow Spark on YARN to be launched with Docker
> --
>
> Key: SPARK-20277
> URL: https://issues.apache.org/jira/browse/SPARK-20277
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Zhankun Tang
>
> YARN is adding support for launching containers in Docker (YARN-3611). 
> We want to empower Spark to support launching executors via a Docker image 
> that resolves the user's dependencies.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20279) In web ui,'Only showing 200' should be changed to 'only showing last 200'.

2017-04-10 Thread guoxiaolongzte (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolongzte updated SPARK-20279:
---
Description: 
In the web UI, 'Only showing 200' should be changed to 'only showing last 200' 
on the 'Jobs' and 'Stages' pages.

Adding 'last' makes it clear to users whether the 200 jobs or stages currently 
shown are the latest 200 or the first 200.

Please see the attachment.



> In web ui,'Only showing 200' should be changed to 'only showing last 200'.
> --
>
> Key: SPARK-20279
> URL: https://issues.apache.org/jira/browse/SPARK-20279
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: guoxiaolongzte
>Priority: Minor
> Attachments: jobs.png, stages.png
>
>
> In the web UI, 'Only showing 200' should be changed to 'only showing last 200' 
> on the 'Jobs' and 'Stages' pages.
> Adding 'last' makes it clear to users whether the 200 jobs or stages currently 
> shown are the latest 200 or the first 200.
> Please see the attachment.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20279) In web ui,'Only showing 200' should be changed to 'only showing last 200'.

2017-04-10 Thread guoxiaolongzte (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolongzte updated SPARK-20279:
---
Attachment: stages.png
jobs.png

> In web ui,'Only showing 200' should be changed to 'only showing last 200'.
> --
>
> Key: SPARK-20279
> URL: https://issues.apache.org/jira/browse/SPARK-20279
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: guoxiaolongzte
>Priority: Minor
> Attachments: jobs.png, stages.png
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20279) In web ui,'Only showing 200' should be changed to 'only showing last 200'.

2017-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20279:


Assignee: (was: Apache Spark)

> In web ui,'Only showing 200' should be changed to 'only showing last 200'.
> --
>
> Key: SPARK-20279
> URL: https://issues.apache.org/jira/browse/SPARK-20279
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: guoxiaolongzte
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20279) In web ui,'Only showing 200' should be changed to 'only showing last 200'.

2017-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20279:


Assignee: Apache Spark

> In web ui,'Only showing 200' should be changed to 'only showing last 200'.
> --
>
> Key: SPARK-20279
> URL: https://issues.apache.org/jira/browse/SPARK-20279
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: guoxiaolongzte
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20279) In web ui,'Only showing 200' should be changed to 'only showing last 200'.

2017-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962763#comment-15962763
 ] 

Apache Spark commented on SPARK-20279:
--

User 'guoxiaolongzte' has created a pull request for this issue:
https://github.com/apache/spark/pull/17593

> In web ui,'Only showing 200' should be changed to 'only showing last 200'.
> --
>
> Key: SPARK-20279
> URL: https://issues.apache.org/jira/browse/SPARK-20279
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: guoxiaolongzte
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20243) DebugFilesystem.assertNoOpenStreams thread race

2017-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20243:


Assignee: (was: Apache Spark)

> DebugFilesystem.assertNoOpenStreams thread race
> ---
>
> Key: SPARK-20243
> URL: https://issues.apache.org/jira/browse/SPARK-20243
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> Introduced by SPARK-19946.
> DebugFilesystem.assertNoOpenStreams gets the size of the openStreams 
> ConcurrentHashMap and then later, if the size was > 0, accesses the first 
> element in openStreams.values. But, the ConcurrentHashMap might be cleared by 
> another thread between getting its size and accessing it, resulting in an 
> exception when trying to call .head on an empty collection.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20243) DebugFilesystem.assertNoOpenStreams thread race

2017-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20243:


Assignee: Apache Spark

> DebugFilesystem.assertNoOpenStreams thread race
> ---
>
> Key: SPARK-20243
> URL: https://issues.apache.org/jira/browse/SPARK-20243
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>Assignee: Apache Spark
>
> Introduced by SPARK-19946.
> DebugFilesystem.assertNoOpenStreams gets the size of the openStreams 
> ConcurrentHashMap and then later, if the size was > 0, accesses the first 
> element in openStreams.values. But, the ConcurrentHashMap might be cleared by 
> another thread between getting its size and accessing it, resulting in an 
> exception when trying to call .head on an empty collection.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20243) DebugFilesystem.assertNoOpenStreams thread race

2017-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962761#comment-15962761
 ] 

Apache Spark commented on SPARK-20243:
--

User 'bogdanrdc' has created a pull request for this issue:
https://github.com/apache/spark/pull/17592

> DebugFilesystem.assertNoOpenStreams thread race
> ---
>
> Key: SPARK-20243
> URL: https://issues.apache.org/jira/browse/SPARK-20243
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> Introduced by SPARK-19946.
> DebugFilesystem.assertNoOpenStreams gets the size of the openStreams 
> ConcurrentHashMap and then later, if the size was > 0, accesses the first 
> element in openStreams.values. But, the ConcurrentHashMap might be cleared by 
> another thread between getting its size and accessing it, resulting in an 
> exception when trying to call .head on an empty collection.
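
A minimal sketch of the race and of one way to avoid it, assuming a 
ConcurrentHashMap-backed bookkeeping structure like the one described above. The 
object and field names are hypothetical, not the actual DebugFilesystem code, and 
this is not necessarily what the linked pull request does:

{code}
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

object OpenStreamsCheck {
  // Hypothetical stand-in for DebugFilesystem's bookkeeping: stream id -> stack trace.
  private val openStreams = new ConcurrentHashMap[String, Throwable]()

  // Racy version, as described above: the size check and the .head access are two
  // separate reads, and another thread may clear the map in between them.
  def assertNoOpenStreamsRacy(): Unit = {
    if (openStreams.size() > 0) {
      throw new IllegalStateException(
        s"There are ${openStreams.size()} possibly open streams.",
        openStreams.values().asScala.head) // may throw NoSuchElementException instead
    }
  }

  // One way to avoid the race: copy the values once and only use the snapshot.
  def assertNoOpenStreams(): Unit = {
    val open = openStreams.values().asScala.toList
    if (open.nonEmpty) {
      throw new IllegalStateException(
        s"There are ${open.size} possibly open streams.", open.head)
    }
  }
}
{code}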



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20281) Table-valued function range in SQL should use the same number of partitions as spark.range

2017-04-10 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-20281:
---

 Summary: Table-valued function range in SQL should use the same 
number of partitions as spark.range
 Key: SPARK-20281
 URL: https://issues.apache.org/jira/browse/SPARK-20281
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Jacek Laskowski
Priority: Minor


Note the different number of partitions produced by {{range}} in SQL and as an operator.

{code}
scala> spark.range(4).explain
== Physical Plan ==
*Range (0, 4, step=1, splits=Some(8)) // <-- note Some(8)

scala> sql("select * from range(4)").explain
== Physical Plan ==
*Range (0, 4, step=1, splits=None) // <-- note None
{code}

If I'm not mistaken, the fix is to change {{builtinFunctions}} in 
{{ResolveTableValuedFunctions}} (see 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveTableValuedFunctions.scala#L82-L93])
 to use {{sparkContext.defaultParallelism}}, as {{SparkSession.range}} does (see 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L517]).

Please confirm, and I'll work on a fix if and as needed.
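
If it helps to verify the behaviour, the number of partitions each variant 
actually produces can be compared from spark-shell (a quick check only; the exact 
numbers depend on the master and core count):

{code}
// Compare the parallelism used by the DataFrame API range and the SQL table-valued function.
sc.defaultParallelism
spark.range(4).rdd.getNumPartitions
sql("select * from range(4)").rdd.getNumPartitions
{code}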



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20279) In web ui,'Only showing 200' should be changed to 'only showing last 200'.

2017-04-10 Thread guoxiaolongzte (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolongzte updated SPARK-20279:
---
Summary: In web ui,'Only showing 200' should be changed to 'only showing 
last 200'.  (was: In web ui,'Only showing 200' should be changed to 'only 
showing last 200' in the page of 'jobs' or'stages'.)

> In web ui,'Only showing 200' should be changed to 'only showing last 200'.
> --
>
> Key: SPARK-20279
> URL: https://issues.apache.org/jira/browse/SPARK-20279
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: guoxiaolongzte
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20279) In web ui,'Only showing 200' should be changed to 'only showing last 200' in the page of 'jobs' or'stages'.

2017-04-10 Thread guoxiaolongzte (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolongzte updated SPARK-20279:
---
Summary: In web ui,'Only showing 200' should be changed to 'only showing 
last 200' in the page of 'jobs' or'stages'.  (was: In web ui,'Only showing 200' 
should be changed to 'only showing last 200'.)

> In web ui,'Only showing 200' should be changed to 'only showing last 200' in 
> the page of 'jobs' or'stages'.
> ---
>
> Key: SPARK-20279
> URL: https://issues.apache.org/jira/browse/SPARK-20279
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: guoxiaolongzte
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20280) SharedInMemoryCache Weigher integer overflow

2017-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20280:


Assignee: (was: Apache Spark)

> SharedInMemoryCache Weigher integer overflow
> 
>
> Key: SPARK-20280
> URL: https://issues.apache.org/jira/browse/SPARK-20280
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Bogdan Raducanu
>
> in FileStatusCache.scala:
> {code}
> .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] {
>   override def weigh(key: (ClientId, Path), value: Array[FileStatus]): 
> Int = {
> (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt
>   }})
> {code}
> Weigher.weigh returns Int but the size of an Array[FileStatus] could be 
> bigger than Int.maxValue. Then, a negative value is returned, leading to this 
> exception:
> {code}
> * [info]   java.lang.IllegalStateException: Weights must be non-negative
> * [info]   at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:149)
> * [info]   at 
> com.google.common.cache.LocalCache$Segment.setValue(LocalCache.java:2223)
> * [info]   at 
> com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2944)
> * [info]   at com.google.common.cache.LocalCache.put(LocalCache.java:4212)
> * [info]   at 
> com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
> * [info]   at 
> org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:131)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20280) SharedInMemoryCache Weigher integer overflow

2017-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962751#comment-15962751
 ] 

Apache Spark commented on SPARK-20280:
--

User 'bogdanrdc' has created a pull request for this issue:
https://github.com/apache/spark/pull/17591

> SharedInMemoryCache Weigher integer overflow
> 
>
> Key: SPARK-20280
> URL: https://issues.apache.org/jira/browse/SPARK-20280
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Bogdan Raducanu
>
> in FileStatusCache.scala:
> {code}
> .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] {
>   override def weigh(key: (ClientId, Path), value: Array[FileStatus]): 
> Int = {
> (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt
>   }})
> {code}
> Weigher.weigh returns Int but the size of an Array[FileStatus] could be 
> bigger than Int.maxValue. Then, a negative value is returned, leading to this 
> exception:
> {code}
> * [info]   java.lang.IllegalStateException: Weights must be non-negative
> * [info]   at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:149)
> * [info]   at 
> com.google.common.cache.LocalCache$Segment.setValue(LocalCache.java:2223)
> * [info]   at 
> com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2944)
> * [info]   at com.google.common.cache.LocalCache.put(LocalCache.java:4212)
> * [info]   at 
> com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
> * [info]   at 
> org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:131)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20280) SharedInMemoryCache Weigher integer overflow

2017-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20280:


Assignee: Apache Spark

> SharedInMemoryCache Weigher integer overflow
> 
>
> Key: SPARK-20280
> URL: https://issues.apache.org/jira/browse/SPARK-20280
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Bogdan Raducanu
>Assignee: Apache Spark
>
> in FileStatusCache.scala:
> {code}
> .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] {
>   override def weigh(key: (ClientId, Path), value: Array[FileStatus]): 
> Int = {
> (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt
>   }})
> {code}
> Weigher.weigh returns Int but the size of an Array[FileStatus] could be 
> bigger than Int.maxValue. Then, a negative value is returned, leading to this 
> exception:
> {code}
> * [info]   java.lang.IllegalStateException: Weights must be non-negative
> * [info]   at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:149)
> * [info]   at 
> com.google.common.cache.LocalCache$Segment.setValue(LocalCache.java:2223)
> * [info]   at 
> com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2944)
> * [info]   at com.google.common.cache.LocalCache.put(LocalCache.java:4212)
> * [info]   at 
> com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
> * [info]   at 
> org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:131)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20280) SharedInMemoryCache Weigher integer overflow

2017-04-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962750#comment-15962750
 ] 

Sean Owen commented on SPARK-20280:
---

I guess cap it at {{Int.MaxValue}}?
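
A minimal sketch of that suggestion, mirroring the weigher fragment from the 
description (an illustration only, not necessarily the change made in the pull 
request): clamp the {{Long}} estimate before narrowing it to {{Int}} so the 
weight can never go negative.

{code}
.weigher(new Weigher[(ClientId, Path), Array[FileStatus]] {
  override def weigh(key: (ClientId, Path), value: Array[FileStatus]): Int = {
    // SizeEstimator.estimate returns a Long; clamp it to Int.MaxValue before the
    // conversion so the truncation can never produce a negative weight.
    val estimate = SizeEstimator.estimate(key) + SizeEstimator.estimate(value)
    math.min(estimate, Int.MaxValue.toLong).toInt
  }
})
{code}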

> SharedInMemoryCache Weigher integer overflow
> 
>
> Key: SPARK-20280
> URL: https://issues.apache.org/jira/browse/SPARK-20280
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Bogdan Raducanu
>
> in FileStatusCache.scala:
> {code}
> .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] {
>   override def weigh(key: (ClientId, Path), value: Array[FileStatus]): 
> Int = {
> (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt
>   }})
> {code}
> Weigher.weigh returns Int but the size of an Array[FileStatus] could be 
> bigger than Int.maxValue. Then, a negative value is returned, leading to this 
> exception:
> {code}
> * [info]   java.lang.IllegalStateException: Weights must be non-negative
> * [info]   at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:149)
> * [info]   at 
> com.google.common.cache.LocalCache$Segment.setValue(LocalCache.java:2223)
> * [info]   at 
> com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2944)
> * [info]   at com.google.common.cache.LocalCache.put(LocalCache.java:4212)
> * [info]   at 
> com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
> * [info]   at 
> org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:131)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2017-04-10 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962731#comment-15962731
 ] 

Wenchen Fan edited comment on SPARK-12837 at 4/10/17 11:54 AM:
---

[~jbherman] I tried and debugged your example: the actual task result is around 
1 KB, so with 1,000 shuffle partitions the total result size is around 1 MB. I'm 
trying to reduce the accumulator overhead, but you have to increase 
`spark.driver.maxResultSize` anyway; 1 MB is too small for this job.


was (Author: cloud_fan):
[~jbherman] I tried and debugged your example: the actual task result is around 
1 KB, so with 1,000 shuffle partitions the total result size is around 1 MB. I'm 
trying to reduce the accumulator overhead, but you have to increase 
`spark.driver.maxResultSize` anyway.

> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.0.0
>
>
> Executing a sql statement with a large number of partitions requires a high 
> memory space for the driver even there are no requests to collect data back 
> to the driver.
> Here are steps to re-produce the issue.
> 1. Start spark shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
>  "toto2" ) // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20280) SharedInMemoryCache Weigher integer overflow

2017-04-10 Thread Bogdan Raducanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated SPARK-20280:

Description: 
in FileStatusCache.scala:
{code}
.weigher(new Weigher[(ClientId, Path), Array[FileStatus]] {
  override def weigh(key: (ClientId, Path), value: Array[FileStatus]): Int 
= {
(SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt
  }})
{code}

Weigher.weigh returns Int but the size of an Array[FileStatus] could be bigger 
than Int.maxValue. Then, a negative value is returned, leading to this 
exception:

{code}
* [info]   java.lang.IllegalStateException: Weights must be non-negative
* [info]   at 
com.google.common.base.Preconditions.checkState(Preconditions.java:149)
* [info]   at 
com.google.common.cache.LocalCache$Segment.setValue(LocalCache.java:2223)
* [info]   at 
com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2944)
* [info]   at com.google.common.cache.LocalCache.put(LocalCache.java:4212)
* [info]   at 
com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
* [info]   at 
org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:131)

{code}

  was:
{code}
.weigher(new Weigher[(ClientId, Path), Array[FileStatus]] {
  override def weigh(key: (ClientId, Path), value: Array[FileStatus]): Int 
= {
(SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt
  }})
{code}

Weigher.weigh returns Int but the size of an Array[FileStatus] could be bigger 
than Int.maxValue. Then, a negative value is returned, leading to this 
exception:

{code}
* [info]   java.lang.IllegalStateException: Weights must be non-negative
* [info]   at 
com.google.common.base.Preconditions.checkState(Preconditions.java:149)
* [info]   at 
com.google.common.cache.LocalCache$Segment.setValue(LocalCache.java:2223)
* [info]   at 
com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2944)
* [info]   at com.google.common.cache.LocalCache.put(LocalCache.java:4212)
* [info]   at 
com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
* [info]   at 
org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:131)

{code}


> SharedInMemoryCache Weigher integer overflow
> 
>
> Key: SPARK-20280
> URL: https://issues.apache.org/jira/browse/SPARK-20280
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Bogdan Raducanu
>
> in FileStatusCache.scala:
> {code}
> .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] {
>   override def weigh(key: (ClientId, Path), value: Array[FileStatus]): 
> Int = {
> (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt
>   }})
> {code}
> Weigher.weigh returns Int but the size of an Array[FileStatus] could be 
> bigger than Int.maxValue. Then, a negative value is returned, leading to this 
> exception:
> {code}
> * [info]   java.lang.IllegalStateException: Weights must be non-negative
> * [info]   at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:149)
> * [info]   at 
> com.google.common.cache.LocalCache$Segment.setValue(LocalCache.java:2223)
> * [info]   at 
> com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2944)
> * [info]   at com.google.common.cache.LocalCache.put(LocalCache.java:4212)
> * [info]   at 
> com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
> * [info]   at 
> org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:131)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20280) SharedInMemoryCache Weigher integer overflow

2017-04-10 Thread Bogdan Raducanu (JIRA)
Bogdan Raducanu created SPARK-20280:
---

 Summary: SharedInMemoryCache Weigher integer overflow
 Key: SPARK-20280
 URL: https://issues.apache.org/jira/browse/SPARK-20280
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0, 2.2.0
Reporter: Bogdan Raducanu


{code}
.weigher(new Weigher[(ClientId, Path), Array[FileStatus]] {
  override def weigh(key: (ClientId, Path), value: Array[FileStatus]): Int 
= {
(SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt
  }})
{code}

Weigher.weigh returns Int but the size of an Array[FileStatus] could be bigger 
than Int.maxValue. Then, a negative value is returned, leading to this 
exception:

{code}
* [info]   java.lang.IllegalStateException: Weights must be non-negative
* [info]   at 
com.google.common.base.Preconditions.checkState(Preconditions.java:149)
* [info]   at 
com.google.common.cache.LocalCache$Segment.setValue(LocalCache.java:2223)
* [info]   at 
com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2944)
* [info]   at com.google.common.cache.LocalCache.put(LocalCache.java:4212)
* [info]   at 
com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
* [info]   at 
org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:131)

{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20279) In web ui,'Only showing 200' should be changed to 'only showing last 200'.

2017-04-10 Thread guoxiaolongzte (JIRA)
guoxiaolongzte created SPARK-20279:
--

 Summary: In web ui,'Only showing 200' should be changed to 'only 
showing last 200'.
 Key: SPARK-20279
 URL: https://issues.apache.org/jira/browse/SPARK-20279
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.1.0
Reporter: guoxiaolongzte
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2017-04-10 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962731#comment-15962731
 ] 

Wenchen Fan commented on SPARK-12837:
-

[~jbherman] I tried and debugged your example: the actual task result is around 
1 KB, so with 1,000 shuffle partitions the total result size is around 1 MB. I'm 
trying to reduce the accumulator overhead, but you have to increase 
`spark.driver.maxResultSize` anyway.
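
For reference, the limit can be raised when the application is configured; a 
minimal sketch with SparkConf (the 2g value is only an example, not a 
recommendation):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Raise spark.driver.maxResultSize before the context is created; the value is an example.
val conf = new SparkConf()
  .setAppName("example")
  .set("spark.driver.maxResultSize", "2g")
val sc = new SparkContext(conf)
{code}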

> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.0.0
>
>
> Executing a sql statement with a large number of partitions requires a high 
> memory space for the driver even there are no requests to collect data back 
> to the driver.
> Here are steps to re-produce the issue.
> 1. Start spark shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
>  "toto2" ) // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20278) Disable 'multiple_dots_linter' lint rule that is against project's code style

2017-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20278:


Assignee: Apache Spark

> Disable 'multiple_dots_linter' lint rule that is against project's code style
> -
>
> Key: SPARK-20278
> URL: https://issues.apache.org/jira/browse/SPARK-20278
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, multi-dot separated variables in R is not allowed. For example,
> {code}
>  setMethod("from_json", signature(x = "Column", schema = "structType"),
> -  function(x, schema, asJsonArray = FALSE, ...) {
> +  function(x, schema, as.json.array = FALSE, ...) {
>  if (asJsonArray) {
>jschema <- callJStatic("org.apache.spark.sql.types.DataTypes",
>   "createArrayType",
> {code}
> produces an error as below:
> {code}
> R/functions.R:2462:31: style: Words within variable and function names should 
> be separated by '_' rather than '.'.
>   function(x, schema, as.json.array = FALSE, ...) {
>   ^
> {code}
> This seems against https://google.github.io/styleguide/Rguide.xml#identifiers 
> which says
> {quote}
>  The preferred form for variable names is all lower case letters and words 
> separated with dots
> {quote}
> This looks because lintr https://github.com/jimhester/lintr follows 
> http://r-pkgs.had.co.nz/style.html as written in the README.md. Few cases 
> seems not following Google's one.
> Per SPARK-6813, we follow Google's R Style Guide with few exceptions 
> https://google.github.io/styleguide/Rguide.xml. This is also merged into 
> Spark's website - https://github.com/apache/spark-website/pull/43
> Also, we have no limit on function name. This rule also looks affecting to 
> the name of functions as written in the README.md.
> {quote}
> multiple_dots_linter: check that function and variable names are separated by 
> _ rather than ..
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20278) Disable 'multiple_dots_linter' lint rule that is against project's code style

2017-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20278:


Assignee: (was: Apache Spark)

> Disable 'multiple_dots_linter' lint rule that is against project's code style
> -
>
> Key: SPARK-20278
> URL: https://issues.apache.org/jira/browse/SPARK-20278
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, multi-dot separated variables in R is not allowed. For example,
> {code}
>  setMethod("from_json", signature(x = "Column", schema = "structType"),
> -  function(x, schema, asJsonArray = FALSE, ...) {
> +  function(x, schema, as.json.array = FALSE, ...) {
>  if (asJsonArray) {
>jschema <- callJStatic("org.apache.spark.sql.types.DataTypes",
>   "createArrayType",
> {code}
> produces an error as below:
> {code}
> R/functions.R:2462:31: style: Words within variable and function names should 
> be separated by '_' rather than '.'.
>   function(x, schema, as.json.array = FALSE, ...) {
>   ^
> {code}
> This seems against https://google.github.io/styleguide/Rguide.xml#identifiers 
> which says
> {quote}
>  The preferred form for variable names is all lower case letters and words 
> separated with dots
> {quote}
> This looks because lintr https://github.com/jimhester/lintr follows 
> http://r-pkgs.had.co.nz/style.html as written in the README.md. Few cases 
> seems not following Google's one.
> Per SPARK-6813, we follow Google's R Style Guide with few exceptions 
> https://google.github.io/styleguide/Rguide.xml. This is also merged into 
> Spark's website - https://github.com/apache/spark-website/pull/43
> Also, we have no limit on function name. This rule also looks affecting to 
> the name of functions as written in the README.md.
> {quote}
> multiple_dots_linter: check that function and variable names are separated by 
> _ rather than ..
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20278) Disable 'multiple_dots_linter' lint rule that is against project's code style

2017-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962656#comment-15962656
 ] 

Apache Spark commented on SPARK-20278:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/17590

> Disable 'multiple_dots_linter' lint rule that is against project's code style
> -
>
> Key: SPARK-20278
> URL: https://issues.apache.org/jira/browse/SPARK-20278
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, multi-dot separated variables in R is not allowed. For example,
> {code}
>  setMethod("from_json", signature(x = "Column", schema = "structType"),
> -  function(x, schema, asJsonArray = FALSE, ...) {
> +  function(x, schema, as.json.array = FALSE, ...) {
>  if (asJsonArray) {
>jschema <- callJStatic("org.apache.spark.sql.types.DataTypes",
>   "createArrayType",
> {code}
> produces an error as below:
> {code}
> R/functions.R:2462:31: style: Words within variable and function names should 
> be separated by '_' rather than '.'.
>   function(x, schema, as.json.array = FALSE, ...) {
>   ^
> {code}
> This seems against https://google.github.io/styleguide/Rguide.xml#identifiers 
> which says
> {quote}
>  The preferred form for variable names is all lower case letters and words 
> separated with dots
> {quote}
> This looks because lintr https://github.com/jimhester/lintr follows 
> http://r-pkgs.had.co.nz/style.html as written in the README.md. Few cases 
> seems not following Google's one.
> Per SPARK-6813, we follow Google's R Style Guide with few exceptions 
> https://google.github.io/styleguide/Rguide.xml. This is also merged into 
> Spark's website - https://github.com/apache/spark-website/pull/43
> Also, we have no limit on function name. This rule also looks affecting to 
> the name of functions as written in the README.md.
> {quote}
> multiple_dots_linter: check that function and variable names are separated by 
> _ rather than ..
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20278) Disable 'multiple_dots_linter; lint rule that is against project's code style

2017-04-10 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-20278:
-
Description: 
Currently, multi-dot separated variable names are not allowed in R code. For example,

{code}
 setMethod("from_json", signature(x = "Column", schema = "structType"),
-  function(x, schema, asJsonArray = FALSE, ...) {
+  function(x, schema, as.json.array = FALSE, ...) {
 if (asJsonArray) {
   jschema <- callJStatic("org.apache.spark.sql.types.DataTypes",
  "createArrayType",
{code}

produces an error as below:

{code}
R/functions.R:2462:31: style: Words within variable and function names should 
be separated by '_' rather than '.'.
  function(x, schema, as.json.array = FALSE, ...) {
  ^
{code}

This seems against https://google.github.io/styleguide/Rguide.xml#identifiers 
which says

{quote}
 The preferred form for variable names is all lower case letters and words 
separated with dots
{quote}

This appears to be because lintr (https://github.com/jimhester/lintr) follows 
http://r-pkgs.had.co.nz/style.html, as stated in its README.md, and a few of its 
rules do not follow Google's guide.

Per SPARK-6813, we follow Google's R Style Guide with few exceptions 
https://google.github.io/styleguide/Rguide.xml. This is also merged into 
Spark's website - https://github.com/apache/spark-website/pull/43

Also, we place no such restriction on function names, yet according to the lintr 
README.md this rule affects function names as well.

{quote}
multiple_dots_linter: check that function and variable names are separated by _ 
rather than ..
{quote}


  was:
Currently, multi-dot separated variable names are not allowed in R code. For example,

{code}
 setMethod("from_json", signature(x = "Column", schema = "structType"),
-  function(x, schema, asJsonArray = FALSE, ...) {
+  function(x, schema, as.json.array = FALSE, ...) {
 if (asJsonArray) {
   jschema <- callJStatic("org.apache.spark.sql.types.DataTypes",
  "createArrayType",
{code}

produces an error as below:

{code}
R/functions.R:2462:31: style: Words within variable and function names should 
be separated by '_' rather than '.'.
  function(x, schema, as.json.array = FALSE, ...) {
  ^
{code}

This seems against https://google.github.io/styleguide/Rguide.xml#identifiers 
which says

{quote}
 The preferred form for variable names is all lower case letters and words 
separated with dots
{quote}

This appears to be because lintr (https://github.com/jimhester/lintr) follows 
http://r-pkgs.had.co.nz/style.html, as stated in its README.md, and that guide 
conflicts with the rule above.

Per SPARK-6813, we follow Google's R Style Guide with few exceptions 
https://google.github.io/styleguide/Rguide.xml. This is also merged into 
Spark's website - https://github.com/apache/spark-website/pull/43

Also, we place no such restriction on function names, yet according to the lintr 
README.md this rule affects function names as well.

{quote}
multiple_dots_linter: check that function and variable names are separated by _ 
rather than ..
{quote}



> Disable 'multiple_dots_linter; lint rule that is against project's code style
> -
>
> Key: SPARK-20278
> URL: https://issues.apache.org/jira/browse/SPARK-20278
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, multi-dot separated variables in R is not allowed. For example,
> {code}
>  setMethod("from_json", signature(x = "Column", schema = "structType"),
> -  function(x, schema, asJsonArray = FALSE, ...) {
> +  function(x, schema, as.json.array = FALSE, ...) {
>  if (asJsonArray) {
>jschema <- callJStatic("org.apache.spark.sql.types.DataTypes",
>   "createArrayType",
> {code}
> produces an error as below:
> {code}
> R/functions.R:2462:31: style: Words within variable and function names should 
> be separated by '_' rather than '.'.
>   function(x, schema, as.json.array = FALSE, ...) {
>   ^
> {code}
> This seems against https://google.github.io/styleguide/Rguide.xml#identifiers 
> which says
> {quote}
>  The preferred form for variable names is all lower case letters and words 
> separated with dots
> {quote}
> This looks because lintr https://github.com/jimhester/lintr follows 
> http://r-pkgs.had.co.nz/style.html as written in the README.md. Few cases 
> seems not following Google's one.
> Per SPARK-6813, we follow Google's R Style Guide with few exceptions 
> 

[jira] [Updated] (SPARK-20278) Disable 'multiple_dots_linter' lint rule that is against project's code style

2017-04-10 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-20278:
-
Summary: Disable 'multiple_dots_linter' lint rule that is against project's 
code style  (was: Disable 'multiple_dots_linter; lint rule that is against 
project's code style)

> Disable 'multiple_dots_linter' lint rule that is against project's code style
> -
>
> Key: SPARK-20278
> URL: https://issues.apache.org/jira/browse/SPARK-20278
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, multi-dot separated variables in R is not allowed. For example,
> {code}
>  setMethod("from_json", signature(x = "Column", schema = "structType"),
> -  function(x, schema, asJsonArray = FALSE, ...) {
> +  function(x, schema, as.json.array = FALSE, ...) {
>  if (asJsonArray) {
>jschema <- callJStatic("org.apache.spark.sql.types.DataTypes",
>   "createArrayType",
> {code}
> produces an error as below:
> {code}
> R/functions.R:2462:31: style: Words within variable and function names should 
> be separated by '_' rather than '.'.
>   function(x, schema, as.json.array = FALSE, ...) {
>   ^
> {code}
> This seems against https://google.github.io/styleguide/Rguide.xml#identifiers 
> which says
> {quote}
>  The preferred form for variable names is all lower case letters and words 
> separated with dots
> {quote}
> This looks because lintr https://github.com/jimhester/lintr follows 
> http://r-pkgs.had.co.nz/style.html as written in the README.md. Few cases 
> seems not following Google's one.
> Per SPARK-6813, we follow Google's R Style Guide with few exceptions 
> https://google.github.io/styleguide/Rguide.xml. This is also merged into 
> Spark's website - https://github.com/apache/spark-website/pull/43
> Also, we have no limit on function name. This rule also looks affecting to 
> the name of functions as written in the README.md.
> {quote}
> multiple_dots_linter: check that function and variable names are separated by 
> _ rather than ..
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20278) Disable 'multiple_dots_linter; lint rule that is against project's code style

2017-04-10 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-20278:


 Summary: Disable 'multiple_dots_linter; lint rule that is against 
project's code style
 Key: SPARK-20278
 URL: https://issues.apache.org/jira/browse/SPARK-20278
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.2.0
Reporter: Hyukjin Kwon
Priority: Minor


Currently, multi-dot separated variable names are not allowed in R code. For example,

{code}
 setMethod("from_json", signature(x = "Column", schema = "structType"),
-  function(x, schema, asJsonArray = FALSE, ...) {
+  function(x, schema, as.json.array = FALSE, ...) {
 if (asJsonArray) {
   jschema <- callJStatic("org.apache.spark.sql.types.DataTypes",
  "createArrayType",
{code}

produces an error as below:

{code}
R/functions.R:2462:31: style: Words within variable and function names should 
be separated by '_' rather than '.'.
  function(x, schema, as.json.array = FALSE, ...) {
  ^
{code}

This seems against https://google.github.io/styleguide/Rguide.xml#identifiers 
which says

{quote}
 The preferred form for variable names is all lower case letters and words 
separated with dots
{quote}

This appears to be because lintr (https://github.com/jimhester/lintr) follows 
http://r-pkgs.had.co.nz/style.html, as stated in its README.md, and that guide 
conflicts with the rule above.

Per SPARK-6813, we follow Google's R Style Guide with few exceptions 
https://google.github.io/styleguide/Rguide.xml. This is also merged into 
Spark's website - https://github.com/apache/spark-website/pull/43

Also, we place no such restriction on function names, yet according to the lintr 
README.md this rule affects function names as well.

{quote}
multiple_dots_linter: check that function and variable names are separated by _ 
rather than ..
{quote}




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20252) java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row

2017-04-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962636#comment-15962636
 ] 

Sean Owen commented on SPARK-20252:
---

It's likely related to other spark-shell + case class issues, whether exactly 
the same or not. Those are really issues with the Scala shell.
Stopping and starting contexts isn't supported.
If you have a lead on a reliable fix, propose it, but otherwise this is why I 
closed this. Generally, don't reopen issues without new info.

> java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row
> ---
>
> Key: SPARK-20252
> URL: https://issues.apache.org/jira/browse/SPARK-20252
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.3
> Environment: Datastax DSE dual node SPARK cluster
> [cqlsh 5.0.1 | Cassandra 3.0.12.1586 | DSE 5.0.7 | CQL spec 3.4.0 | Native 
> protocol v4]
>Reporter: Peter Mead
>
> After starting a spark shell using DSE -u  -p x spark
> scala> case class movie_row (actor: String, character_name: String, video_id: 
> java.util.UUID, video_year: Int, title: String)
> defined class movie_row
> scala> val vids=sc.cassandraTable("hcl","videos_by_actor")
> vids: 
> com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow]
>  = CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:15
> scala> val vids=sc.cassandraTable[movie_row]("hcl","videos_by_actor")
> vids: com.datastax.spark.connector.rdd.CassandraTableScanRDD[movie_row] = 
> CassandraTableScanRDD[1] at RDD at CassandraRDD.scala:15
> scala> vids.count
> res0: Long = 114961
>  Works OK!!
> BUT if the spark context is stopped and recreated THEN:
> scala> sc.stop()
> scala> import org.apache.spark.SparkContext, org.apache.spark.SparkContext._, 
> org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext._
> import org.apache.spark.SparkConf
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> val conf = new SparkConf(true)
> .set("spark.cassandra.connection.host", "redacted")
> .set("spark.cassandra.auth.username", "redacted")
> .set("spark.cassandra.auth.password", "redacted")
> // Exiting paste mode, now interpreting.
> conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@e207342
> scala> val sc = new SparkContext("spark://192.168.1.84:7077", "pjm", conf)
> sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@12b8b7c8
> scala> case class movie_row (actor: String, character_name: String, video_id: 
> java.util.UUID, video_year: Int, title: String)
> defined class movie_row
> scala> val vids=sc.cassandraTable[movie_row]("hcl","videos_by_actor")
> vids: com.datastax.spark.connector.rdd.CassandraTableScanRDD[movie_row] = 
> CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:15
> scala> vids.count
> [Stage 0:>  (0 + 2) / 
> 2]WARN  2017-04-07 12:52:03,277 org.apache.spark.scheduler.TaskSetManager: 
> Lost task 0.0 in stage 0.0 (TID 0, cassandra183): 
> java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row
> FAILS!!
> I have been unable to get this to work from a remote SPARK shell!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


