Re: SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-18 Thread Zhan Zhang
Thanks Reynold.

Not sure why doExecute is not invoked, since CollectLimit does not support 
wholeStage

case class CollectLimit(limit: Int, child: SparkPlan) extends UnaryNode {
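
(A minimal sketch of how one might verify this, assuming the Spark 2.0-era internals
where participation in whole-stage codegen is signalled by the CodegenSupport trait;
CollectLimit does not mix it in, so it is not part of the generated whole-stage code:)

import org.apache.spark.sql.execution.{CodegenSupport, SparkPlan}

// True only for operators that can be fused into a WholeStageCodegen stage.
def supportsWholeStageCodegen(plan: SparkPlan): Boolean = plan match {
  case _: CodegenSupport => true
  case _                 => false
}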

I will dig further into this.

Zhan Zhang

On Apr 18, 2016, at 10:36 PM, Reynold Xin wrote:

Anyway we can verify this easily. I just added a println to each row and 
verified that only limit + 1 row was printed after the join and before the 
limit.

It'd be great if you do some debugging yourself and see if it is going through 
some other code path.


On Mon, Apr 18, 2016 at 10:35 PM, Reynold Xin wrote:
But doExecute is not called?

On Mon, Apr 18, 2016 at 10:32 PM, Zhan Zhang wrote:
Hi Reynold,

I just checked the code for CollectLimit; there is a shuffle happening to collect
them into one partition.


protected override def doExecute(): RDD[InternalRow] = {
  val shuffled = new ShuffledRowRDD(
ShuffleExchange.prepareShuffleDependency(
  child.execute(), child.output, SinglePartition, serializer))
  shuffled.mapPartitionsInternal(_.take(limit))
}

Thus, there is no way to avoid processing all data before the shuffle. I think 
that is the reason. Do I understand correctly?
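
(If that is indeed the bottleneck, one mitigation -- a hedged sketch only, reusing the
names from the snippet above rather than Spark's actual implementation -- would be to
apply the limit on each partition before the single-partition shuffle, so at most
limit rows per input partition cross the wire:)

protected override def doExecute(): RDD[InternalRow] = {
  // Take at most `limit` rows from each input partition first...
  val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
  // ...then shuffle the already-bounded data into a single partition...
  val shuffled = new ShuffledRowRDD(
    ShuffleExchange.prepareShuffleDependency(
      locallyLimited, child.output, SinglePartition, serializer))
  // ...and apply the global limit on the single result partition.
  shuffled.mapPartitionsInternal(_.take(limit))
}

(That bounds the shuffled data, though the upstream operators would still run.)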

Thanks.

Zhan Zhang
On Apr 18, 2016, at 10:08 PM, Reynold Xin wrote:

Unless I'm really missing something I don't think so. As I said, it goes 
through an iterator, and after processing each stream-side row we do a shouldStop 
check. The generated code looks like

/* 094 */   protected void processNext() throws java.io.IOException {
/* 095 */ /*** PRODUCE: Project [id#79L] */
/* 096 */
/* 097 */ /*** PRODUCE: BroadcastHashJoin [id#79L], [id#82L], Inner, 
BuildRight, None */
/* 098 */
/* 099 */ /*** PRODUCE: Range 0, 1, 8, 100, [id#79L] */
/* 100 */
/* 101 */ // initialize Range
/* 102 */ if (!range_initRange) {
/* 103 */   range_initRange = true;
/* 104 */   initRange(partitionIndex);
/* 105 */ }
/* 106 */
/* 107 */ while (!range_overflow && range_number < range_partitionEnd) {
/* 108 */   long range_value = range_number;
/* 109 */   range_number += 1L;
/* 110 */   if (range_number < range_value ^ 1L < 0) {
/* 111 */ range_overflow = true;
/* 112 */   }
/* 113 */
/* 114 */   /*** CONSUME: BroadcastHashJoin [id#79L], [id#82L], Inner, 
BuildRight, None */
/* 115 */
/* 116 */   // generate join key for stream side
/* 117 */
/* 118 */   // find matches from HashedRelation
/* 119 */   UnsafeRow bhj_matched = false ? null: 
(UnsafeRow)bhj_relation.getValue(range_value);
/* 120 */   if (bhj_matched == null) continue;
/* 121 */
/* 122 */   bhj_metricValue.add(1);
/* 123 */
/* 124 */   /*** CONSUME: Project [id#79L] */
/* 125 */
/* 126 */   System.out.println("i got one row");
/* 127 */
/* 128 */   /*** CONSUME: WholeStageCodegen */
/* 129 */
/* 130 */   project_rowWriter.write(0, range_value);
/* 131 */   append(project_result);
/* 132 */
/* 133 */   if (shouldStop()) return;
/* 134 */ }
/* 135 */   }
/* 136 */ }


shouldStop returns true once we go past the limit.
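
(A stripped-down sketch of that consumer-side mechanism -- the names below are
illustrative, not Spark's exact BufferedRowIterator API: append buffers a produced
row, and shouldStop starts returning true once limit rows have been buffered, so the
generated while-loop returns almost immediately after the limit is hit:)

// Illustrative only: a tiny stand-in for the generated loop / consumer interaction.
class LimitedBuffer(limit: Int) {
  private val rows = scala.collection.mutable.ArrayBuffer.empty[Long]
  def append(row: Long): Unit = rows += row        // called like the generated append(...)
  def shouldStop(): Boolean = rows.size >= limit   // checked at the bottom of the while loop
  def result: Seq[Long] = rows.toSeq
}

val buf = new LimitedBuffer(limit = 10)
var id = 0L
var done = false
while (!done) {
  buf.append(id)                     // produce one joined row
  id += 1L
  if (buf.shouldStop()) done = true  // mirrors `if (shouldStop()) return;`
}
// buf.result.size == 10: the loop exits right after the limit is reached.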



On Mon, Apr 18, 2016 at 9:44 PM, Zhan Zhang wrote:
From the physical plan, the limit is one level up from the WholeStageCodegen.
Thus, I don't think shouldStop would work here. To make it work, the limit has
to be part of the WholeStageCodegen.

Correct me if I am wrong.

Thanks.

Zhan Zhang

On Apr 18, 2016, at 11:09 AM, Reynold Xin wrote:

I could be wrong but I think we currently do that through whole stage codegen. 
After processing every row on the stream side, the generated code for broadcast 
join checks whether it has hit the limit or not (through this thing called 
shouldStop).

It is not the most optimal solution, because a single stream side row might 
output multiple hits, but it is usually not a problem.
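
(A hedged toy illustration of that corner case, outside of Spark: when one stream-side
key has several matches on the broadcast build side, all of them are emitted before
the next shouldStop-style check, so the output can slightly overshoot the limit:)

val streamSide = Seq(1, 2, 3)
val buildSide  = Map(1 -> Seq("a", "b", "c"))   // key 1 has three build-side matches
val limit      = 2
val out = scala.collection.mutable.ArrayBuffer.empty[(Int, String)]

// The stop condition is only re-checked between stream-side rows, as in the
// generated loop above, so all matches for one row are appended in one go.
streamSide.iterator.takeWhile(_ => out.size < limit).foreach { key =>
  buildSide.getOrElse(key, Nil).foreach(v => out += ((key, v)))
}

println(out)   // ArrayBuffer((1,a), (1,b), (1,c)) -- 3 rows even though limit = 2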


On Mon, Apr 18, 2016 at 10:46 AM, Andrew Ray wrote:
While you can't automatically push the limit *through* the join, we could push 
it *into* the join (stop processing after generating 10 records). I believe 
that is what Rajesh is suggesting.

On Tue, Apr 12, 2016 at 7:46 AM, Herman van Hövell tot Westerflier wrote:
I am not sure if you can push a limit through a join. This becomes problematic 
if not all keys are present on both sides; in such a case a limit can produce 
fewer rows than the set limit.

This might be a rare case in which whole stage codegen is slower, due to the 
fact that we need to buffer the result of such a stage. You could try to 
disable it by setting "spark.sql.codegen.wholeStage" to false.

2016-04-12 14:32 GMT+02:00 Rajesh 

Re: SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-18 Thread Reynold Xin
Anyway we can verify this easily. I just added a println to each row and
verified that only limit + 1 row was printed after the join and before the
limit.

It'd be great if you do some debugging yourself and see if it is going
through some other code path.


On Mon, Apr 18, 2016 at 10:35 PM, Reynold Xin  wrote:

> But doExecute is not called?
>
> On Mon, Apr 18, 2016 at 10:32 PM, Zhan Zhang 
> wrote:
>
>> Hi Reynold,
>>
>> I just check the code for CollectLimit, there is a shuffle happening to
>> collect them in one partition.
>>
>> protected override def doExecute(): RDD[InternalRow] = {
>>   val shuffled = new ShuffledRowRDD(
>> ShuffleExchange.prepareShuffleDependency(
>>   child.execute(), child.output, SinglePartition, serializer))
>>   shuffled.mapPartitionsInternal(_.take(limit))
>> }
>>
>> Thus, there is no way to avoid processing all data before the shuffle. I
>> think that is the reason. Do I understand correctly?
>>
>> Thanks.
>>
>> Zhan Zhang
>> On Apr 18, 2016, at 10:08 PM, Reynold Xin  wrote:
>>
>> Unless I'm really missing something I don't think so. As I said, it goes
>> through an iterator and after processing each stream side we do a
>> shouldStop check. The generated code looks like
>>
>> /* 094 */   protected void processNext() throws java.io.IOException {
>> /* 095 */ /*** PRODUCE: Project [id#79L] */
>> /* 096 */
>> /* 097 */ /*** PRODUCE: BroadcastHashJoin [id#79L], [id#82L], Inner,
>> BuildRight, None */
>> /* 098 */
>> /* 099 */ /*** PRODUCE: Range 0, 1, 8, 100, [id#79L] */
>> /* 100 */
>> /* 101 */ // initialize Range
>> /* 102 */ if (!range_initRange) {
>> /* 103 */   range_initRange = true;
>> /* 104 */   initRange(partitionIndex);
>> /* 105 */ }
>> /* 106 */
>> /* 107 */ while (!range_overflow && range_number <
>> range_partitionEnd) {
>> /* 108 */   long range_value = range_number;
>> /* 109 */   range_number += 1L;
>> /* 110 */   if (range_number < range_value ^ 1L < 0) {
>> /* 111 */ range_overflow = true;
>> /* 112 */   }
>> /* 113 */
>> /* 114 */   /*** CONSUME: BroadcastHashJoin [id#79L], [id#82L],
>> Inner, BuildRight, None */
>> /* 115 */
>> /* 116 */   // generate join key for stream side
>> /* 117 */
>> /* 118 */   // find matches from HashedRelation
>> /* 119 */   UnsafeRow bhj_matched = false ? null:
>> (UnsafeRow)bhj_relation.getValue(range_value);
>> /* 120 */   if (bhj_matched == null) continue;
>> /* 121 */
>> /* 122 */   bhj_metricValue.add(1);
>> /* 123 */
>> /* 124 */   /*** CONSUME: Project [id#79L] */
>> /* 125 */
>> /* 126 */   System.out.println("i got one row");
>> /* 127 */
>> /* 128 */   /*** CONSUME: WholeStageCodegen */
>> /* 129 */
>> /* 130 */   project_rowWriter.write(0, range_value);
>> /* 131 */   append(project_result);
>> /* 132 */
>> /* 133 */   if (shouldStop()) return;
>> /* 134 */ }
>> /* 135 */   }
>> /* 136 */ }
>>
>>
>> shouldStop is false once we go pass the limit.
>>
>>
>>
>> On Mon, Apr 18, 2016 at 9:44 PM, Zhan Zhang 
>> wrote:
>>
>>> From the physical plan, the limit is one level up than the
>>> WholeStageCodegen, Thus, I don’t think shouldStop would work here. To move
>>> it work, the limit has to be part of the wholeStageCodeGen.
>>>
>>> Correct me if I am wrong.
>>>
>>> Thanks.
>>>
>>> Zhan Zhang
>>>
>>> On Apr 18, 2016, at 11:09 AM, Reynold Xin  wrote:
>>>
>>> I could be wrong but I think we currently do that through whole stage
>>> codegen. After processing every row on the stream side, the generated code
>>> for broadcast join checks whether it has hit the limit or not (through this
>>> thing called shouldStop).
>>>
>>> It is not the most optimal solution, because a single stream side row
>>> might output multiple hits, but it is usually not a problem.
>>>
>>>
>>> On Mon, Apr 18, 2016 at 10:46 AM, Andrew Ray 
>>> wrote:
>>>
 While you can't automatically push the limit *through* the join, we
 could push it *into* the join (stop processing after generating 10
 records). I believe that is what Rajesh is suggesting.

 On Tue, Apr 12, 2016 at 7:46 AM, Herman van Hövell tot Westerflier <
 hvanhov...@questtec.nl> wrote:

> I am not sure if you can push a limit through a join. This becomes
> problematic if not all keys are present on both sides; in such a case a
> limit can produce fewer rows than the set limit.
>
> This might be a rare case in which whole stage codegen is slower, due
> to the fact that we need to buffer the result of such a stage. You could
> try to disable it by setting "spark.sql.codegen.wholeStage" to false.
>
> 2016-04-12 14:32 GMT+02:00 Rajesh Balamohan <
> rajesh.balamo...@gmail.com>:
>
>> Hi,
>>
>> I ran the following query in spark (latest master codebase) and it
>> 

Re: SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-18 Thread Reynold Xin
But doExecute is not called?

On Mon, Apr 18, 2016 at 10:32 PM, Zhan Zhang  wrote:

> Hi Reynold,
>
> I just check the code for CollectLimit, there is a shuffle happening to
> collect them in one partition.
>
> protected override def doExecute(): RDD[InternalRow] = {
>   val shuffled = new ShuffledRowRDD(
> ShuffleExchange.prepareShuffleDependency(
>   child.execute(), child.output, SinglePartition, serializer))
>   shuffled.mapPartitionsInternal(_.take(limit))
> }
>
> Thus, there is no way to avoid processing all data before the shuffle. I
> think that is the reason. Do I understand correctly?
>
> Thanks.
>
> Zhan Zhang
> On Apr 18, 2016, at 10:08 PM, Reynold Xin  wrote:
>
> Unless I'm really missing something I don't think so. As I said, it goes
> through an iterator and after processing each stream side we do a
> shouldStop check. The generated code looks like
>
> /* 094 */   protected void processNext() throws java.io.IOException {
> /* 095 */ /*** PRODUCE: Project [id#79L] */
> /* 096 */
> /* 097 */ /*** PRODUCE: BroadcastHashJoin [id#79L], [id#82L], Inner,
> BuildRight, None */
> /* 098 */
> /* 099 */ /*** PRODUCE: Range 0, 1, 8, 100, [id#79L] */
> /* 100 */
> /* 101 */ // initialize Range
> /* 102 */ if (!range_initRange) {
> /* 103 */   range_initRange = true;
> /* 104 */   initRange(partitionIndex);
> /* 105 */ }
> /* 106 */
> /* 107 */ while (!range_overflow && range_number < range_partitionEnd)
> {
> /* 108 */   long range_value = range_number;
> /* 109 */   range_number += 1L;
> /* 110 */   if (range_number < range_value ^ 1L < 0) {
> /* 111 */ range_overflow = true;
> /* 112 */   }
> /* 113 */
> /* 114 */   /*** CONSUME: BroadcastHashJoin [id#79L], [id#82L], Inner,
> BuildRight, None */
> /* 115 */
> /* 116 */   // generate join key for stream side
> /* 117 */
> /* 118 */   // find matches from HashedRelation
> /* 119 */   UnsafeRow bhj_matched = false ? null:
> (UnsafeRow)bhj_relation.getValue(range_value);
> /* 120 */   if (bhj_matched == null) continue;
> /* 121 */
> /* 122 */   bhj_metricValue.add(1);
> /* 123 */
> /* 124 */   /*** CONSUME: Project [id#79L] */
> /* 125 */
> /* 126 */   System.out.println("i got one row");
> /* 127 */
> /* 128 */   /*** CONSUME: WholeStageCodegen */
> /* 129 */
> /* 130 */   project_rowWriter.write(0, range_value);
> /* 131 */   append(project_result);
> /* 132 */
> /* 133 */   if (shouldStop()) return;
> /* 134 */ }
> /* 135 */   }
> /* 136 */ }
>
>
> shouldStop is false once we go pass the limit.
>
>
>
> On Mon, Apr 18, 2016 at 9:44 PM, Zhan Zhang 
> wrote:
>
>> From the physical plan, the limit is one level up than the
>> WholeStageCodegen, Thus, I don’t think shouldStop would work here. To move
>> it work, the limit has to be part of the wholeStageCodeGen.
>>
>> Correct me if I am wrong.
>>
>> Thanks.
>>
>> Zhan Zhang
>>
>> On Apr 18, 2016, at 11:09 AM, Reynold Xin  wrote:
>>
>> I could be wrong but I think we currently do that through whole stage
>> codegen. After processing every row on the stream side, the generated code
>> for broadcast join checks whether it has hit the limit or not (through this
>> thing called shouldStop).
>>
>> It is not the most optimal solution, because a single stream side row
>> might output multiple hits, but it is usually not a problem.
>>
>>
>> On Mon, Apr 18, 2016 at 10:46 AM, Andrew Ray 
>> wrote:
>>
>>> While you can't automatically push the limit *through* the join, we
>>> could push it *into* the join (stop processing after generating 10
>>> records). I believe that is what Rajesh is suggesting.
>>>
>>> On Tue, Apr 12, 2016 at 7:46 AM, Herman van Hövell tot Westerflier <
>>> hvanhov...@questtec.nl> wrote:
>>>
 I am not sure if you can push a limit through a join. This becomes
 problematic if not all keys are present on both sides; in such a case a
 limit can produce fewer rows than the set limit.

 This might be a rare case in which whole stage codegen is slower, due
 to the fact that we need to buffer the result of such a stage. You could
 try to disable it by setting "spark.sql.codegen.wholeStage" to false.

 2016-04-12 14:32 GMT+02:00 Rajesh Balamohan :

> Hi,
>
> I ran the following query in spark (latest master codebase) and it
> took a lot of time to complete even though it was a broadcast hash join.
>
> It appears that limit computation is done only after computing
> complete join condition.  Shouldn't the limit condition be pushed to
> BroadcastHashJoin (wherein it would have to stop processing after
> generating 10 rows?).  Please let me know if my understanding on this is
> wrong.
>
>
> select l_partkey from lineitem, partsupp where ps_partkey=l_partkey

Re: SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-18 Thread Zhan Zhang
Hi Reynold,

I just checked the code for CollectLimit; there is a shuffle happening to collect
them into one partition.


protected override def doExecute(): RDD[InternalRow] = {
  val shuffled = new ShuffledRowRDD(
ShuffleExchange.prepareShuffleDependency(
  child.execute(), child.output, SinglePartition, serializer))
  shuffled.mapPartitionsInternal(_.take(limit))
}

Thus, there is no way to avoid processing all data before the shuffle. I think 
that is the reason. Do I understand correctly?

Thanks.

Zhan Zhang
On Apr 18, 2016, at 10:08 PM, Reynold Xin wrote:

Unless I'm really missing something I don't think so. As I said, it goes 
through an iterator, and after processing each stream-side row we do a shouldStop 
check. The generated code looks like

/* 094 */   protected void processNext() throws java.io.IOException {
/* 095 */ /*** PRODUCE: Project [id#79L] */
/* 096 */
/* 097 */ /*** PRODUCE: BroadcastHashJoin [id#79L], [id#82L], Inner, 
BuildRight, None */
/* 098 */
/* 099 */ /*** PRODUCE: Range 0, 1, 8, 100, [id#79L] */
/* 100 */
/* 101 */ // initialize Range
/* 102 */ if (!range_initRange) {
/* 103 */   range_initRange = true;
/* 104 */   initRange(partitionIndex);
/* 105 */ }
/* 106 */
/* 107 */ while (!range_overflow && range_number < range_partitionEnd) {
/* 108 */   long range_value = range_number;
/* 109 */   range_number += 1L;
/* 110 */   if (range_number < range_value ^ 1L < 0) {
/* 111 */ range_overflow = true;
/* 112 */   }
/* 113 */
/* 114 */   /*** CONSUME: BroadcastHashJoin [id#79L], [id#82L], Inner, 
BuildRight, None */
/* 115 */
/* 116 */   // generate join key for stream side
/* 117 */
/* 118 */   // find matches from HashedRelation
/* 119 */   UnsafeRow bhj_matched = false ? null: 
(UnsafeRow)bhj_relation.getValue(range_value);
/* 120 */   if (bhj_matched == null) continue;
/* 121 */
/* 122 */   bhj_metricValue.add(1);
/* 123 */
/* 124 */   /*** CONSUME: Project [id#79L] */
/* 125 */
/* 126 */   System.out.println("i got one row");
/* 127 */
/* 128 */   /*** CONSUME: WholeStageCodegen */
/* 129 */
/* 130 */   project_rowWriter.write(0, range_value);
/* 131 */   append(project_result);
/* 132 */
/* 133 */   if (shouldStop()) return;
/* 134 */ }
/* 135 */   }
/* 136 */ }


shouldStop returns true once we go past the limit.



On Mon, Apr 18, 2016 at 9:44 PM, Zhan Zhang wrote:
From the physical plan, the limit is one level up from the WholeStageCodegen.
Thus, I don't think shouldStop would work here. To make it work, the limit has
to be part of the WholeStageCodegen.

Correct me if I am wrong.

Thanks.

Zhan Zhang

On Apr 18, 2016, at 11:09 AM, Reynold Xin wrote:

I could be wrong but I think we currently do that through whole stage codegen. 
After processing every row on the stream side, the generated code for broadcast 
join checks whether it has hit the limit or not (through this thing called 
shouldStop).

It is not the most optimal solution, because a single stream side row might 
output multiple hits, but it is usually not a problem.


On Mon, Apr 18, 2016 at 10:46 AM, Andrew Ray wrote:
While you can't automatically push the limit *through* the join, we could push 
it *into* the join (stop processing after generating 10 records). I believe 
that is what Rajesh is suggesting.

On Tue, Apr 12, 2016 at 7:46 AM, Herman van Hövell tot Westerflier wrote:
I am not sure if you can push a limit through a join. This becomes problematic 
if not all keys are present on both sides; in such a case a limit can produce 
fewer rows than the set limit.

This might be a rare case in which whole stage codegen is slower, due to the 
fact that we need to buffer the result of such a stage. You could try to 
disable it by setting "spark.sql.codegen.wholeStage" to false.

2016-04-12 14:32 GMT+02:00 Rajesh Balamohan:
Hi,

I ran the following query in spark (latest master codebase) and it took a lot 
of time to complete even though it was a broadcast hash join.

It appears that the limit computation is done only after evaluating the complete join 
condition. Shouldn't the limit be pushed into BroadcastHashJoin (so that it would 
stop processing after generating 10 rows)? Please let me know if my understanding 
of this is wrong.


select l_partkey from lineitem, partsupp where ps_partkey=l_partkey limit 10;


| == Physical Plan ==
CollectLimit 10
+- WholeStageCodegen
   :  +- Project [l_partkey#893]
   : +- BroadcastHashJoin [l_partkey#893], [ps_partkey#908], Inner, 
BuildRight, None
   ::- Project [l_partkey#893]
   ::  +- Filter 

Re: SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-18 Thread Reynold Xin
Unless I'm really missing something I don't think so. As I said, it goes
through an iterator, and after processing each stream-side row we do a
shouldStop check. The generated code looks like

/* 094 */   protected void processNext() throws java.io.IOException {
/* 095 */ /*** PRODUCE: Project [id#79L] */
/* 096 */
/* 097 */ /*** PRODUCE: BroadcastHashJoin [id#79L], [id#82L], Inner,
BuildRight, None */
/* 098 */
/* 099 */ /*** PRODUCE: Range 0, 1, 8, 100, [id#79L] */
/* 100 */
/* 101 */ // initialize Range
/* 102 */ if (!range_initRange) {
/* 103 */   range_initRange = true;
/* 104 */   initRange(partitionIndex);
/* 105 */ }
/* 106 */
/* 107 */ while (!range_overflow && range_number < range_partitionEnd) {
/* 108 */   long range_value = range_number;
/* 109 */   range_number += 1L;
/* 110 */   if (range_number < range_value ^ 1L < 0) {
/* 111 */ range_overflow = true;
/* 112 */   }
/* 113 */
/* 114 */   /*** CONSUME: BroadcastHashJoin [id#79L], [id#82L], Inner,
BuildRight, None */
/* 115 */
/* 116 */   // generate join key for stream side
/* 117 */
/* 118 */   // find matches from HashedRelation
/* 119 */   UnsafeRow bhj_matched = false ? null:
(UnsafeRow)bhj_relation.getValue(range_value);
/* 120 */   if (bhj_matched == null) continue;
/* 121 */
/* 122 */   bhj_metricValue.add(1);
/* 123 */
/* 124 */   /*** CONSUME: Project [id#79L] */
/* 125 */
/* 126 */   System.out.println("i got one row");
/* 127 */
/* 128 */   /*** CONSUME: WholeStageCodegen */
/* 129 */
/* 130 */   project_rowWriter.write(0, range_value);
/* 131 */   append(project_result);
/* 132 */
/* 133 */   if (shouldStop()) return;
/* 134 */ }
/* 135 */   }
/* 136 */ }


shouldStop returns true once we go past the limit.



On Mon, Apr 18, 2016 at 9:44 PM, Zhan Zhang  wrote:

> From the physical plan, the limit is one level up than the
> WholeStageCodegen, Thus, I don’t think shouldStop would work here. To move
> it work, the limit has to be part of the wholeStageCodeGen.
>
> Correct me if I am wrong.
>
> Thanks.
>
> Zhan Zhang
>
> On Apr 18, 2016, at 11:09 AM, Reynold Xin  wrote:
>
> I could be wrong but I think we currently do that through whole stage
> codegen. After processing every row on the stream side, the generated code
> for broadcast join checks whether it has hit the limit or not (through this
> thing called shouldStop).
>
> It is not the most optimal solution, because a single stream side row
> might output multiple hits, but it is usually not a problem.
>
>
> On Mon, Apr 18, 2016 at 10:46 AM, Andrew Ray  wrote:
>
>> While you can't automatically push the limit *through* the join, we could
>> push it *into* the join (stop processing after generating 10 records). I
>> believe that is what Rajesh is suggesting.
>>
>> On Tue, Apr 12, 2016 at 7:46 AM, Herman van Hövell tot Westerflier <
>> hvanhov...@questtec.nl> wrote:
>>
>>> I am not sure if you can push a limit through a join. This becomes
>>> problematic if not all keys are present on both sides; in such a case a
>>> limit can produce fewer rows than the set limit.
>>>
>>> This might be a rare case in which whole stage codegen is slower, due to
>>> the fact that we need to buffer the result of such a stage. You could try
>>> to disable it by setting "spark.sql.codegen.wholeStage" to false.
>>>
>>> 2016-04-12 14:32 GMT+02:00 Rajesh Balamohan 
>>> :
>>>
 Hi,

 I ran the following query in spark (latest master codebase) and it took
 a lot of time to complete even though it was a broadcast hash join.

 It appears that limit computation is done only after computing complete
 join condition.  Shouldn't the limit condition be pushed to
 BroadcastHashJoin (wherein it would have to stop processing after
 generating 10 rows?).  Please let me know if my understanding on this is
 wrong.


 select l_partkey from lineitem, partsupp where ps_partkey=l_partkey
 limit 10;

 
 | == Physical Plan ==
 CollectLimit 10
 +- WholeStageCodegen
:  +- Project [l_partkey#893]
: +- BroadcastHashJoin [l_partkey#893], [ps_partkey#908], Inner,
 BuildRight, None
::- Project [l_partkey#893]
::  +- Filter isnotnull(l_partkey#893)
:: +- Scan HadoopFiles[l_partkey#893] Format: ORC,
 PushedFilters: [IsNotNull(l_partkey)], ReadSchema: struct
:+- INPUT
+- BroadcastExchange
 HashedRelationBroadcastMode(true,List(cast(ps_partkey#908 as
 bigint)),List(ps_partkey#908))
   +- WholeStageCodegen
  :  +- Project [ps_partkey#908]
  : +- Filter isnotnull(ps_partkey#908)
  :+- Scan HadoopFiles[ps_partkey#908] Format: ORC,
 PushedFilters: [IsNotNull(ps_partkey)], ReadSchema: struct
  

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Reynold Xin
Thanks a lot for commenting. We are getting great feedback on this thread.
The take-aways are:

1. In general, people prefer having explicit reasons why pull requests
should be closed. We should push committers to leave messages that are more
explicit about why a certain PR should or should not be closed. I can't agree more.
But this is not mutually exclusive.


2.  It is difficult to deal with the scale we are talking about. There is
not a single measure that could "fix" everything.

Spark is, as far as I know, one of the most active open source projects in
terms of contributions, in part because we have made it very easy to accept
contributions. Very few open source projects have had to deal with this
scale. Actually, if you look at all the historic PRs, we have closed 12k and
have ~450 open. That's less than 4% of the PRs outstanding -- not a bad
number. The actual ratio is likely even lower because many of the 450 open
will be merged in the future.

I also took a look at some of the most popular projects on GitHub (e.g.
jquery, angular, react) -- they either have far fewer merged pull requests
or a higher open-to-closed ratio. So we are actually doing pretty well.
But of course there is always room for improvement.




On Mon, Apr 18, 2016 at 8:46 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Relevant: https://github.com/databricks/spark-pr-dashboard/issues/1
>
> A lot of this was discussed a while back when the PR Dashboard was first
> introduced, and several times before and after that as well. (e.g. August
> 2014
> 
> )
>
> If there is not enough momentum to build the tooling that people are
> discussing here, then perhaps Reynold's suggestion is the most practical
> one that is likely to see the light of day.
>
> I think asking committers to be more active in commenting on PRs is
> theoretically the correct thing to do, but impractical. I'm not a
> committer, but I would guess that most of them are already way
> overcommitted (ha!) and asking them to do more just won't yield results.
>
> We've had several instances in the past where we all tried to rally
> 
> and be more proactive about giving feedback, closing PRs, and nudging
> contributors who have gone silent. My observation is that the level of
> energy required to "properly" curate PR activity in that way is simply not
> sustainable. People can do it for a few weeks and then things revert to the
> way they are now.
>
> Perhaps the missing link that would make this sustainable is better
> tooling. If you think so and can sling some Javascript, you might want to
> contribute to the PR Dashboard .
>
> Perhaps the missing link is something else: A different PR review process;
> more committers; a higher barrier to contributing; a combination thereof;
> etc...
>
> Also relevant: http://danluu.com/discourage-oss/
>
> By the way, some people noted that closing PRs may discourage
> contributors. I think our open PR count alone is very discouraging. Under
> what circumstances would you feel encouraged to open a PR against a project
> that has hundreds of open PRs, some from many, many months ago
> 
> ?
>
> Nick
>
>
> On Mon, Apr 18, 2016 at 10:30 PM, Ted Yu wrote:
>
>> During the months of November / December, the 30 day period should be
>> relaxed.
>>
>> Some people(at least in US) may take extended vacation during that time.
>>
>> For Chinese developers, Spring Festival would bear similar circumstance.
>>
>> On Mon, Apr 18, 2016 at 7:25 PM, Hyukjin Kwon 
>> wrote:
>>
>>> I also think this might not have to be closed only because it is
>>> inactive.
>>>
>>>
>>> How about closing issues after 30 days when a committer's comment is
>>> added at the last without responses from the author?
>>>
>>>
>>> IMHO, If the committers are not sure whether the patch would be useful,
>>> then I think they should leave some comments why they are not sure, not
>>> just ignoring.
>>>
>>> Or, simply they could ask the author to prove that the patch is useful
>>> or safe with some references and tests.
>>>
>>>
>>> I think it might be nicer than that users are supposed to keep pinging.
>>> **Personally**, apparently, I am sometimes a bit worried if pinging
>>> multiple times can be a bit annoying.
>>>
>>>
>>>
>>> 2016-04-19 9:56 GMT+09:00 Saisai Shao :
>>>
 It would be better to have a specific technical reason why this PR
 should be closed, either the implementation is not good or the problem is
 not valid, or something else. That will actually help the contributor to
 shape their codes and reopen the PR again. Otherwise reasons like "feel
 

Re: SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-18 Thread Zhan Zhang
From the physical plan, the limit is one level up from the WholeStageCodegen.
Thus, I don't think shouldStop would work here. To make it work, the limit has
to be part of the WholeStageCodegen.

Correct me if I am wrong.

Thanks.

Zhan Zhang

On Apr 18, 2016, at 11:09 AM, Reynold Xin wrote:

I could be wrong but I think we currently do that through whole stage codegen. 
After processing every row on the stream side, the generated code for broadcast 
join checks whether it has hit the limit or not (through this thing called 
shouldStop).

It is not the most optimal solution, because a single stream side row might 
output multiple hits, but it is usually not a problem.


On Mon, Apr 18, 2016 at 10:46 AM, Andrew Ray wrote:
While you can't automatically push the limit *through* the join, we could push 
it *into* the join (stop processing after generating 10 records). I believe 
that is what Rajesh is suggesting.

On Tue, Apr 12, 2016 at 7:46 AM, Herman van Hövell tot Westerflier wrote:
I am not sure if you can push a limit through a join. This becomes problematic 
if not all keys are present on both sides; in such a case a limit can produce 
fewer rows than the set limit.

This might be a rare case in which whole stage codegen is slower, due to the 
fact that we need to buffer the result of such a stage. You could try to 
disable it by setting "spark.sql.codegen.wholeStage" to false.

2016-04-12 14:32 GMT+02:00 Rajesh Balamohan:
Hi,

I ran the following query in spark (latest master codebase) and it took a lot 
of time to complete even though it was a broadcast hash join.

It appears that the limit computation is done only after evaluating the complete join 
condition. Shouldn't the limit be pushed into BroadcastHashJoin (so that it would 
stop processing after generating 10 rows)? Please let me know if my understanding 
of this is wrong.


select l_partkey from lineitem, partsupp where ps_partkey=l_partkey limit 10;


| == Physical Plan ==
CollectLimit 10
+- WholeStageCodegen
   :  +- Project [l_partkey#893]
   : +- BroadcastHashJoin [l_partkey#893], [ps_partkey#908], Inner, 
BuildRight, None
   ::- Project [l_partkey#893]
   ::  +- Filter isnotnull(l_partkey#893)
   :: +- Scan HadoopFiles[l_partkey#893] Format: ORC, 
PushedFilters: [IsNotNull(l_partkey)], ReadSchema: struct
   :+- INPUT
   +- BroadcastExchange 
HashedRelationBroadcastMode(true,List(cast(ps_partkey#908 as 
bigint)),List(ps_partkey#908))
  +- WholeStageCodegen
 :  +- Project [ps_partkey#908]
 : +- Filter isnotnull(ps_partkey#908)
 :+- Scan HadoopFiles[ps_partkey#908] Format: ORC, 
PushedFilters: [IsNotNull(ps_partkey)], ReadSchema: struct  |





--
~Rajesh.B






Re: more uniform exception handling?

2016-04-18 Thread Zhan Zhang
+1
Both of these would be very helpful in debugging.

Thanks.

Zhan Zhang

On Apr 18, 2016, at 1:18 PM, Evan Chan  wrote:

> +1000.
> 
> Especially if the UI can help correlate exceptions, and we can reduce
> some exceptions.
> 
> There are some exceptions which are in practice very common, such as
> the nasty ClassNotFoundException, that most folks end up spending tons
> of time debugging.
> 
> 
> On Mon, Apr 18, 2016 at 12:16 PM, Reynold Xin  wrote:
>> Josh's pull request on rpc exception handling got me to think ...
>> 
>> In my experience, there have been a few things related exceptions that
>> created a lot of trouble for us in production debugging:
>> 
>> 1. Some exception is thrown, but is caught by some try/catch that does not
>> do any logging nor rethrow.
>> 2. Some exception is thrown, but is caught by some try/catch that does not
>> do any logging but does rethrow; however, the original exception is now masked.
>> 3. Multiple exceptions are logged at different places close to each other,
>> but we don't know whether they are caused by the same problem or not.
>> 
>> 
>> To mitigate some of the above, here's an idea ...
>> 
>> (1) Create a common root class for all the exceptions (e.g. call it
>> SparkException) used in Spark. We should make sure every time we catch an
>> exception from a 3rd party library, we rethrow them as SparkException (a lot
>> of places already do that). In SparkException's constructor, log the
>> exception and the stacktrace.
>> 
>> (2) SparkException has a monotonically increasing ID, and this ID appears in
>> the exception error message (say at the end).
>> 
>> 
>> I think (1) will eliminate most of the cases that an exception gets
>> swallowed. The main downside I can think of is we might log an exception
>> multiple times. However, I'd argue exceptions should be rare, and it is not
>> that big of a deal to log them twice or three times. The unique ID (2) can
>> help us correlate exceptions if they appear multiple times.
>> 
>> Thoughts?
>> 
>> 
>> 
>> 
>> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 
> 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
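
(A hedged sketch of the proposal quoted above -- the class body is illustrative
and not Spark's existing org.apache.spark.SparkException: a common root exception
that logs itself once at construction and carries a monotonically increasing id so
repeated log lines can be correlated:)

import java.util.concurrent.atomic.AtomicLong
import org.slf4j.LoggerFactory

object SparkExceptionIds {
  private val counter = new AtomicLong(0L)
  def next(): Long = counter.incrementAndGet()
}

class SparkException(message: String, cause: Throwable = null)
  extends Exception(message, cause) {

  // (2) Monotonically increasing ID, surfaced in the message for correlation.
  val errorId: Long = SparkExceptionIds.next()

  // (1) Log once at construction, so a swallowed exception still leaves a trace.
  LoggerFactory.getLogger(classOf[SparkException])
    .error(s"[error id $errorId] $message", cause)

  override def getMessage: String = s"${super.getMessage} (error id: $errorId)"
}

// Example of wrapping a 3rd-party exception (thirdPartyCall is hypothetical):
// try { thirdPartyCall() } catch { case e: Exception => throw new SparkException("call failed", e) }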



Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Ted Yu
bq. there should be more committers or they are asked to be more active.

Bingo.

bq. they can't be closed only because it is "expired" with a copy and
pasted message.

+1

On Mon, Apr 18, 2016 at 9:14 PM, Hyukjin Kwon  wrote:

> I don't think asking committers to be more active is impractical. I am
> not too sure if other projects apply the same rules here but
>
> I think if a project is being more popular, then I think it is appropriate
> that there should be more committers or they are asked to be more active.
>
>
> In addition, I believe there are a lot of PRs waiting for committer's
> comments.
>
>
> If committers are too busy to review a PR, then I think they better ask
> authors to provide the evidence to decide, maybe with a message such as
>
> "I am currently too busy to review or decide. Could you please add some 
> evidence/benchmark/performance
> test or survey for demands?"
>
>
> If the evidence is not enough or not easy to see, then they can ask to
> simplify the evidence or make a proper conclusion, maybe with a message
> such as
>
> "I think the evidence is not enough/trustable because  Could you
> please simplify/provide some more evidence?".
>
>
>
> Or, I think they can be manually closed with a explicit message such as
>
> "This is closed for now because we are not sure for this patch because.."
>
>
> I think they can't be closed only because it is "expired" with a copy and
> pasted message.
>
>
>
> 2016-04-19 12:46 GMT+09:00 Nicholas Chammas :
>
>> Relevant: https://github.com/databricks/spark-pr-dashboard/issues/1
>>
>> A lot of this was discussed a while back when the PR Dashboard was first
>> introduced, and several times before and after that as well. (e.g. August
>> 2014
>> 
>> )
>>
>> If there is not enough momentum to build the tooling that people are
>> discussing here, then perhaps Reynold's suggestion is the most practical
>> one that is likely to see the light of day.
>>
>> I think asking committers to be more active in commenting on PRs is
>> theoretically the correct thing to do, but impractical. I'm not a
>> committer, but I would guess that most of them are already way
>> overcommitted (ha!) and asking them to do more just won't yield results.
>>
>> We've had several instances in the past where we all tried to rally
>> 
>> and be more proactive about giving feedback, closing PRs, and nudging
>> contributors who have gone silent. My observation is that the level of
>> energy required to "properly" curate PR activity in that way is simply not
>> sustainable. People can do it for a few weeks and then things revert to the
>> way they are now.
>>
>> Perhaps the missing link that would make this sustainable is better
>> tooling. If you think so and can sling some Javascript, you might want
>> to contribute to the PR Dashboard .
>>
>> Perhaps the missing link is something else: A different PR review
>> process; more committers; a higher barrier to contributing; a combination
>> thereof; etc...
>>
>> Also relevant: http://danluu.com/discourage-oss/
>>
>> By the way, some people noted that closing PRs may discourage
>> contributors. I think our open PR count alone is very discouraging. Under
>> what circumstances would you feel encouraged to open a PR against a project
>> that has hundreds of open PRs, some from many, many months ago
>> 
>> ?
>>
>> Nick
>>
>>
>> On Mon, Apr 18, 2016 at 10:30 PM, Ted Yu wrote:
>>
>>> During the months of November / December, the 30 day period should be
>>> relaxed.
>>>
>>> Some people(at least in US) may take extended vacation during that time.
>>>
>>> For Chinese developers, Spring Festival would bear similar circumstance.
>>>
>>> On Mon, Apr 18, 2016 at 7:25 PM, Hyukjin Kwon 
>>> wrote:
>>>
 I also think this might not have to be closed only because it is
 inactive.


 How about closing issues after 30 days when a committer's comment is
 added at the last without responses from the author?


 IMHO, If the committers are not sure whether the patch would be
 useful, then I think they should leave some comments why they are not sure,
 not just ignoring.

 Or, simply they could ask the author to prove that the patch is useful
 or safe with some references and tests.


 I think it might be nicer than that users are supposed to keep pinging.
 **Personally**, apparently, I am sometimes a bit worried if pinging
 multiple times can be a bit annoying.



 2016-04-19 9:56 GMT+09:00 Saisai Shao :

> It would be better 

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Saisai Shao
>>> By the way, some people noted that closing PRs may discourage
contributors. I think our open PR count alone is very discouraging. Under
what circumstances would you feel encouraged to open a PR against a project
that has hundreds of open PRs, some from many, many months ago?

I think the original meaning of "discouraging contributors" is closing
without specific technical reasons, or just for lack of bandwidth. These PRs
may not be so important to committers/maintainers, but for an individual
contributor, especially someone new to open source, a simple fix for a famous
project means a lot. We could also consider other solutions, like setting a
higher bar up front to reduce the number of incoming PRs.

Thanks
Jerry



On Tue, Apr 19, 2016 at 11:46 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Relevant: https://github.com/databricks/spark-pr-dashboard/issues/1
>
> A lot of this was discussed a while back when the PR Dashboard was first
> introduced, and several times before and after that as well. (e.g. August
> 2014
> 
> )
>
> If there is not enough momentum to build the tooling that people are
> discussing here, then perhaps Reynold's suggestion is the most practical
> one that is likely to see the light of day.
>
> I think asking committers to be more active in commenting on PRs is
> theoretically the correct thing to do, but impractical. I'm not a
> committer, but I would guess that most of them are already way
> overcommitted (ha!) and asking them to do more just won't yield results.
>
> We've had several instances in the past where we all tried to rally
> 
> and be more proactive about giving feedback, closing PRs, and nudging
> contributors who have gone silent. My observation is that the level of
> energy required to "properly" curate PR activity in that way is simply not
> sustainable. People can do it for a few weeks and then things revert to the
> way they are now.
>
> Perhaps the missing link that would make this sustainable is better
> tooling. If you think so and can sling some Javascript, you might want to
> contribute to the PR Dashboard .
>
> Perhaps the missing link is something else: A different PR review process;
> more committers; a higher barrier to contributing; a combination thereof;
> etc...
>
> Also relevant: http://danluu.com/discourage-oss/
>
> By the way, some people noted that closing PRs may discourage
> contributors. I think our open PR count alone is very discouraging. Under
> what circumstances would you feel encouraged to open a PR against a project
> that has hundreds of open PRs, some from many, many months ago
> 
> ?
>
> Nick
>
>
> On Mon, Apr 18, 2016 at 10:30 PM, Ted Yu wrote:
>
>> During the months of November / December, the 30 day period should be
>> relaxed.
>>
>> Some people(at least in US) may take extended vacation during that time.
>>
>> For Chinese developers, Spring Festival would bear similar circumstance.
>>
>> On Mon, Apr 18, 2016 at 7:25 PM, Hyukjin Kwon 
>> wrote:
>>
>>> I also think this might not have to be closed only because it is
>>> inactive.
>>>
>>>
>>> How about closing issues after 30 days when a committer's comment is
>>> added at the last without responses from the author?
>>>
>>>
>>> IMHO, If the committers are not sure whether the patch would be useful,
>>> then I think they should leave some comments why they are not sure, not
>>> just ignoring.
>>>
>>> Or, simply they could ask the author to prove that the patch is useful
>>> or safe with some references and tests.
>>>
>>>
>>> I think it might be nicer than that users are supposed to keep pinging.
>>> **Personally**, apparently, I am sometimes a bit worried if pinging
>>> multiple times can be a bit annoying.
>>>
>>>
>>>
>>> 2016-04-19 9:56 GMT+09:00 Saisai Shao :
>>>
 It would be better to have a specific technical reason why this PR
 should be closed, either the implementation is not good or the problem is
 not valid, or something else. That will actually help the contributor to
 shape their codes and reopen the PR again. Otherwise reasons like "feel
 free to reopen for so-and-so reason" is actually discouraging and no
 difference than directly close the PR.

 Just my two cents.

 Thanks
 Jerry


 On Tue, Apr 19, 2016 at 4:52 AM, Sean Busbey 
 wrote:

> Having a PR closed, especially if due to committers not having the
> bandwidth to check on things, will be very discouraging to new folks.
> Doubly so for those inexperienced with 

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Hyukjin Kwon
I don't think asking committers to be more active is impractical. I am not
too sure whether other projects apply the same rules here, but

I think that if a project is becoming more popular, it is appropriate that
there should be more committers, or that the existing ones are asked to be more active.


In addition, I believe there are a lot of PRs waiting for committer's
comments.


If committers are too busy to review a PR, then I think they should ask the
authors to provide the evidence needed to decide, maybe with a message such as

"I am currently too busy to review or decide. Could you please add some
evidence/benchmark/performance test, or a survey of demand?"


If the evidence is not enough or not easy to see, then they can ask the author
to simplify the evidence or draw a proper conclusion, maybe with a message
such as

"I think the evidence is not enough/trustworthy because ... Could you
please simplify or provide some more evidence?".



Or, I think they can be manually closed with an explicit message such as

"This is closed for now because we are not sure about this patch because.."


I think they can't be closed just because they are "expired", with a
copy-and-pasted message.



2016-04-19 12:46 GMT+09:00 Nicholas Chammas :

> Relevant: https://github.com/databricks/spark-pr-dashboard/issues/1
>
> A lot of this was discussed a while back when the PR Dashboard was first
> introduced, and several times before and after that as well. (e.g. August
> 2014
> 
> )
>
> If there is not enough momentum to build the tooling that people are
> discussing here, then perhaps Reynold's suggestion is the most practical
> one that is likely to see the light of day.
>
> I think asking committers to be more active in commenting on PRs is
> theoretically the correct thing to do, but impractical. I'm not a
> committer, but I would guess that most of them are already way
> overcommitted (ha!) and asking them to do more just won't yield results.
>
> We've had several instances in the past where we all tried to rally
> 
> and be more proactive about giving feedback, closing PRs, and nudging
> contributors who have gone silent. My observation is that the level of
> energy required to "properly" curate PR activity in that way is simply not
> sustainable. People can do it for a few weeks and then things revert to the
> way they are now.
>
> Perhaps the missing link that would make this sustainable is better
> tooling. If you think so and can sling some Javascript, you might want to
> contribute to the PR Dashboard .
>
> Perhaps the missing link is something else: A different PR review process;
> more committers; a higher barrier to contributing; a combination thereof;
> etc...
>
> Also relevant: http://danluu.com/discourage-oss/
>
> By the way, some people noted that closing PRs may discourage
> contributors. I think our open PR count alone is very discouraging. Under
> what circumstances would you feel encouraged to open a PR against a project
> that has hundreds of open PRs, some from many, many months ago
> 
> ?
>
> Nick
>
>
> On Mon, Apr 18, 2016 at 10:30 PM, Ted Yu wrote:
>
>> During the months of November / December, the 30 day period should be
>> relaxed.
>>
>> Some people(at least in US) may take extended vacation during that time.
>>
>> For Chinese developers, Spring Festival would bear similar circumstance.
>>
>> On Mon, Apr 18, 2016 at 7:25 PM, Hyukjin Kwon 
>> wrote:
>>
>>> I also think this might not have to be closed only because it is
>>> inactive.
>>>
>>>
>>> How about closing issues after 30 days when a committer's comment is
>>> added at the last without responses from the author?
>>>
>>>
>>> IMHO, If the committers are not sure whether the patch would be useful,
>>> then I think they should leave some comments why they are not sure, not
>>> just ignoring.
>>>
>>> Or, simply they could ask the author to prove that the patch is useful
>>> or safe with some references and tests.
>>>
>>>
>>> I think it might be nicer than that users are supposed to keep pinging.
>>> **Personally**, apparently, I am sometimes a bit worried if pinging
>>> multiple times can be a bit annoying.
>>>
>>>
>>>
>>> 2016-04-19 9:56 GMT+09:00 Saisai Shao :
>>>
 It would be better to have a specific technical reason why this PR
 should be closed, either the implementation is not good or the problem is
 not valid, or something else. That will actually help the contributor to
 shape their codes and reopen the PR again. Otherwise reasons like "feel
 free to reopen for so-and-so reason" is actually discouraging and no
 difference than directly close 

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Nicholas Chammas
Relevant: https://github.com/databricks/spark-pr-dashboard/issues/1

A lot of this was discussed a while back when the PR Dashboard was first
introduced, and several times before and after that as well (e.g. August 2014).

If there is not enough momentum to build the tooling that people are
discussing here, then perhaps Reynold's suggestion is the most practical
one that is likely to see the light of day.

I think asking committers to be more active in commenting on PRs is
theoretically the correct thing to do, but impractical. I'm not a
committer, but I would guess that most of them are already way
overcommitted (ha!) and asking them to do more just won't yield results.

We've had several instances in the past where we all tried to rally
and be more proactive about giving feedback, closing PRs, and nudging
contributors who have gone silent. My observation is that the level of
energy required to "properly" curate PR activity in that way is simply not
sustainable. People can do it for a few weeks and then things revert to the
way they are now.

Perhaps the missing link that would make this sustainable is better
tooling. If you think so and can sling some Javascript, you might want to
contribute to the PR Dashboard.

Perhaps the missing link is something else: A different PR review process;
more committers; a higher barrier to contributing; a combination thereof;
etc...

Also relevant: http://danluu.com/discourage-oss/

By the way, some people noted that closing PRs may discourage contributors.
I think our open PR count alone is very discouraging. Under what
circumstances would you feel encouraged to open a PR against a project that
has hundreds of open PRs, some from many, many months ago?

Nick


On Mon, Apr 18, 2016 at 10:30 PM, Ted Yu wrote:

> During the months of November / December, the 30 day period should be
> relaxed.
>
> Some people(at least in US) may take extended vacation during that time.
>
> For Chinese developers, Spring Festival would bear similar circumstance.
>
> On Mon, Apr 18, 2016 at 7:25 PM, Hyukjin Kwon  wrote:
>
>> I also think this might not have to be closed only because it is
>> inactive.
>>
>>
>> How about closing issues after 30 days when a committer's comment is
>> added at the last without responses from the author?
>>
>>
>> IMHO, If the committers are not sure whether the patch would be useful,
>> then I think they should leave some comments why they are not sure, not
>> just ignoring.
>>
>> Or, simply they could ask the author to prove that the patch is useful
>> or safe with some references and tests.
>>
>>
>> I think it might be nicer than that users are supposed to keep pinging.
>> **Personally**, apparently, I am sometimes a bit worried if pinging
>> multiple times can be a bit annoying.
>>
>>
>>
>> 2016-04-19 9:56 GMT+09:00 Saisai Shao :
>>
>>> It would be better to have a specific technical reason why this PR
>>> should be closed, either the implementation is not good or the problem is
>>> not valid, or something else. That will actually help the contributor to
>>> shape their codes and reopen the PR again. Otherwise reasons like "feel
>>> free to reopen for so-and-so reason" is actually discouraging and no
>>> difference than directly close the PR.
>>>
>>> Just my two cents.
>>>
>>> Thanks
>>> Jerry
>>>
>>>
>>> On Tue, Apr 19, 2016 at 4:52 AM, Sean Busbey 
>>> wrote:
>>>
 Having a PR closed, especially if due to committers not having the
 bandwidth to check on things, will be very discouraging to new folks.
 Doubly so for those inexperienced with opensource. Even if the message
 says "feel free to reopen for so-and-so reason", new folks who lack
 confidence are going to see reopening as "pestering" and busy folks
 are going to see it as a clear indication that their work is not even
 valuable enough for a human to give a reason for closing. In either
 case, the cost of reopening is substantially higher than that button
 press.

 How about we start by keeping a report of "at-risk" PRs that have been
 stale for 30 days to make it easier for committers to look at the prs
 that have been long inactive?


 On Mon, Apr 18, 2016 at 2:52 PM, Reynold Xin 
 wrote:
 > The cost of "reopen" is close to zero, because it is just clicking a
 button.
 > I think you were referring to the cost of closing the pull request,
 and you
 > are assuming people look at the pull requests that have been inactive
 for a
 > long time. That seems equally likely (or 

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Ted Yu
During the months of November / December, the 30 day period should be
relaxed.

Some people (at least in the US) may take extended vacation during that time.

For Chinese developers, Spring Festival would be a similar circumstance.

On Mon, Apr 18, 2016 at 7:25 PM, Hyukjin Kwon  wrote:

> I also think this might not have to be closed only because it is inactive.
>
>
> How about closing issues after 30 days when a committer's comment is added
> at the last without responses from the author?
>
>
> IMHO, If the committers are not sure whether the patch would be useful,
> then I think they should leave some comments why they are not sure, not
> just ignoring.
>
> Or, simply they could ask the author to prove that the patch is useful or
> safe with some references and tests.
>
>
> I think it might be nicer than that users are supposed to keep pinging.
> **Personally**, apparently, I am sometimes a bit worried if pinging
> multiple times can be a bit annoying.
>
>
>
> 2016-04-19 9:56 GMT+09:00 Saisai Shao :
>
>> It would be better to have a specific technical reason why this PR should
>> be closed, either the implementation is not good or the problem is not
>> valid, or something else. That will actually help the contributor to shape
>> their codes and reopen the PR again. Otherwise reasons like "feel free
>> to reopen for so-and-so reason" is actually discouraging and no difference
>> than directly close the PR.
>>
>> Just my two cents.
>>
>> Thanks
>> Jerry
>>
>>
>> On Tue, Apr 19, 2016 at 4:52 AM, Sean Busbey  wrote:
>>
>>> Having a PR closed, especially if due to committers not having the
>>> bandwidth to check on things, will be very discouraging to new folks.
>>> Doubly so for those inexperienced with opensource. Even if the message
>>> says "feel free to reopen for so-and-so reason", new folks who lack
>>> confidence are going to see reopening as "pestering" and busy folks
>>> are going to see it as a clear indication that their work is not even
>>> valuable enough for a human to give a reason for closing. In either
>>> case, the cost of reopening is substantially higher than that button
>>> press.
>>>
>>> How about we start by keeping a report of "at-risk" PRs that have been
>>> stale for 30 days to make it easier for committers to look at the prs
>>> that have been long inactive?
>>>
>>>
>>> On Mon, Apr 18, 2016 at 2:52 PM, Reynold Xin 
>>> wrote:
>>> > The cost of "reopen" is close to zero, because it is just clicking a
>>> button.
>>> > I think you were referring to the cost of closing the pull request,
>>> and you
>>> > are assuming people look at the pull requests that have been inactive
>>> for a
>>> > long time. That seems equally likely (or unlikely) as committers
>>> looking at
>>> > the recently closed pull requests.
>>> >
>>> > In either case, most pull requests are scanned through by us when they
>>> are
>>> > first open, and if they are important enough, usually they get merged
>>> > quickly or a target version is set in JIRA. We can definitely improve
>>> that
>>> > by making it more explicit.
>>> >
>>> >
>>> >
>>> > On Mon, Apr 18, 2016 at 12:46 PM, Ted Yu  wrote:
>>> >>
>>> >> From committers' perspective, would they look at closed PRs ?
>>> >>
>>> >> If not, the cost is not close to zero.
>>> >> Meaning, some potentially useful PRs would never see the light of day.
>>> >>
>>> >> My two cents.
>>> >>
>>> >> On Mon, Apr 18, 2016 at 12:43 PM, Reynold Xin 
>>> wrote:
>>> >>>
>>> >>> Part of it is how difficult it is to automate this. We can build a
>>> >>> perfect engine with a lot of rules that understand everything. But
>>> the more
>>> >>> complicated rules we need, the more unlikely for any of these to
>>> happen. So
>>> >>> I'd rather do this and create a nice enough message to tell
>>> contributors
>>> >>> sometimes mistake happen but the cost to reopen is approximately
>>> zero (i.e.
>>> >>> click a button on the pull request).
>>> >>>
>>> >>>
>>> >>> On Mon, Apr 18, 2016 at 12:41 PM, Ted Yu 
>>> wrote:
>>> 
>>>  bq. close the ones where they don't respond for a week
>>> 
>>>  Does this imply that the script understands response from human ?
>>> 
>>>  Meaning, would the script use some regex which signifies that the
>>>  contributor is willing to close the PR ?
>>> 
>>>  If the contributor is willing to close, why wouldn't he / she do it
>>>  him/herself ?
>>> 
>>>  On Mon, Apr 18, 2016 at 12:33 PM, Holden Karau <
>>> hol...@pigscanfly.ca>
>>>  wrote:
>>> >
>>> > Personally I'd rather err on the side of keeping PRs open, but I
>>> > understand wanting to keep the open PRs limited to ones which have
>>> a
>>> > reasonable chance of being merged.
>>> >
>>> > What about if we filtered for non-mergeable PRs or instead left a
>>> > comment asking the author to respond 

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Hyukjin Kwon
I also think this might not have to be closed only because it is inactive.


How about closing issues after 30 days when a committer's comment is the
last one and there has been no response from the author?


IMHO, if the committers are not sure whether the patch would be useful,
then I think they should leave some comments on why they are not sure,
rather than just ignoring it.

Or they could simply ask the author to prove that the patch is useful or
safe with some references and tests.


I think it might be nicer than expecting users to keep pinging.
**Personally**, I am sometimes a bit worried that pinging
multiple times can be a bit annoying.



2016-04-19 9:56 GMT+09:00 Saisai Shao :

> It would be better to have a specific technical reason why this PR should
> be closed: either the implementation is not good, or the problem is not
> valid, or something else. That will actually help the contributor shape
> their code and reopen the PR. Otherwise, a reason like "feel free to
> reopen for so-and-so reason" is actually discouraging and no different
> from closing the PR outright.
>
> Just my two cents.
>
> Thanks
> Jerry
>
>
> On Tue, Apr 19, 2016 at 4:52 AM, Sean Busbey  wrote:
>
>> Having a PR closed, especially if due to committers not having the
>> bandwidth to check on things, will be very discouraging to new folks.
>> Doubly so for those inexperienced with open source. Even if the message
>> says "feel free to reopen for so-and-so reason", new folks who lack
>> confidence are going to see reopening as "pestering" and busy folks
>> are going to see it as a clear indication that their work is not even
>> valuable enough for a human to give a reason for closing. In either
>> case, the cost of reopening is substantially higher than that button
>> press.
>>
>> How about we start by keeping a report of "at-risk" PRs that have been
>> stale for 30 days to make it easier for committers to look at the PRs
>> that have been long inactive?
>>
>> On Mon, Apr 18, 2016 at 2:52 PM, Reynold Xin  wrote:
>> > The cost of "reopen" is close to zero, because it is just clicking a
>> button.
>> > I think you were referring to the cost of closing the pull request, and
>> you
>> > are assuming people look at the pull requests that have been inactive
>> for a
>> > long time. That seems equally likely (or unlikely) as committers
>> looking at
>> > the recently closed pull requests.
>> >
>> > In either case, most pull requests are scanned through by us when they
>> are
>> > first open, and if they are important enough, usually they get merged
>> > quickly or a target version is set in JIRA. We can definitely improve
>> that
>> > by making it more explicit.
>> >
>> >
>> >
>> > On Mon, Apr 18, 2016 at 12:46 PM, Ted Yu  wrote:
>> >>
>> >> From committers' perspective, would they look at closed PRs ?
>> >>
>> >> If not, the cost is not close to zero.
>> >> Meaning, some potentially useful PRs would never see the light of day.
>> >>
>> >> My two cents.
>> >>
>> >> On Mon, Apr 18, 2016 at 12:43 PM, Reynold Xin 
>> wrote:
>> >>>
>> >>> Part of it is how difficult it is to automate this. We can build a
>> >>> perfect engine with a lot of rules that understand everything. But
>> the more
>> >>> complicated rules we need, the more unlikely for any of these to
>> happen. So
>> >>> I'd rather do this and create a nice enough message to tell
>> contributors
>> >>> sometimes mistakes happen but the cost to reopen is approximately zero
>> (i.e.
>> >>> click a button on the pull request).
>> >>>
>> >>>
>> >>> On Mon, Apr 18, 2016 at 12:41 PM, Ted Yu  wrote:
>> 
>>  bq. close the ones where they don't respond for a week
>> 
>>  Does this imply that the script understands response from human ?
>> 
>>  Meaning, would the script use some regex which signifies that the
>>  contributor is willing to close the PR ?
>> 
>>  If the contributor is willing to close, why wouldn't he / she do it
>>  him/herself ?
>> 
>>  On Mon, Apr 18, 2016 at 12:33 PM, Holden Karau > >
>>  wrote:
>> >
>> > Personally I'd rather err on the side of keeping PRs open, but I
>> > understand wanting to keep the open PRs limited to ones which have a
>> > reasonable chance of being merged.
>> >
>> > What about if we filtered for non-mergeable PRs or instead left a
>> > comment asking the author to respond if they are still available to
>> move the
>> > PR forward - and close the ones where they don't respond for a week?
>> >
>> > Just a suggestion.
>> > On Monday, April 18, 2016, Ted Yu  wrote:
>> >>
>> >> I had one PR which got merged after 3 months.
>> >>
>> >> If the inactivity was due to contributor, I think it can be closed
>> >> after 30 days.
>> >> But if the inactivity was due 

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Saisai Shao
It would be better to have a specific technical reason why this PR should
be closed: either the implementation is not good, or the problem is not
valid, or something else. That will actually help the contributor shape
their code and reopen the PR. Otherwise, a reason like "feel free to
reopen for so-and-so reason" is actually discouraging and no different
from closing the PR outright.

Just my two cents.

Thanks
Jerry


On Tue, Apr 19, 2016 at 4:52 AM, Sean Busbey  wrote:

> Having a PR closed, especially if due to committers not having the
> bandwidth to check on things, will be very discouraging to new folks.
> Doubly so for those inexperienced with open source. Even if the message
> says "feel free to reopen for so-and-so reason", new folks who lack
> confidence are going to see reopening as "pestering" and busy folks
> are going to see it as a clear indication that their work is not even
> valuable enough for a human to give a reason for closing. In either
> case, the cost of reopening is substantially higher than that button
> press.
>
> How about we start by keeping a report of "at-risk" PRs that have been
> stale for 30 days to make it easier for committers to look at the PRs
> that have been long inactive?
>
> On Mon, Apr 18, 2016 at 2:52 PM, Reynold Xin  wrote:
> > The cost of "reopen" is close to zero, because it is just clicking a
> button.
> > I think you were referring to the cost of closing the pull request, and
> you
> > are assuming people look at the pull requests that have been inactive
> for a
> > long time. That seems equally likely (or unlikely) as committers looking
> at
> > the recently closed pull requests.
> >
> > In either case, most pull requests are scanned through by us when they
> are
> > first open, and if they are important enough, usually they get merged
> > quickly or a target version is set in JIRA. We can definitely improve
> that
> > by making it more explicit.
> >
> >
> >
> > On Mon, Apr 18, 2016 at 12:46 PM, Ted Yu  wrote:
> >>
> >> From committers' perspective, would they look at closed PRs ?
> >>
> >> If not, the cost is not close to zero.
> >> Meaning, some potentially useful PRs would never see the light of day.
> >>
> >> My two cents.
> >>
> >> On Mon, Apr 18, 2016 at 12:43 PM, Reynold Xin 
> wrote:
> >>>
> >>> Part of it is how difficult it is to automate this. We can build a
> >>> perfect engine with a lot of rules that understand everything. But the
> more
> >>> complicated rules we need, the more unlikely for any of these to
> happen. So
> >>> I'd rather do this and create a nice enough message to tell
> contributors
> >>> sometimes mistakes happen but the cost to reopen is approximately zero
> (i.e.
> >>> click a button on the pull request).
> >>>
> >>>
> >>> On Mon, Apr 18, 2016 at 12:41 PM, Ted Yu  wrote:
> 
>  bq. close the ones where they don't respond for a week
> 
>  Does this imply that the script understands response from human ?
> 
>  Meaning, would the script use some regex which signifies that the
>  contributor is willing to close the PR ?
> 
>  If the contributor is willing to close, why wouldn't he / she do it
>  him/herself ?
> 
>  On Mon, Apr 18, 2016 at 12:33 PM, Holden Karau 
>  wrote:
> >
> > Personally I'd rather err on the side of keeping PRs open, but I
> > understand wanting to keep the open PRs limited to ones which have a
> > reasonable chance of being merged.
> >
> > What about if we filtered for non-mergeable PRs or instead left a
> > comment asking the author to respond if they are still available to
> move the
> > PR forward - and close the ones where they don't respond for a week?
> >
> > Just a suggestion.
> > On Monday, April 18, 2016, Ted Yu  wrote:
> >>
> >> I had one PR which got merged after 3 months.
> >>
> >> If the inactivity was due to contributor, I think it can be closed
> >> after 30 days.
> >> But if the inactivity was due to lack of review, the PR should be
> kept
> >> open.
> >>
> >> On Mon, Apr 18, 2016 at 12:17 PM, Cody Koeninger <
> c...@koeninger.org>
> >> wrote:
> >>>
> >>> For what it's worth, I have definitely had PRs that sat inactive
> for
> >>> more than 30 days due to committers not having time to look at
> them,
> >>> but did eventually end up successfully being merged.
> >>>
> >>> I guess if this just ends up being a committer ping and reopening
> the
> >>> PR, it's fine, but I don't know if it really addresses the
> underlying
> >>> issue.
> >>>
> >>> On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin 
> >>> wrote:
> >>> > We have hit a new high in open pull requests: 469 today. While we
> >>> > can
> >>> > certainly get more review bandwidth, 

Re: YARN Shuffle service and its compatibility

2016-04-18 Thread Mark Grover
Thanks for responding, Reynold, Marcelo and Marcin.

>And I think that's really what Mark is proposing. Basically, "don't
>intentionally break backwards compatibility unless it's really
>required" (e.g. SPARK-12130). That would allow option B to work.

Yeah, that's exactly what Option B is proposing.

I also don't think it'd make a huge difference to go back to the full class
name, but I have explicitly added Lianhui, who worked on SPARK-12130, to
this thread so he can correct me if I am blatantly wrong.

And even then, we could keep the Spark1 and Spark2 shuffle services
compatible by mapping between the short and long names, or via an abstract
getBlockData implementation, if we decide it's necessary.

Mark

On Mon, Apr 18, 2016 at 3:23 PM, Marcelo Vanzin  wrote:

> On Mon, Apr 18, 2016 at 3:09 PM, Reynold Xin  wrote:
> > IIUC, the reason for that PR is that they found the string comparison to
> > increase the size in large shuffles. Maybe we should add the ability to
> > support the short name to Spark 1.6.2?
>
> Is that something that really yields noticeable gains in performance?
>
> If it is, it seems like it would be simple to allow executors to register
> with the full class name, and map the long names to short names in the
> shuffle service itself.
>
> You could even get fancy and have different ExecutorShuffleInfo
> implementations for each shuffle service, with an abstract
> "getBlockData" method that gets called instead of the current if/else
> in ExternalShuffleBlockResolver.java.
>
> --
> Marcelo
>


Re: YARN Shuffle service and its compatibility

2016-04-18 Thread Marcelo Vanzin
On Mon, Apr 18, 2016 at 3:09 PM, Reynold Xin  wrote:
> IIUC, the reason for that PR is that they found the string comparison to
> increase the size in large shuffles. Maybe we should add the ability to
> support the short name to Spark 1.6.2?

Is that something that really yields noticeable gains in performance?

If it is, it seems like it would be simple to allow executors to register
with the full class name, and map the long names to short names in the
shuffle service itself.

You could even get fancy and have different ExecutorShuffleInfo
implementations for each shuffle service, with an abstract
"getBlockData" method that gets called instead of the current if/else
in ExternalShuffleBlockResolver.java.
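
To illustrate the shape of that idea, here is a rough Scala sketch (the real
resolver and ExecutorShuffleInfo are Java classes under network/shuffle; the
trait, class, and method names and the file layout below are simplified
guesses for illustration, not the actual implementation):

// Hypothetical sketch: one resolver implementation per shuffle manager,
// plus a long-name -> short-name alias table so old executors that still
// register the full class name keep working.
import java.io.File

trait ShuffleInfoResolver {
  // Locate the file holding one shuffle block for a registered executor.
  def getBlockData(localDirs: Array[String], shuffleId: Int,
      mapId: Int, reduceId: Int): File
}

class SortShuffleResolver extends ShuffleInfoResolver {
  override def getBlockData(localDirs: Array[String], shuffleId: Int,
      mapId: Int, reduceId: Int): File =
    // Sort-based shuffle keeps one data file per map task; the reducer's
    // range comes from the accompanying index file (omitted here). The real
    // code also hashes the file name across the configured local dirs.
    new File(localDirs.head, s"shuffle_${shuffleId}_${mapId}_0.data")
}

class HashShuffleResolver extends ShuffleInfoResolver {
  override def getBlockData(localDirs: Array[String], shuffleId: Int,
      mapId: Int, reduceId: Int): File =
    // Hash-based shuffle wrote one file per (map, reduce) pair.
    new File(localDirs.head, s"shuffle_${shuffleId}_${mapId}_$reduceId")
}

object ShuffleInfoResolver {
  private val aliases = Map(
    "org.apache.spark.shuffle.sort.SortShuffleManager" -> "sort",
    "org.apache.spark.shuffle.hash.HashShuffleManager" -> "hash")

  // Accept either the short name or the full class name at registration time.
  def forManager(registeredName: String): ShuffleInfoResolver =
    aliases.getOrElse(registeredName, registeredName) match {
      case "sort" => new SortShuffleResolver
      case "hash" => new HashShuffleResolver
      case other  => throw new UnsupportedOperationException(
        s"Unsupported shuffle manager: $other")
    }
}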

-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: YARN Shuffle service and its compatibility

2016-04-18 Thread Reynold Xin
Got it. So Mark is pushing for "best-effort" support.

IIUC, the reason for that PR is that they found the string comparison to
increase the size in large shuffles. Maybe we should add the ability to
support the short name to Spark 1.6.2?


On Mon, Apr 18, 2016 at 3:05 PM, Marcelo Vanzin  wrote:

> On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin  wrote:
> > The bigger problem is that it is much easier to maintain backward
> > compatibility rather than dictating forward compatibility. For example,
> as
> > Marcin said, if we come up with a slightly different shuffle layout to
> > improve shuffle performance, we wouldn't be able to do that if we want to
> > allow Spark 1.6 shuffle service to read something generated by Spark 2.1.
>
> And I think that's really what Mark is proposing. Basically, "don't
> intentionally break backwards compatibility unless it's really
> required" (e.g. SPARK-12130). That would allow option B to work.
>
> If a new shuffle manager is created, then neither option A nor option
> B would really work. Moving all the shuffle-related classes to a
> different package, to support option A, would be really messy. At that
> point, you're better off maintaining the new shuffle service outside
> of YARN, which is rather messy too.
>
> The best would be if the shuffle service didn't really need to
> understand the shuffle manager, and could find files regardless; I'm
> not sure how feasible that is, though.
>
> --
> Marcelo
>


Re: YARN Shuffle service and its compatibility

2016-04-18 Thread Marcelo Vanzin
On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin  wrote:
> The bigger problem is that it is much easier to maintain backward
> compatibility rather than dictating forward compatibility. For example, as
> Marcin said, if we come up with a slightly different shuffle layout to
> improve shuffle performance, we wouldn't be able to do that if we want to
> allow Spark 1.6 shuffle service to read something generated by Spark 2.1.

And I think that's really what Mark is proposing. Basically, "don't
intentionally break backwards compatibility unless it's really
required" (e.g. SPARK-12130). That would allow option B to work.

If a new shuffle manager is created, then neither option A nor option
B would really work. Moving all the shuffle-related classes to a
different package, to support option A, would be really messy. At that
point, you're better off maintaining the new shuffle service outside
of YARN, which is rather messy too.

The best would be if the shuffle service didn't really need to
understand the shuffle manager, and could find files regardless; I'm
not sure how feasible that is, though.

-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: YARN Shuffle service and its compatibility

2016-04-18 Thread Reynold Xin
Yea I re-read the email again. It'd work in this case.

The bigger problem is that it is much easier to maintain backward
compatibility rather than dictating forward compatibility. For example, as
Marcin said, if we come up with a slightly different shuffle layout to
improve shuffle performance, we wouldn't be able to do that if we want to
allow Spark 1.6 shuffle service to read something generated by Spark 2.1.




On Mon, Apr 18, 2016 at 1:59 PM, Marcelo Vanzin  wrote:

> On Mon, Apr 18, 2016 at 1:53 PM, Reynold Xin  wrote:
> > That's not the only one. For example, the hash shuffle manager has been
> off
> > by default since Spark 1.2, and we'd like to remove it in 2.0:
> > https://github.com/apache/spark/pull/12423
>
> If I understand things correctly, Mark's option B (running a single
> shuffle service, the one from the older Spark release) would still
> work, wouldn't it?
>
> You'd run into problems when Spark adds a new shuffle manager that is
> not known to the old shuffle service, though. Perhaps at that time we
> should investigate making the shuffle service more agnostic to the
> app's shuffle manager.
>
> --
> Marcelo
>


Re: YARN Shuffle service and its compatibility

2016-04-18 Thread Marcelo Vanzin
On Mon, Apr 18, 2016 at 1:53 PM, Reynold Xin  wrote:
> That's not the only one. For example, the hash shuffle manager has been off
> by default since Spark 1.2, and we'd like to remove it in 2.0:
> https://github.com/apache/spark/pull/12423

If I understand things correctly, Mark's option B (running a single
shuffle service, the one from the older Spark release) would still
work, wouldn't it?

You'd run into problems when Spark adds a new shuffle manager that is
not known to the old shuffle service, though. Perhaps at that time we
should investigate making the shuffle service more agnostic to the
app's shuffle manager.

-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: YARN Shuffle service and its compatibility

2016-04-18 Thread Marcin Tustin
I'm good with option B at least until it blocks something utterly wonderful
(like shuffles are 10x faster).

On Mon, Apr 18, 2016 at 4:51 PM, Mark Grover  wrote:

> Hi all,
> If you don't use Spark on YARN, you probably don't need to read further.
>
> Here's the *user scenario*:
> There are going to be folks who may be interested in running two versions
> of Spark (say Spark 1.6.x and Spark 2.x) on the same YARN cluster.
>
> And, here's the *problem*:
> That's all fine, should work well. However, there's one problem that
> relates to the YARN shuffle service
> .
> This service is run by the YARN Node Managers on all nodes of the cluster
> that have YARN NMs as an auxiliary service
> 
> .
>
> The key question here is -
> Option A:  Should the user be running 2 shuffle services - one for Spark
> 1.6.x and one for Spark 2.x?
> OR
> Option B: Should the user be running only 1 shuffle service that services
> both the Spark 1.6.x and Spark 2.x installs? This will likely have to be
> the Spark 1.6.x shuffle service (while ensuring it's forward compatible
> with Spark 2.x).
>
> *Discussion of above options:*
> A few things to note about the shuffle service:
> 1. Looking at the commit history, there aren't a whole lot of changes
> that go into the shuffle service, rarely ones that are incompatible.
> There's only one incompatible change
>  that's been made to
> the shuffle service, as far as I can tell, and that too, seems fairly
> cosmetic.
> 2. Shuffle services for 1.6.x and 2.x serve very similar purpose (to
> provide shuffle blocks) and can easily be just one service that does it,
> even on a YARN cluster that runs both Spark 1.x and Spark 2.x.
> 3. The shuffle service is not version-spaced. This means that, the way the
> code is currently, if we were to drop the jars for Spark1 and Spark2's
> shuffle service in YARN NM's classpath, YARN NM won't be able to start both
> services. It would arbitrarily pick one service to start (based on what
> appears on the classpath first). Also, the service name is hardcoded
> 
> in Spark code and that name is also not version-spaced.
>
> Option A is arguably cleaner but it's more operational overhead and some
> code relocation/shading/version-spacing/name-spacing to make it work (due
> to #3 above), potentially to not a whole lot of value (given #2 above).
>
> Option B is simpler, lean and more operationally efficient. However, that
> requires that we as a community, keep Spark 1's shuffle service forward
> compatible with Spark 2 i.e. don't break compatibility between Spark1's and
> Spark2's shuffle service. We could even add a test (mima?) to assert that
> during the life time of Spark2. If we do go down that way, we should revert
> SPARK-12130  - the
> only backwards incompatible change made to Spark2 shuffle service so far.
>
> My personal vote goes towards Option B and I think reverting SPARK-12130
> is ok. What do others think?
>
> Thanks!
> Mark
>
>




Re: YARN Shuffle service and its compatibility

2016-04-18 Thread Reynold Xin
That's not the only one. For example, the hash shuffle manager has been off
by default since Spark 1.2, and we'd like to remove it in 2.0:
https://github.com/apache/spark/pull/12423

How difficult is it to just change the package name to, say, v2?



On Mon, Apr 18, 2016 at 1:51 PM, Mark Grover  wrote:

> Hi all,
> If you don't use Spark on YARN, you probably don't need to read further.
>
> Here's the *user scenario*:
> There are going to be folks who may be interested in running two versions
> of Spark (say Spark 1.6.x and Spark 2.x) on the same YARN cluster.
>
> And, here's the *problem*:
> That's all fine, should work well. However, there's one problem that
> relates to the YARN shuffle service
> .
> This service is run by the YARN Node Managers on all nodes of the cluster
> that have YARN NMs as an auxiliary service
> 
> .
>
> The key question here is -
> Option A:  Should the user be running 2 shuffle services - one for Spark
> 1.6.x and one for Spark 2.x?
> OR
> Option B: Should the user be running only 1 shuffle service that services
> both the Spark 1.6.x and Spark 2.x installs? This will likely have to be
> the Spark 1.6.x shuffle service (while ensuring it's forward compatible
> with Spark 2.x).
>
> *Discussion of above options:*
> A few things to note about the shuffle service:
> 1. Looking at the commit history, there aren't a whole lot of changes
> that go into the shuffle service, rarely ones that are incompatible.
> There's only one incompatible change
>  that's been made to
> the shuffle service, as far as I can tell, and that too, seems fairly
> cosmetic.
> 2. Shuffle services for 1.6.x and 2.x serve very similar purpose (to
> provide shuffle blocks) and can easily be just one service that does it,
> even on a YARN cluster that runs both Spark 1.x and Spark 2.x.
> 3. The shuffle service is not version-spaced. This means that, the way the
> code is currently, if we were to drop the jars for Spark1 and Spark2's
> shuffle service in YARN NM's classpath, YARN NM won't be able to start both
> services. It would arbitrarily pick one service to start (based on what
> appears on the classpath first). Also, the service name is hardcoded
> 
> in Spark code and that name is also not version-spaced.
>
> Option A is arguably cleaner but it's more operational overhead and some
> code relocation/shading/version-spacing/name-spacing to make it work (due
> to #3 above), potentially to not a whole lot of value (given #2 above).
>
> Option B is simpler, lean and more operationally efficient. However, that
> requires that we as a community, keep Spark 1's shuffle service forward
> compatible with Spark 2 i.e. don't break compatibility between Spark1's and
> Spark2's shuffle service. We could even add a test (mima?) to assert that
> during the life time of Spark2. If we do go down that way, we should revert
> SPARK-12130  - the
> only backwards incompatible change made to Spark2 shuffle service so far.
>
> My personal vote goes towards Option B and I think reverting SPARK-12130
> is ok. What do others think?
>
> Thanks!
> Mark
>
>


Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Sean Busbey
Having a PR closed, especially if due to committers not having the
bandwidth to check on things, will be very discouraging to new folks.
Doubly so for those inexperienced with open source. Even if the message
says "feel free to reopen for so-and-so reason", new folks who lack
confidence are going to see reopening as "pestering" and busy folks
are going to see it as a clear indication that their work is not even
valuable enough for a human to give a reason for closing. In either
case, the cost of reopening is substantially higher than that button
press.

How about we start by keeping a report of "at-risk" PRs that have been
stale for 30 days to make it easier for committers to look at the PRs
that have been long inactive?
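
For example, a small script against the GitHub search API could produce such
a report. A sketch only: it assumes a JSON library such as ujson is on the
classpath, and it ignores pagination, authentication, and rate limits.

import java.net.{URL, URLEncoder}
import scala.io.Source

object AtRiskPrReport {
  def main(args: Array[String]): Unit = {
    val cutoff = java.time.LocalDate.now().minusDays(30)
    // Open apache/spark PRs whose last update is more than 30 days old.
    val q = URLEncoder.encode(
      s"repo:apache/spark is:pr is:open updated:<$cutoff", "UTF-8")
    val url = new URL(
      s"https://api.github.com/search/issues?q=$q&sort=updated&order=asc&per_page=100")
    val conn = url.openConnection()
    // GitHub's API rejects requests without a User-Agent header.
    conn.setRequestProperty("User-Agent", "spark-at-risk-pr-report")
    val body = Source.fromInputStream(conn.getInputStream).mkString
    // One line per at-risk PR: last update, author, title, URL.
    for (pr <- ujson.read(body)("items").arr) {
      println(s"${pr("updated_at").str}  ${pr("user")("login").str}  " +
        s"${pr("title").str}  ${pr("html_url").str}")
    }
  }
}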

On Mon, Apr 18, 2016 at 2:52 PM, Reynold Xin  wrote:
> The cost of "reopen" is close to zero, because it is just clicking a button.
> I think you were referring to the cost of closing the pull request, and you
> are assuming people look at the pull requests that have been inactive for a
> long time. That seems equally likely (or unlikely) as committers looking at
> the recently closed pull requests.
>
> In either case, most pull requests are scanned through by us when they are
> first open, and if they are important enough, usually they get merged
> quickly or a target version is set in JIRA. We can definitely improve that
> by making it more explicit.
>
>
>
> On Mon, Apr 18, 2016 at 12:46 PM, Ted Yu  wrote:
>>
>> From committers' perspective, would they look at closed PRs ?
>>
>> If not, the cost is not close to zero.
>> Meaning, some potentially useful PRs would never see the light of day.
>>
>> My two cents.
>>
>> On Mon, Apr 18, 2016 at 12:43 PM, Reynold Xin  wrote:
>>>
>>> Part of it is how difficult it is to automate this. We can build a
>>> perfect engine with a lot of rules that understand everything. But the more
>>> complicated rules we need, the more unlikely for any of these to happen. So
>>> I'd rather do this and create a nice enough message to tell contributors
>>> sometimes mistakes happen but the cost to reopen is approximately zero (i.e.
>>> click a button on the pull request).
>>>
>>>
>>> On Mon, Apr 18, 2016 at 12:41 PM, Ted Yu  wrote:

 bq. close the ones where they don't respond for a week

 Does this imply that the script understands response from human ?

 Meaning, would the script use some regex which signifies that the
 contributor is willing to close the PR ?

 If the contributor is willing to close, why wouldn't he / she do it
 him/herself ?

 On Mon, Apr 18, 2016 at 12:33 PM, Holden Karau 
 wrote:
>
> Personally I'd rather err on the side of keeping PRs open, but I
> understand wanting to keep the open PRs limited to ones which have a
> reasonable chance of being merged.
>
> What about if we filtered for non-mergeable PRs or instead left a
> comment asking the author to respond if they are still available to move 
> the
> PR forward - and close the ones where they don't respond for a week?
>
> Just a suggestion.
> On Monday, April 18, 2016, Ted Yu  wrote:
>>
>> I had one PR which got merged after 3 months.
>>
>> If the inactivity was due to contributor, I think it can be closed
>> after 30 days.
>> But if the inactivity was due to lack of review, the PR should be kept
>> open.
>>
>> On Mon, Apr 18, 2016 at 12:17 PM, Cody Koeninger 
>> wrote:
>>>
>>> For what it's worth, I have definitely had PRs that sat inactive for
>>> more than 30 days due to committers not having time to look at them,
>>> but did eventually end up successfully being merged.
>>>
>>> I guess if this just ends up being a committer ping and reopening the
>>> PR, it's fine, but I don't know if it really addresses the underlying
>>> issue.
>>>
>>> On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin 
>>> wrote:
>>> > We have hit a new high in open pull requests: 469 today. While we
>>> > can
>>> > certainly get more review bandwidth, many of these are old and
>>> > still open
>>> > for other reasons. Some are stale because the original authors have
>>> > become
>>> > busy and inactive, and some others are stale because the committers
>>> > are not
>>> > sure whether the patch would be useful, but have not rejected the
>>> > patch
>>> > explicitly. We can cut down the signal to noise ratio by closing
>>> > pull
>>> > requests that have been inactive for greater than 30 days, with a
>>> > nice
>>> > message. I just checked and this would close ~ half of the pull
>>> > requests.
>>> >
>>> > For example:
>>> >
>>> > "Thank you for creating this pull request. Since this pull request

YARN Shuffle service and its compatibility

2016-04-18 Thread Mark Grover
Hi all,
If you don't use Spark on YARN, you probably don't need to read further.

Here's the *user scenario*:
There are going to be folks who may be interested in running two versions
of Spark (say Spark 1.6.x and Spark 2.x) on the same YARN cluster.

And, here's the *problem*:
That's all fine, and should work well. However, there's one problem that
relates to the YARN shuffle service.
This service is run by the YARN Node Managers, as an auxiliary service, on
all nodes of the cluster that have YARN NMs.

The key question here is -
Option A:  Should the user be running 2 shuffle services - one for Spark
1.6.x and one for Spark 2.x?
OR
Option B: Should the user be running only 1 shuffle service that services
both the Spark 1.6.x and Spark 2.x installs? This will likely have to be
the Spark 1.6.x shuffle service (while ensuring it's forward compatible
with Spark 2.x).

*Discussion of above options:*
A few things to note about the shuffle service:
1. Looking at the commit history, there aren't a whole lot of changes
that go into the shuffle service, and rarely ones that are incompatible.
There's only one incompatible change that's been made to the
shuffle service, as far as I can tell, and even that seems fairly cosmetic.
2. Shuffle services for 1.6.x and 2.x serve very similar purpose (to
provide shuffle blocks) and can easily be just one service that does it,
even on a YARN cluster that runs both Spark 1.x and Spark 2.x.
3. The shuffle service is not version-spaced. This means that, the way the
code is currently, if we were to drop the jars for Spark1 and Spark2's
shuffle service in YARN NM's classpath, YARN NM won't be able to start both
services. It would arbitrarily pick one service to start (based on what
appears on the classpath first). Also, the service name is hardcoded
in Spark code, and that name is also not version-spaced.

Option A is arguably cleaner, but it means more operational overhead and some
code relocation/shading/version-spacing/name-spacing to make it work (due
to #3 above), potentially for not a whole lot of value (given #2 above).

Option B is simpler, leaner, and more operationally efficient. However, it
requires that we, as a community, keep Spark 1's shuffle service forward
compatible with Spark 2, i.e., not break compatibility between Spark1's and
Spark2's shuffle services. We could even add a test (MiMa?) to assert that
during the lifetime of Spark2. If we do go down that way, we should revert
SPARK-12130 - the only
backwards-incompatible change made to the Spark2 shuffle service so far.

My personal vote goes towards Option B and I think reverting SPARK-12130 is
ok. What do others think?

Thanks!
Mark


Re: more uniform exception handling?

2016-04-18 Thread Evan Chan
+1000.

Especially if the UI can help correlate exceptions, and we can reduce
some exceptions.

There are some exceptions, such as the nasty ClassNotFoundException, which
are very common in practice and which most folks end up spending tons
of time debugging.


On Mon, Apr 18, 2016 at 12:16 PM, Reynold Xin  wrote:
> Josh's pull request on rpc exception handling got me to think ...
>
> In my experience, there have been a few things related to exceptions that
> created a lot of trouble for us in production debugging:
>
> 1. Some exception is thrown, but is caught by some try/catch that does not
> do any logging nor rethrow.
> 2. Some exception is thrown, but is caught by some try/catch that does not
> do any logging but does rethrow. However, the original exception is now masked.
> 3. Multiple exceptions are logged at different places close to each other,
> but we don't know whether they are caused by the same problem or not.
>
>
> To mitigate some of the above, here's an idea ...
>
> (1) Create a common root class for all the exceptions (e.g. call it
> SparkException) used in Spark. We should make sure every time we catch an
> exception from a 3rd party library, we rethrow them as SparkException (a lot
> of places already do that). In SparkException's constructor, log the
> exception and the stacktrace.
>
> (2) SparkException has a monotonically increasing ID, and this ID appears in
> the exception error message (say at the end).
>
>
> I think (1) will eliminate most of the cases that an exception gets
> swallowed. The main downside I can think of is we might log an exception
> multiple times. However, I'd argue exceptions should be rare, and it is not
> that big of a deal to log them twice or three times. The unique ID (2) can
> help us correlate exceptions if they appear multiple times.
>
> Thoughts?
>
>
>
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Reynold Xin
The cost of "reopen" is close to zero, because it is just clicking a
button. I think you were referring to the cost of closing the pull request,
and you are assuming people look at the pull requests that have been
inactive for a long time. That seems equally likely (or unlikely) as
committers looking at the recently closed pull requests.

In either case, most pull requests are scanned through by us when they are
first open, and if they are important enough, usually they get merged
quickly or a target version is set in JIRA. We can definitely improve that
by making it more explicit.



On Mon, Apr 18, 2016 at 12:46 PM, Ted Yu  wrote:

> From committers' perspective, would they look at closed PRs ?
>
> If not, the cost is not close to zero.
> Meaning, some potentially useful PRs would never see the light of day.
>
> My two cents.
>
> On Mon, Apr 18, 2016 at 12:43 PM, Reynold Xin  wrote:
>
>> Part of it is how difficult it is to automate this. We can build a
>> perfect engine with a lot of rules that understand everything. But the more
>> complicated rules we need, the more unlikely for any of these to happen. So
>> I'd rather do this and create a nice enough message to tell contributors
>> sometimes mistakes happen but the cost to reopen is approximately zero (i.e.
>> click a button on the pull request).
>>
>>
>> On Mon, Apr 18, 2016 at 12:41 PM, Ted Yu  wrote:
>>
>>> bq. close the ones where they don't respond for a week
>>>
>>> Does this imply that the script understands response from human ?
>>>
>>> Meaning, would the script use some regex which signifies that the
>>> contributor is willing to close the PR ?
>>>
>>> If the contributor is willing to close, why wouldn't he / she do it
>>> him/herself ?
>>>
>>> On Mon, Apr 18, 2016 at 12:33 PM, Holden Karau 
>>> wrote:
>>>
 Personally I'd rather err on the side of keeping PRs open, but I
 understand wanting to keep the open PRs limited to ones which have a
 reasonable chance of being merged.

 What about if we filtered for non-mergeable PRs or instead left a
 comment asking the author to respond if they are still available to move
 the PR forward - and close the ones where they don't respond for a week?

 Just a suggestion.
 On Monday, April 18, 2016, Ted Yu  wrote:

> I had one PR which got merged after 3 months.
>
> If the inactivity was due to contributor, I think it can be closed
> after 30 days.
> But if the inactivity was due to lack of review, the PR should be kept
> open.
>
> On Mon, Apr 18, 2016 at 12:17 PM, Cody Koeninger 
> wrote:
>
>> For what it's worth, I have definitely had PRs that sat inactive for
>> more than 30 days due to committers not having time to look at them,
>> but did eventually end up successfully being merged.
>>
>> I guess if this just ends up being a committer ping and reopening the
>> PR, it's fine, but I don't know if it really addresses the underlying
>> issue.
>>
>> On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin 
>> wrote:
>> > We have hit a new high in open pull requests: 469 today. While we
>> can
>> > certainly get more review bandwidth, many of these are old and
>> still open
>> > for other reasons. Some are stale because the original authors have
>> become
>> > busy and inactive, and some others are stale because the committers
>> are not
>> > sure whether the patch would be useful, but have not rejected the
>> patch
>> > explicitly. We can cut down the signal to noise ratio by closing
>> pull
>> > requests that have been inactive for greater than 30 days, with a
>> nice
>> > message. I just checked and this would close ~ half of the pull
>> requests.
>> >
>> > For example:
>> >
>> > "Thank you for creating this pull request. Since this pull request
>> has been
>> > inactive for 30 days, we are automatically closing it. Closing the
>> pull
>> > request does not remove it from history and will retain all the
>> diff and
>> > review comments. If you have the bandwidth and would like to
>> continue
>> > pushing this forward, please reopen it. Thanks again!"
>> >
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>

 --
 Cell : 425-233-8271
 Twitter: https://twitter.com/holdenkarau


>>>
>>
>


Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Ted Yu
>From committers' perspective, would they look at closed PRs ?

If not, the cost is not close to zero.
Meaning, some potentially useful PRs would never see the light of day.

My two cents.

On Mon, Apr 18, 2016 at 12:43 PM, Reynold Xin  wrote:

> Part of it is how difficult it is to automate this. We can build a perfect
> engine with a lot of rules that understand everything. But the more
> complicated rules we need, the more unlikely for any of these to happen. So
> I'd rather do this and create a nice enough message to tell contributors
> sometimes mistakes happen but the cost to reopen is approximately zero (i.e.
> click a button on the pull request).
>
>
> On Mon, Apr 18, 2016 at 12:41 PM, Ted Yu  wrote:
>
>> bq. close the ones where they don't respond for a week
>>
>> Does this imply that the script understands response from human ?
>>
>> Meaning, would the script use some regex which signifies that the
>> contributor is willing to close the PR ?
>>
>> If the contributor is willing to close, why wouldn't he / she do it
>> him/herself ?
>>
>> On Mon, Apr 18, 2016 at 12:33 PM, Holden Karau 
>> wrote:
>>
>>> Personally I'd rather err on the side of keeping PRs open, but I
>>> understand wanting to keep the open PRs limited to ones which have a
>>> reasonable chance of being merged.
>>>
>>> What about if we filtered for non-mergeable PRs or instead left a
>>> comment asking the author to respond if they are still available to move
>>> the PR forward - and close the ones where they don't respond for a week?
>>>
>>> Just a suggestion.
>>> On Monday, April 18, 2016, Ted Yu  wrote:
>>>
 I had one PR which got merged after 3 months.

 If the inactivity was due to contributor, I think it can be closed
 after 30 days.
 But if the inactivity was due to lack of review, the PR should be kept
 open.

 On Mon, Apr 18, 2016 at 12:17 PM, Cody Koeninger 
 wrote:

> For what it's worth, I have definitely had PRs that sat inactive for
> more than 30 days due to committers not having time to look at them,
> but did eventually end up successfully being merged.
>
> I guess if this just ends up being a committer ping and reopening the
> PR, it's fine, but I don't know if it really addresses the underlying
> issue.
>
> On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin 
> wrote:
> > We have hit a new high in open pull requests: 469 today. While we can
> > certainly get more review bandwidth, many of these are old and still
> open
> > for other reasons. Some are stale because the original authors have
> become
> > busy and inactive, and some others are stale because the committers
> are not
> > sure whether the patch would be useful, but have not rejected the
> patch
> > explicitly. We can cut down the signal to noise ratio by closing pull
> > requests that have been inactive for greater than 30 days, with a
> nice
> > message. I just checked and this would close ~ half of the pull
> requests.
> >
> > For example:
> >
> > "Thank you for creating this pull request. Since this pull request
> has been
> > inactive for 30 days, we are automatically closing it. Closing the
> pull
> > request does not remove it from history and will retain all the diff
> and
> > review comments. If you have the bandwidth and would like to continue
> > pushing this forward, please reopen it. Thanks again!"
> >
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>

>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>>
>>
>


Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Reynold Xin
Part of it is how difficult it is to automate this. We can build a perfect
engine with a lot of rules that understand everything, but the more
complicated the rules we need, the less likely any of this is to happen. So
I'd rather do this and create a nice enough message to tell contributors that
sometimes mistakes happen, but the cost to reopen is approximately zero (i.e.
click a button on the pull request).


On Mon, Apr 18, 2016 at 12:41 PM, Ted Yu  wrote:

> bq. close the ones where they don't respond for a week
>
> Does this imply that the script understands response from human ?
>
> Meaning, would the script use some regex which signifies that the
> contributor is willing to close the PR ?
>
> If the contributor is willing to close, why wouldn't he / she do it
> him/herself ?
>
> On Mon, Apr 18, 2016 at 12:33 PM, Holden Karau 
> wrote:
>
>> Personally I'd rather err on the side of keeping PRs open, but I
>> understand wanting to keep the open PRs limited to ones which have a
>> reasonable chance of being merged.
>>
>> What about if we filtered for non-mergeable PRs or instead left a comment
>> asking the author to respond if they are still available to move the PR
>> forward - and close the ones where they don't respond for a week?
>>
>> Just a suggestion.
>> On Monday, April 18, 2016, Ted Yu  wrote:
>>
>>> I had one PR which got merged after 3 months.
>>>
>>> If the inactivity was due to contributor, I think it can be closed after
>>> 30 days.
>>> But if the inactivity was due to lack of review, the PR should be kept
>>> open.
>>>
>>> On Mon, Apr 18, 2016 at 12:17 PM, Cody Koeninger 
>>> wrote:
>>>
 For what it's worth, I have definitely had PRs that sat inactive for
 more than 30 days due to committers not having time to look at them,
 but did eventually end up successfully being merged.

 I guess if this just ends up being a committer ping and reopening the
 PR, it's fine, but I don't know if it really addresses the underlying
 issue.

 On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin 
 wrote:
 > We have hit a new high in open pull requests: 469 today. While we can
 > certainly get more review bandwidth, many of these are old and still
 open
 > for other reasons. Some are stale because the original authors have
 become
 > busy and inactive, and some others are stale because the committers
 are not
 > sure whether the patch would be useful, but have not rejected the
 patch
 > explicitly. We can cut down the signal to noise ratio by closing pull
 > requests that have been inactive for greater than 30 days, with a nice
 > message. I just checked and this would close ~ half of the pull
 requests.
 >
 > For example:
 >
 > "Thank you for creating this pull request. Since this pull request
 has been
 > inactive for 30 days, we are automatically closing it. Closing the
 pull
 > request does not remove it from history and will retain all the diff
 and
 > review comments. If you have the bandwidth and would like to continue
 > pushing this forward, please reopen it. Thanks again!"
 >
 >

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


>>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>>
>


Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Ted Yu
bq. close the ones where they don't respond for a week

Does this imply that the script understands response from human ?

Meaning, would the script use some regex which signifies that the
contributor is willing to close the PR ?

If the contributor is willing to close, why wouldn't he / she do it
him/herself ?

On Mon, Apr 18, 2016 at 12:33 PM, Holden Karau  wrote:

> Personally I'd rather err on the side of keeping PRs open, but I
> understand wanting to keep the open PRs limited to ones which have a
> reasonable chance of being merged.
>
> What about if we filtered for non-mergeable PRs or instead left a comment
> asking the author to respond if they are still available to move the PR
> forward - and close the ones where they don't respond for a week?
>
> Just a suggestion.
> On Monday, April 18, 2016, Ted Yu  wrote:
>
>> I had one PR which got merged after 3 months.
>>
>> If the inactivity was due to contributor, I think it can be closed after
>> 30 days.
>> But if the inactivity was due to lack of review, the PR should be kept
>> open.
>>
>> On Mon, Apr 18, 2016 at 12:17 PM, Cody Koeninger 
>> wrote:
>>
>>> For what it's worth, I have definitely had PRs that sat inactive for
>>> more than 30 days due to committers not having time to look at them,
>>> but did eventually end up successfully being merged.
>>>
>>> I guess if this just ends up being a committer ping and reopening the
>>> PR, it's fine, but I don't know if it really addresses the underlying
>>> issue.
>>>
>>> On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin 
>>> wrote:
>>> > We have hit a new high in open pull requests: 469 today. While we can
>>> > certainly get more review bandwidth, many of these are old and still
>>> open
>>> > for other reasons. Some are stale because the original authors have
>>> become
>>> > busy and inactive, and some others are stale because the committers
>>> are not
>>> > sure whether the patch would be useful, but have not rejected the patch
>>> > explicitly. We can cut down the signal to noise ratio by closing pull
>>> > requests that have been inactive for greater than 30 days, with a nice
>>> > message. I just checked and this would close ~ half of the pull
>>> requests.
>>> >
>>> > For example:
>>> >
>>> > "Thank you for creating this pull request. Since this pull request has
>>> been
>>> > inactive for 30 days, we are automatically closing it. Closing the pull
>>> > request does not remove it from history and will retain all the diff
>>> and
>>> > review comments. If you have the bandwidth and would like to continue
>>> > pushing this forward, please reopen it. Thanks again!"
>>> >
>>> >
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
>


Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Marcin Tustin
+1 and at the same time maybe surface a report to this list of PRs which
need committer action and have only had submitters responding to pings in
the last 30 days?

On Mon, Apr 18, 2016 at 3:33 PM, Holden Karau  wrote:

> Personally I'd rather err on the side of keeping PRs open, but I
> understand wanting to keep the open PRs limited to ones which have a
> reasonable chance of being merged.
>
> What about if we filtered for non-mergeable PRs or instead left a comment
> asking the author to respond if they are still available to move the PR
> forward - and close the ones where they don't respond for a week?
>
> Just a suggestion.
>
> On Monday, April 18, 2016, Ted Yu  wrote:
>
>> I had one PR which got merged after 3 months.
>>
>> If the inactivity was due to contributor, I think it can be closed after
>> 30 days.
>> But if the inactivity was due to lack of review, the PR should be kept
>> open.
>>
>> On Mon, Apr 18, 2016 at 12:17 PM, Cody Koeninger 
>> wrote:
>>
>>> For what it's worth, I have definitely had PRs that sat inactive for
>>> more than 30 days due to committers not having time to look at them,
>>> but did eventually end up successfully being merged.
>>>
>>> I guess if this just ends up being a committer ping and reopening the
>>> PR, it's fine, but I don't know if it really addresses the underlying
>>> issue.
>>>
>>> On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin 
>>> wrote:
>>> > We have hit a new high in open pull requests: 469 today. While we can
>>> > certainly get more review bandwidth, many of these are old and still
>>> open
>>> > for other reasons. Some are stale because the original authors have
>>> become
>>> > busy and inactive, and some others are stale because the committers
>>> are not
>>> > sure whether the patch would be useful, but have not rejected the patch
>>> > explicitly. We can cut down the signal to noise ratio by closing pull
>>> > requests that have been inactive for greater than 30 days, with a nice
>>> > message. I just checked and this would close ~ half of the pull
>>> requests.
>>> >
>>> > For example:
>>> >
>>> > "Thank you for creating this pull request. Since this pull request has
>>> been
>>> > inactive for 30 days, we are automatically closing it. Closing the pull
>>> > request does not remove it from history and will retain all the diff
>>> and
>>> > review comments. If you have the bandwidth and would like to continue
>>> > pushing this forward, please reopen it. Thanks again!"
>>> >
>>> >
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
>




Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Holden Karau
Personally I'd rather err on the side of keeping PRs open, but I understand
wanting to keep the open PRs limited to ones which have a reasonable chance
of being merged.

What about if we filtered for non-mergeable PRs or instead left a comment
asking the author to respond if they are still available to move the PR
forward - and close the ones where they don't respond for a week?

Just a suggestion.
On Monday, April 18, 2016, Ted Yu  wrote:

> I had one PR which got merged after 3 months.
>
> If the inactivity was due to contributor, I think it can be closed after
> 30 days.
> But if the inactivity was due to lack of review, the PR should be kept
> open.
>
> On Mon, Apr 18, 2016 at 12:17 PM, Cody Koeninger  > wrote:
>
>> For what it's worth, I have definitely had PRs that sat inactive for
>> more than 30 days due to committers not having time to look at them,
>> but did eventually end up successfully being merged.
>>
>> I guess if this just ends up being a committer ping and reopening the
>> PR, it's fine, but I don't know if it really addresses the underlying
>> issue.
>>
>> On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin > > wrote:
>> > We have hit a new high in open pull requests: 469 today. While we can
>> > certainly get more review bandwidth, many of these are old and still
>> open
>> > for other reasons. Some are stale because the original authors have
>> become
>> > busy and inactive, and some others are stale because the committers are
>> not
>> > sure whether the patch would be useful, but have not rejected the patch
>> > explicitly. We can cut down the signal to noise ratio by closing pull
>> > requests that have been inactive for greater than 30 days, with a nice
>> > message. I just checked and this would close ~ half of the pull
>> requests.
>> >
>> > For example:
>> >
>> > "Thank you for creating this pull request. Since this pull request has
>> been
>> > inactive for 30 days, we are automatically closing it. Closing the pull
>> > request does not remove it from history and will retain all the diff and
>> > review comments. If you have the bandwidth and would like to continue
>> > pushing this forward, please reopen it. Thanks again!"
>> >
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> 
>> For additional commands, e-mail: dev-h...@spark.apache.org
>> 
>>
>>
>

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Reynold Xin
Cody,

Thanks for commenting. "Inactive" here means no code pushes and no comments. So
any "ping" would actually keep the PR in the open queue. Getting
auto-closed also by no means indicates that the pull request can't be reopened.

On Mon, Apr 18, 2016 at 12:17 PM, Cody Koeninger  wrote:

> For what it's worth, I have definitely had PRs that sat inactive for
> more than 30 days due to committers not having time to look at them,
> but did eventually end up successfully being merged.
>
> I guess if this just ends up being a committer ping and reopening the
> PR, it's fine, but I don't know if it really addresses the underlying
> issue.
>
> On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin  wrote:
> > We have hit a new high in open pull requests: 469 today. While we can
> > certainly get more review bandwidth, many of these are old and still open
> > for other reasons. Some are stale because the original authors have
> become
> > busy and inactive, and some others are stale because the committers are
> not
> > sure whether the patch would be useful, but have not rejected the patch
> > explicitly. We can cut down the signal to noise ratio by closing pull
> > requests that have been inactive for greater than 30 days, with a nice
> > message. I just checked and this would close ~ half of the pull requests.
> >
> > For example:
> >
> > "Thank you for creating this pull request. Since this pull request has
> been
> > inactive for 30 days, we are automatically closing it. Closing the pull
> > request does not remove it from history and will retain all the diff and
> > review comments. If you have the bandwidth and would like to continue
> > pushing this forward, please reopen it. Thanks again!"
> >
> >
>


Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Ted Yu
I had one PR which got merged after 3 months.

If the inactivity was due to the contributor, I think it can be closed after 30
days.
But if the inactivity was due to lack of review, the PR should be kept open.

On Mon, Apr 18, 2016 at 12:17 PM, Cody Koeninger  wrote:

> For what it's worth, I have definitely had PRs that sat inactive for
> more than 30 days due to committers not having time to look at them,
> but did eventually end up successfully being merged.
>
> I guess if this just ends up being a committer ping and reopening the
> PR, it's fine, but I don't know if it really addresses the underlying
> issue.
>
> On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin  wrote:
> > We have hit a new high in open pull requests: 469 today. While we can
> > certainly get more review bandwidth, many of these are old and still open
> > for other reasons. Some are stale because the original authors have
> become
> > busy and inactive, and some others are stale because the committers are
> not
> > sure whether the patch would be useful, but have not rejected the patch
> > explicitly. We can cut down the signal to noise ratio by closing pull
> > requests that have been inactive for greater than 30 days, with a nice
> > message. I just checked and this would close ~ half of the pull requests.
> >
> > For example:
> >
> > "Thank you for creating this pull request. Since this pull request has
> been
> > inactive for 30 days, we are automatically closing it. Closing the pull
> > request does not remove it from history and will retain all the diff and
> > review comments. If you have the bandwidth and would like to continue
> > pushing this forward, please reopen it. Thanks again!"
> >
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


more uniform exception handling?

2016-04-18 Thread Reynold Xin
Josh's pull request on RPC
exception handling got me to think ...

In my experience, there have been a few things related to exceptions that
created a lot of trouble for us in production debugging:

1. Some exception is thrown, but is caught by some try/catch that does not
do any logging nor rethrow.
2. Some exception is thrown, but is caught by some try/catch that does not
do any logging but does rethrow. However, the original exception is now masked.
3. Multiple exceptions are logged at different places close to each other,
but we don't know whether they are caused by the same problem or not.


To mitigate some of the above, here's an idea ...

(1) Create a common root class for all the exceptions (e.g. call it
SparkException) used in Spark. We should make sure every time we catch an
exception from a 3rd party library, we rethrow them as SparkException (a
lot of places already do that). In SparkException's constructor, log the
exception and the stacktrace.

(2) SparkException has a monotonically increasing ID, and this ID appears
in the exception error message (say at the end).


I think (1) will eliminate most of the cases that an exception gets
swallowed. The main downside I can think of is we might log an exception
multiple times. However, I'd argue exceptions should be rare, and it is not
that big of a deal to log them twice or three times. The unique ID (2) can
help us correlate exceptions if they appear multiple times.

Thoughts?


Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Cody Koeninger
For what it's worth, I have definitely had PRs that sat inactive for
more than 30 days due to committers not having time to look at them,
but did eventually end up successfully being merged.

I guess if this just ends up being a committer ping and reopening the
PR, it's fine, but I don't know if it really addresses the underlying
issue.

On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin  wrote:
> We have hit a new high in open pull requests: 469 today. While we can
> certainly get more review bandwidth, many of these are old and still open
> for other reasons. Some are stale because the original authors have become
> busy and inactive, and some others are stale because the committers are not
> sure whether the patch would be useful, but have not rejected the patch
> explicitly. We can cut down the signal to noise ratio by closing pull
> requests that have been inactive for greater than 30 days, with a nice
> message. I just checked and this would close ~ half of the pull requests.
>
> For example:
>
> "Thank you for creating this pull request. Since this pull request has been
> inactive for 30 days, we are automatically closing it. Closing the pull
> request does not remove it from history and will retain all the diff and
> review comments. If you have the bandwidth and would like to continue
> pushing this forward, please reopen it. Thanks again!"
>
>




inter spark application communication

2016-04-18 Thread Soumitra Johri
Hi,

I have two applications : App1 and App2.
On a single cluster I have to spawn 5 instances of App1 and 1 instance of
App2.

What would be the best way to send data from the 5 App1 instances to the
single App2 instance?

Right now I am using Kafka to send data from one Spark application to the
other, but the setup doesn't seem right and I hope there is a
better way to do this.

Warm Regards
Soumitra


auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Reynold Xin
We have hit a new high in open pull requests: 469 today. While we can
certainly get more review bandwidth, many of these are old and still open
for other reasons. Some are stale because the original authors have become
busy and inactive, and some others are stale because the committers are not
sure whether the patch would be useful, but have not rejected the patch
explicitly. We can improve the signal-to-noise ratio by closing pull
requests that have been inactive for greater than 30 days, with a nice
message. I just checked and this would close ~ half of the pull requests.

For example:

"Thank you for creating this pull request. Since this pull request has been
inactive for 30 days, we are automatically closing it. Closing the pull
request does not remove it from history and will retain all the diff and
review comments. If you have the bandwidth and would like to continue
pushing this forward, please reopen it. Thanks again!"


More elaborate toString for StreamExecution?

2016-04-18 Thread Jacek Laskowski
Hi,

I'd love to have a more elaborate toString for StreamExecution:

scala> sqlContext.streams.active.foreach(println)
Continuous Query - memStream [state = ACTIVE]
Continuous Query - hello2 [state = ACTIVE]
Continuous Query - hello [state = ACTIVE]

Any work in this area? The trigger is something it could include.
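
For illustration, here is a standalone sketch of the kind of output a richer
toString could produce. The StreamExecutionInfo case class and the field
values below are purely illustrative, not Spark APIs or actual internals of
StreamExecution.

// StreamExecutionInfo mimics the fields a richer toString might expose
case class StreamExecutionInfo(
    name: String,
    state: String,
    trigger: String,
    sources: Seq[String],
    sink: String) {

  override def toString: String =
    s"Continuous Query - $name [state = $state, trigger = $trigger, " +
      s"sources = ${sources.mkString(", ")}, sink = $sink]"
}

object ToStringDemo extends App {
  // prints something like:
  // Continuous Query - memStream [state = ACTIVE, trigger = ProcessingTime(10 seconds), ...]
  println(StreamExecutionInfo(
    "memStream", "ACTIVE", "ProcessingTime(10 seconds)",
    Seq("MemoryStream[value#1]"), "MemorySink"))
}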

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski




Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-18 Thread Luciano Resende
Evan,

As long as you meet the criteria we discussed on this thread, you are
welcome to join.

Having said that, I have already seen other contributors who are very
active on some of the connectors but are not Apache committers yet. I
wanted to be fair, and also to avoid using the project as an avenue to bring
new committers to Apache.


On Sun, Apr 17, 2016 at 10:07 PM, Evan Chan  wrote:

> Hi Luciano,
>
> I see that you are inviting all the Spark committers to this new project.
> What about the chief maintainers of important Spark ecosystem projects,
> which are not on the Spark PMC?
>
> For example, I am the chief maintainer of the Spark Job Server, which is
> one of the most active projects in the larger Spark ecosystem.  Would
> projects like this be part of your vision?   If so, it would be a good step
> of faith to reach out to us that maintain the active ecosystem projects.
>  (I’m not saying you should put me in :)  but rather suggesting that if
> this is your aim, it would be good to reach out beyond just the Spark PMC
> members.
>
> thanks,
> Evan
>
> On Apr 17, 2016, at 9:16 AM, Luciano Resende  wrote:
>
>
>
> On Sat, Apr 16, 2016 at 11:12 PM, Reynold Xin  wrote:
>
>> First, really thank you for leading the discussion.
>>
>> I am concerned that it'd hurt Spark more than it helps. As many others
>> have pointed out, this unnecessarily creates a new tier of connectors or
>> 3rd party libraries appearing to be endorsed by the Spark PMC or the ASF.
>> We can alleviate this concern by not having "Spark" in the name, and the
>> project proposal and documentation should label clearly that this is not
>> affiliated with Spark.
>>
>
> I really thought we could use the Spark name (e.g. similar to
> spark-packages), since this project is aligned with and dedicated to curating
> extensions to Apache Spark; that's why we were inviting Spark PMC
> members to join the new project PMC, so that Apache Spark has the necessary
> oversight and influence on the project direction. I understand folks have
> concerns with the name, and thus we will start looking into name
> alternatives unless there is some way I can address the community's concerns
> around this.
>
>
>>
>> Also Luciano - assuming you are interested in creating a project like
>> this and find a home for the connectors that were removed, I find it
>> surprising that few of the initially proposed PMC members have actually
>> contributed much to the connectors, and people that have contributed a lot
>> were left out. I am sure that is just an oversight.
>>
>>
> Reynold, thanks for your concern; we are not leaving anyone out. We used
> the following criteria to identify the initial PMC/committer list, as described
> in the first e-mail on this thread:
>
>- Spark Committers and Apache Members can request to participate as PMC
> members
>- All active Spark committers (committed in the last year) will
> have write access to the project (committer access)
>- Other committers can request to become committers.
>- Non committers would be added based on meritocracy after the start of
> the project.
>
> Based on these criteria, everyone who has expressed interest in joining
> the project PMC has been added to it, but I don't feel comfortable adding
> names to it at will. I have updated the list of committers, and
> currently we have the following on the draft proposal:
>
>
> Initial PMC
>
>
>- Luciano Resende (lresende AT apache DOT org) (Apache Member)
>- Chris Mattmann (mattmann  AT apache DOT org) (Apache Member, Apache
>board member)
>- Steve Loughran (stevel AT apache DOT org) (Apache Member)
>- Jean-Baptiste Onofré (jbonofre  AT apache DOT org) (Apache Member)
>- Marcelo Masiero Vanzin (vanzin AT apache DOT org) (Apache Spark
>committer)
>- Sean R. Owen (srowen AT apache DOT org) (Apache Member and Spark PMC)
>- Mridul Muralidharan (mridulm80 AT apache DOT org) (Apache Spark PMC)
>
>
> Initial Committers (write access for active Spark committers who have
> committed in the last year)
>
>
>- Andy Konwinski (andrew AT apache DOT org) (Apache Spark)
>- Andrew Or (andrewor14 AT apache DOT org) (Apache Spark)
>- Ankur Dave (ankurdave AT apache DOT org) (Apache Spark)
>- Davies Liu (davies AT apache DOT org) (Apache Spark)
>- DB Tsai (dbtsai AT apache DOT org) (Apache Spark)
>- Haoyuan Li (haoyuan AT apache DOT org) (Apache Spark)
>- Ram Sriharsha (harsha AT apache DOT org) (Apache Spark)
>- Herman van Hövell (hvanhovell AT apache DOT org) (Apache Spark)
>- Imran Rashid (irashid AT apache DOT org) (Apache Spark)
>- Joseph Kurata Bradley (jkbradley AT apache DOT org) (Apache Spark)
>- Josh Rosen (joshrosen AT apache DOT org) (Apache Spark)
>- Kay Ousterhout (kayousterhout AT apache DOT org) (Apache Spark)
>- Cheng Lian (lian AT apache DOT org) (Apache Spark)
>- Mark Hamstra (markhamstra AT 

Re: Implicit from ProcessingTime to scala.concurrent.duration.Duration?

2016-04-18 Thread Reynold Xin
Nope. It is unclear whether they would be useful enough or not. But when
designing APIs we always need to anticipate future changes.

On Monday, April 18, 2016, Jacek Laskowski  wrote:

> When you say "in the future", do you have any specific timeframe in
> mind? You got me curious :)
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Mon, Apr 18, 2016 at 7:44 PM, Reynold Xin  > wrote:
> > The problem with this is that we might introduce event time based
> trigger in
> > the future, and then it would be more confusing...
> >
> >
> > On Monday, April 18, 2016, Jacek Laskowski  > wrote:
> >>
> >> Hi,
> >>
> >> While working with structured streaming (aka SparkSQL Streams :)) I
> >> thought about adding
> >>
> >> implicit def toProcessingTime(duration: Duration) =
> >> ProcessingTime(duration)
> >>
> >> What do you think?
> >>
> >> I think it'd improve the API:
> >>
> >> .trigger(ProcessingTime(10 seconds))
> >>
> >> vs
> >>
> >> .trigger(10 seconds)
> >>
> >> (since it's not a release feature I didn't mean to file an issue in
> >> JIRA - please guide if needed).
> >>
> >> Pozdrawiam,
> >> Jacek Laskowski
> >> 
> >> https://medium.com/@jaceklaskowski/
> >> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> >> Follow me at https://twitter.com/jaceklaskowski
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> >> For additional commands, e-mail: dev-h...@spark.apache.org
> 
> >>
> >
>


Re: SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-18 Thread Andrew Ray
While you can't automatically push the limit *through* the join, we could
push it *into* the join (stop processing after generating 10 records). I
believe that is what Rajesh is suggesting.

On Tue, Apr 12, 2016 at 7:46 AM, Herman van Hövell tot Westerflier <
hvanhov...@questtec.nl> wrote:

> I am not sure if you can push a limit through a join. This becomes
> problematic if not all keys are present on both sides; in such a case a
> limit can produce fewer rows than the set limit.
>
> This might be a rare case in which whole stage codegen is slower, due to
> the fact that we need to buffer the result of such a stage. You could try
> to disable it by setting "spark.sql.codegen.wholeStage" to false.
>
> 2016-04-12 14:32 GMT+02:00 Rajesh Balamohan :
>
>> Hi,
>>
>> I ran the following query in spark (latest master codebase) and it took a
>> lot of time to complete even though it was a broadcast hash join.
>>
>> It appears that limit computation is done only after computing complete
>> join condition.  Shouldn't the limit condition be pushed to
>> BroadcastHashJoin (wherein it would have to stop processing after
>> generating 10 rows?).  Please let me know if my understanding on this is
>> wrong.
>>
>>
>> select l_partkey from lineitem, partsupp where ps_partkey=l_partkey limit
>> 10;
>>
>> 
>> | == Physical Plan ==
>> CollectLimit 10
>> +- WholeStageCodegen
>>:  +- Project [l_partkey#893]
>>: +- BroadcastHashJoin [l_partkey#893], [ps_partkey#908], Inner,
>> BuildRight, None
>>::- Project [l_partkey#893]
>>::  +- Filter isnotnull(l_partkey#893)
>>:: +- Scan HadoopFiles[l_partkey#893] Format: ORC,
>> PushedFilters: [IsNotNull(l_partkey)], ReadSchema: struct
>>:+- INPUT
>>+- BroadcastExchange
>> HashedRelationBroadcastMode(true,List(cast(ps_partkey#908 as
>> bigint)),List(ps_partkey#908))
>>   +- WholeStageCodegen
>>  :  +- Project [ps_partkey#908]
>>  : +- Filter isnotnull(ps_partkey#908)
>>  :+- Scan HadoopFiles[ps_partkey#908] Format: ORC,
>> PushedFilters: [IsNotNull(ps_partkey)], ReadSchema: struct
>>  |
>> 
>>
>>
>>
>>
>> --
>> ~Rajesh.B
>>
>
>
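
For anyone wanting to try Herman's suggestion above of disabling whole-stage
codegen, a minimal sketch is below. It assumes a sqlContext as provided by the
spark-shell of that era; adjust to your own entry point.

// disable whole-stage codegen for the current session, as suggested above
sqlContext.setConf("spark.sql.codegen.wholeStage", "false")

// or, equivalently, via a SQL SET command
sqlContext.sql("SET spark.sql.codegen.wholeStage=false")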


Re: Implicit from ProcessingTime to scala.concurrent.duration.Duration?

2016-04-18 Thread Jacek Laskowski
When you say "in the future", do you have any specific timeframe in
mind? You got me curious :)

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Mon, Apr 18, 2016 at 7:44 PM, Reynold Xin  wrote:
> The problem with this is that we might introduce event time based trigger in
> the future, and then it would be more confusing...
>
>
> On Monday, April 18, 2016, Jacek Laskowski  wrote:
>>
>> Hi,
>>
>> While working with structured streaming (aka SparkSQL Streams :)) I
>> thought about adding
>>
>> implicit def toProcessingTime(duration: Duration) =
>> ProcessingTime(duration)
>>
>> What do you think?
>>
>> I think it'd improve the API:
>>
>> .trigger(ProcessingTime(10 seconds))
>>
>> vs
>>
>> .trigger(10 seconds)
>>
>> (since it's not a release feature I didn't mean to file an issue in
>> JIRA - please guide if needed).
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>




Re: Implicit from ProcessingTime to scala.concurrent.duration.Duration?

2016-04-18 Thread Reynold Xin
The problem with this is that we might introduce event time based trigger
in the future, and then it would be more confusing...

On Monday, April 18, 2016, Jacek Laskowski  wrote:

> Hi,
>
> While working with structured streaming (aka SparkSQL Streams :)) I
> thought about adding
>
> implicit def toProcessingTime(duration: Duration) =
> ProcessingTime(duration)
>
> What do you think?
>
> I think it'd improve the API:
>
> .trigger(ProcessingTime(10 seconds))
>
> vs
>
> .trigger(10 seconds)
>
> (since it's not a release feature I didn't mean to file an issue in
> JIRA - please guide if needed).
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> For additional commands, e-mail: dev-h...@spark.apache.org 
>
>


Re: [build system] issue w/jenkins

2016-04-18 Thread shane knapp
somehow DNS, internal to berkeley, got borked and the redirect failed.
we've hard-coded some entries into /etc/hosts, re-ordered our
nameservers, and are still trying to figure out what happened.

anyways, we're back:
https://amplab.cs.berkeley.edu/jenkins/

On Mon, Apr 18, 2016 at 10:22 AM, shane knapp  wrote:
> for now, you can log in to jenkins by ignoring the http reverse proxy:
> https://hadrian.ist.berkeley.edu/jenkins/
>
> this still doesn't allow for things like the pull request builder and
> whatnot to run...  i'm still digging in to this.
>
> thanks,
>
> shane
>
> On Mon, Apr 18, 2016 at 10:02 AM, shane knapp  wrote:
>> right now it looks like we're having problems connecting to jenkins
>> through our firewall.  i'm currently looking into this and will let
>> everyone know immediately when it's been resolved.
>>
>> thanks in advance for your patience...
>>
>> shane




Implicit from ProcessingTime to scala.concurrent.duration.Duration?

2016-04-18 Thread Jacek Laskowski
Hi,

While working with structured streaming (aka SparkSQL Streams :)) I
thought about adding

implicit def toProcessingTime(duration: Duration) = ProcessingTime(duration)

What do you think?

I think it'd improve the API:

.trigger(ProcessingTime(10 seconds))

vs

.trigger(10 seconds)

(since it's not a release feature I didn't mean to file an issue in
JIRA - please guide if needed).
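
In case it helps the discussion, here is a self-contained sketch of how the
conversion could be packaged and brought into scope. The StreamingImplicits
object, the import paths, and the call-site shown in comments are assumptions
for illustration; only the implicit def itself is the actual proposal.

import scala.concurrent.duration._
import scala.language.implicitConversions

// the package location of ProcessingTime at the time is an assumption here
import org.apache.spark.sql.ProcessingTime

object StreamingImplicits {
  // the proposed conversion itself
  implicit def toProcessingTime(duration: Duration): ProcessingTime =
    ProcessingTime(duration)
}

// at a call site (writer API names of the time, shown only as a comment):
// import StreamingImplicits._
// df.write.trigger(10.seconds)   // converted implicitly to ProcessingTime(10.seconds)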

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski




Re: [build system] issue w/jenkins

2016-04-18 Thread shane knapp
for now, you can log in to jenkins by ignoring the http reverse proxy:
https://hadrian.ist.berkeley.edu/jenkins/

this still doesn't allow for things like the pull request builder and
whatnot to run...  i'm still digging into this.

thanks,

shane

On Mon, Apr 18, 2016 at 10:02 AM, shane knapp  wrote:
> right now it looks like we're having problems connecting to jenkins
> through our firewall.  i'm currently looking into this and will let
> everyone know immediately when it's been resolved.
>
> thanks in advance for your patience...
>
> shane




Re: BytesToBytes and unaligned memory

2016-04-18 Thread Ted Yu
bq. run the tests claiming to require unaligned memory access on a platform
where unaligned memory access is definitely not supported for
shorts/ints/longs.

That would help us understand the interactions on the s390x platform better.

On Mon, Apr 18, 2016 at 6:49 AM, Adam Roberts  wrote:

> Ted, yes with the forced true value all tests pass, we use the unaligned
> check in 15 other suites.
>
> Our java.nio.Bits.unaligned() function checks that the detected os.arch
> value matches a list of known implementations (not including s390x).
>
> We could add it to the known architectures in the catch block but this
> won't make a difference here as because we call unaligned() OK (no
> exception is thrown), we don't reach the architecture checking stage anyway.
>
> I see in org.apache.spark.memory.MemoryManager that unaligned support is
> required for off-heap memory in Tungsten (perhaps incorrectly if no code
> ever exercises it in Spark?). Instead of having a requirement should we
> instead log a warning once that this is likely to lead to slow performance?
> What's the rationale for supporting unaligned memory access: it's my
> understanding that it's typically very slow, are there any design docs or
> perhaps a JIRA where I can learn more?
>
> Will run a simple test case exercising unaligned memory access for Linux
> on Z (without using Spark) and can also run the tests claiming to require
> unaligned memory access on a platform where unaligned memory access is
> definitely not supported for shorts/ints/longs.
>
> if these tests continue to pass then I think the Spark tests don't
> exercise unaligned memory access, cheers
>
>
>
>
>
>
>
> From:Ted Yu 
> To:Adam Roberts/UK/IBM@IBMGB
> Cc:"dev@spark.apache.org" 
> Date:15/04/2016 17:35
> Subject:Re: BytesToBytes and unaligned memory
> --
>
>
>
> I am curious if all Spark unit tests pass with the forced true value for
> unaligned.
> If that is the case, it seems we can add s390x to the known architectures.
>
> It would also give us some more background if you can describe
> how java.nio.Bits#unaligned() is implemented on s390x.
>
> Josh / Andrew / Davies / Ryan are more familiar with related code. It
> would be good to hear what they think.
>
> Thanks
>
> On Fri, Apr 15, 2016 at 8:47 AM, Adam Roberts <*arobe...@uk.ibm.com*
> > wrote:
> Ted, yeah with the forced true value the tests in that suite all pass and
> I know they're being executed thanks to prints I've added
>
> Cheers,
>
>
>
>
> From:Ted Yu <*yuzhih...@gmail.com* >
> To:Adam Roberts/UK/IBM@IBMGB
> Cc:"*dev@spark.apache.org* " <
> *dev@spark.apache.org* >
> Date:15/04/2016 16:43
> Subject:Re: BytesToBytes and unaligned memory
> --
>
>
>
> Can you clarify whether BytesToBytesMapOffHeapSuite passed or failed with
> the forced true value for unaligned ?
>
> If the test failed, please pastebin the failure(s).
>
> Thanks
>
> On Fri, Apr 15, 2016 at 8:32 AM, Adam Roberts <*arobe...@uk.ibm.com*
> > wrote:
> Ted, yep I'm working from the latest code which includes that unaligned
> check, for experimenting I've modified that code to ignore the unaligned
> check (just go ahead and say we support it anyway, even though our JDK
> returns false: the return value of java.nio.Bits.unaligned()).
>
> My Platform.java for testing contains:
>
> private static final boolean unaligned;
>
> static {
>   boolean _unaligned;
>   // use reflection to access unaligned field
>   try {
> * System.out.println("Checking unaligned support");*
> Class bitsClass =
>   Class.forName("java.nio.Bits", false,
> ClassLoader.getSystemClassLoader());
> Method unalignedMethod = bitsClass.getDeclaredMethod("unaligned");
> unalignedMethod.setAccessible(true);
> _unaligned = Boolean.TRUE.equals(unalignedMethod.invoke(null));
> *System.out.println("Used reflection and _unaligned is: " +
> _unaligned);*
> * System.out.println("Setting to true anyway for experimenting");*
> * _unaligned = true;*
> } catch (Throwable t) {
>   // We at least know x86 and x64 support unaligned access.
>   String arch = System.getProperty("os.arch", "");
>   //noinspection DynamicRegexReplaceableByCompiledPattern
> *   // We don't actually get here since we find the unaligned method
> OK and it returns false (I override with true anyway)*
> *   // but add s390x incase we somehow fail anyway.*
> *   System.out.println("Checking for s390x, os.arch is: " + arch);*
> *   _unaligned =
> arch.matches("^(i[3-6]86|x86(_64)?|x64|s390x|amd64)$");*
> }
> unaligned = _unaligned;
> * System.out.println("returning: " + unaligned);*
>   }
> }
>
> Output is, as you'd expect, "used reflection and _unaligned is 

Re: BytesToBytes and unaligned memory

2016-04-18 Thread Adam Roberts
Ted, yes, with the forced true value all tests pass; we use the unaligned
check in 15 other suites.

Our java.nio.Bits.unaligned() function checks that the detected os.arch 
value matches a list of known implementations (not including s390x).

We could add it to the known architectures in the catch block, but that
won't make a difference here: because we call unaligned() successfully (no
exception is thrown), we never reach the architecture-checking stage
anyway.

I see in org.apache.spark.memory.MemoryManager that unaligned support is 
required for off-heap memory in Tungsten (perhaps incorrectly if no code 
ever exercises it in Spark?). Instead of making this a hard requirement, should we
instead log a warning (once) that this is likely to lead to slow
performance? What's the rationale for the unaligned memory access requirement?
It's my understanding that unaligned access is typically very slow; are there any design
docs or perhaps a JIRA where I can learn more?

Will run a simple test case exercising unaligned memory access for Linux 
on Z (without using Spark) and can also run the tests claiming to require 
unaligned memory access on a platform where unaligned memory access is 
definitely not supported for shorts/ints/longs. 
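
For reference, a standalone check of that sort could look roughly like the
following. This is a sketch only, not the actual test being run; it drives
sun.misc.Unsafe directly, and on platforms without unaligned support it may
crash (e.g. SIGBUS) or be noticeably slow rather than fail cleanly.

import sun.misc.Unsafe

object UnalignedAccessCheck {
  def main(args: Array[String]): Unit = {
    // grab the Unsafe singleton via reflection (it is not publicly constructible)
    val field = classOf[Unsafe].getDeclaredField("theUnsafe")
    field.setAccessible(true)
    val unsafe = field.get(null).asInstanceOf[Unsafe]

    val address = unsafe.allocateMemory(16)
    try {
      // deliberately misaligned 8-byte write/read at address + 1
      unsafe.putLong(address + 1, 0x0123456789abcdefL)
      println(f"read back: 0x${unsafe.getLong(address + 1)}%016x")
    } finally {
      unsafe.freeMemory(address)
    }
  }
}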

If these tests continue to pass, then I think the Spark tests don't
actually exercise unaligned memory access. Cheers,


 




From:   Ted Yu 
To: Adam Roberts/UK/IBM@IBMGB
Cc: "dev@spark.apache.org" 
Date:   15/04/2016 17:35
Subject:Re: BytesToBytes and unaligned memory



I am curious if all Spark unit tests pass with the forced true value for 
unaligned.
If that is the case, it seems we can add s390x to the known architectures.

It would also give us some more background if you can describe 
how java.nio.Bits#unaligned() is implemented on s390x.

Josh / Andrew / Davies / Ryan are more familiar with related code. It 
would be good to hear what they think.

Thanks

On Fri, Apr 15, 2016 at 8:47 AM, Adam Roberts  wrote:
Ted, yeah with the forced true value the tests in that suite all pass and 
I know they're being executed thanks to prints I've added 

Cheers, 




From:Ted Yu  
To:Adam Roberts/UK/IBM@IBMGB 
Cc:"dev@spark.apache.org"  
Date:15/04/2016 16:43 
Subject:Re: BytesToBytes and unaligned memory 



Can you clarify whether BytesToBytesMapOffHeapSuite passed or failed with 
the forced true value for unaligned ? 

If the test failed, please pastebin the failure(s). 

Thanks 

On Fri, Apr 15, 2016 at 8:32 AM, Adam Roberts  wrote: 

Ted, yep I'm working from the latest code which includes that unaligned 
check, for experimenting I've modified that code to ignore the unaligned 
check (just go ahead and say we support it anyway, even though our JDK 
returns false: the return value of java.nio.Bits.unaligned()). 

My Platform.java for testing contains: 

private static final boolean unaligned; 

static { 
  boolean _unaligned; 
  // use reflection to access unaligned field 
  try { 
System.out.println("Checking unaligned support"); 
Class bitsClass = 
  Class.forName("java.nio.Bits", false, 
ClassLoader.getSystemClassLoader()); 
Method unalignedMethod = bitsClass.getDeclaredMethod("unaligned"); 
unalignedMethod.setAccessible(true); 
_unaligned = Boolean.TRUE.equals(unalignedMethod.invoke(null)); 
System.out.println("Used reflection and _unaligned is: " + 
_unaligned); 
System.out.println("Setting to true anyway for experimenting"); 
_unaligned = true; 
} catch (Throwable t) { 
  // We at least know x86 and x64 support unaligned access. 
  String arch = System.getProperty("os.arch", ""); 
  //noinspection DynamicRegexReplaceableByCompiledPattern 
  // We don't actually get here since we find the unaligned method OK 
and it returns false (I override with true anyway) 
  // but add s390x in case we somehow fail anyway. 
  System.out.println("Checking for s390x, os.arch is: " + arch); 
  _unaligned = arch.matches("^(i[3-6]86|x86(_64)?|x64|s390x|amd64)$"); 

} 
unaligned = _unaligned; 
System.out.println("returning: " + unaligned); 
  } 
} 

Output is, as you'd expect, "used reflection and _unaligned is false, 
setting to true anyway for experimenting", and the tests pass. 

No other problems on the platform (pending a different pull request). 

Cheers, 







From:Ted Yu  
To:Adam Roberts/UK/IBM@IBMGB 
Cc:"dev@spark.apache.org"  
Date:15/04/2016 15:32 
Subject:Re: BytesToBytes and unaligned memory 




I assume you tested 2.0 with SPARK-12181 . 

Related code from Platform.java if java.nio.Bits#unaligned() throws 
exception: 

  // We at least know x86 and x64 support unaligned access. 
  String arch = System.getProperty("os.arch", ""); 
  //noinspection 

Re: Code freeze?

2016-04-18 Thread Sean Owen
FWIW, here's what I do to look at JIRA's answer to this:

1) Go download http://almworks.com/jiraclient/overview.html
2) Set up a query for "target = 2.0.0 and status = Open, In Progress, Reopened"
3) Set up sub-queries for bugs vs non-bugs, and for critical, blocker and other

Right now there are 172 issues open for 2.0.0. 40 are bugs, 4 of which
are critical and 1 of which is a blocker. 9 non-bugs are blockers, 5
critical.

JIRA info is inevitably noisy, but now is a good time to make this
info meaningful so we have some shared reference about the short-term
plan.

What I suggest we do now is ...

a) un-target anything that wasn't targeted to 2.0.0 by a committer
b) committers un-target or re-target anything they know isn't that
important for 2.0.0 (thanks jkbradley)
c) focus on bugs > features, high priority > low priority this week
d) see where we are next week, repeat

I suggest we simply have "no blockers" as an exit criteria, with a
strong pref for "no critical bugs either".

It's a major release, so taking a little extra time to get it all done
comfortably is both possible and unusually important. A couple weeks
indeed might be realistic for an RC, but it really depends on burndown
more than anything.

On Mon, Apr 18, 2016 at 8:23 AM, Pete Robbins  wrote:
> Is there a list of Jiras to be considered for 2.0? I would really like to
> get https://issues.apache.org/jira/browse/SPARK-13745 in so that Big Endian
> platforms are not broken.
>
> Cheers,
>
> On Wed, 13 Apr 2016 at 08:51 Reynold Xin  wrote:
>>
>> I think the main things are API things that we need to get right.
>>
>> - Implement essential DDLs
>> https://issues.apache.org/jira/browse/SPARK-14118  this blocks the next one
>>
>> - Merge HiveContext and SQLContext and create SparkSession
>> https://issues.apache.org/jira/browse/SPARK-13485
>>
>> - Separate out local linear algebra as a standalone module without Spark
>> dependency https://issues.apache.org/jira/browse/SPARK-13944
>>
>> - Run Spark without assembly jars (mostly done?)
>>
>>
>> Probably realistic to have it in ~ 2 weeks.
>>
>>
>>
>> On Wed, Apr 13, 2016 at 12:45 AM, Sean Owen  wrote:
>>>
>>> I've heard several people refer to a code freeze for 2.0. Unless I missed
>>> it, nobody has discussed a particular date for this:
>>> https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage
>>>
>>> I'd like to start with a review of JIRAs before anyone decides a freeze
>>> is appropriate. There are hundreds of issues, some blockers, still targeted
>>> for 2.0. Probably best for everyone to review and retarget non essentials
>>> and then see where we are at?
>>
>>
>




Re: Code freeze?

2016-04-18 Thread Pete Robbins
Is there a list of Jiras to be considered for 2.0? I would really like to
get https://issues.apache.org/jira/browse/SPARK-13745 in so that Big Endian
platforms are not broken.

Cheers,

On Wed, 13 Apr 2016 at 08:51 Reynold Xin  wrote:

> I think the main things are API things that we need to get right.
>
> - Implement essential DDLs
> https://issues.apache.org/jira/browse/SPARK-14118  this blocks the next
> one
>
> - Merge HiveContext and SQLContext and create SparkSession
> https://issues.apache.org/jira/browse/SPARK-13485
>
> - Separate out local linear algebra as a standalone module without Spark
> dependency https://issues.apache.org/jira/browse/SPARK-13944
>
> - Run Spark without assembly jars (mostly done?)
>
>
> Probably realistic to have it in ~ 2 weeks.
>
>
>
> On Wed, Apr 13, 2016 at 12:45 AM, Sean Owen  wrote:
>
>> I've heard several people refer to a code freeze for 2.0. Unless I missed
>> it, nobody has discussed a particular date for this:
>> https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage
>>
>> I'd like to start with a review of JIRAs before anyone decides a freeze
>> is appropriate. There are hundreds of issues, some blockers, still targeted
>> for 2.0. Probably best for everyone to review and retarget non essentials
>> and then see where we are at?
>>
>
>